BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation

Abstract

Large language models (LLMs), such as OpenAI's o1, have demonstrated remarkable reasoning capabilities. o1 generates a long chain-of-thought (LongCoT) before answering a question. LongCoT allows LLMs to analyze problems, devise plans, reflect, and backtrack effectively, and these behaviors enable LLMs to solve complex problems. Since the release of o1, many teams have attempted to replicate its LongCoT and reasoning capabilities. In terms of methods, they primarily rely on knowledge distillation with data from existing models that have LongCoT capabilities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), which leaves significant uncertainty about how to develop such reasoning abilities systematically. In terms of data domains, these works focus narrowly on math, with a few also covering coding, which limits their generalizability. This paper introduces a novel approach that enables LongCoT capability in LLMs without distillation from o1-like models or expensive human annotation: we bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three stages: 1) LongCoT data bootstrapping with in-context learning on a standard instruct model; 2) LongCoT supervised finetuning; 3) online training to further refine LongCoT capabilities. In BOLT, only a few in-context examples need to be constructed during the bootstrapping stage; in our experiments, we created 10 examples, demonstrating the feasibility of this approach. We use Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to models at various scales (7B, 8B, 70B). We achieve strong performance on a variety of benchmarks (Arena-Hard, MT-Bench, WildBench, ZebraLogic, and MATH500) that evaluate diverse task-solving and reasoning capabilities.
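As a rough illustration of the first stage (LongCoT data bootstrapping with in-context learning), the sketch below assembles a few-shot prompt from a handful of hand-written demonstrations and leaves the reasoning slot open for a new query. The abstract does not specify the prompt format, so the `InContextExample` structure, the `Question:` / `Thoughts:` / `Answer:` labels, and the function names are all assumptions made for illustration, not the authors' implementation. The resulting string would be sent to an instruct model such as Llama-3.1-70B-Instruct; that call is omitted here.

```python
from dataclasses import dataclass


@dataclass
class InContextExample:
    """One hand-written demonstration (hypothetical structure, not from the paper)."""
    question: str
    long_cot: str  # free-form reasoning: analysis, planning, reflection, backtracking
    answer: str


def format_example(ex: InContextExample) -> str:
    """Render one demonstration using assumed field labels."""
    return (
        f"Question: {ex.question}\n"
        f"Thoughts: {ex.long_cot}\n"
        f"Answer: {ex.answer}\n"
    )


def build_bootstrap_prompt(examples: list[InContextExample], query: str) -> str:
    """Concatenate the demonstrations, then leave the thoughts slot open for the query."""
    if not examples:
        raise ValueError("at least one in-context example is required")
    shots = "\n".join(format_example(ex) for ex in examples)
    return f"{shots}\nQuestion: {query}\nThoughts:"


if __name__ == "__main__":
    demos = [
        InContextExample(
            question="Is 91 prime?",
            long_cot="Try small divisors. 91 / 7 = 13, so it factors. Double-check: 7 * 13 = 91.",
            answer="No, 91 = 7 * 13.",
        ),
    ]
    print(build_bootstrap_prompt(demos, "Is 97 prime?"))
```

In this sketch the model's completion after the final `Thoughts:` would be collected as a candidate LongCoT sample for the supervised finetuning stage; any filtering or quality checks on those samples are outside what the abstract describes.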
