BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
February 6, 2025
Authors: Bo Pang, Hanze Dong, Jiacheng Xu, Silvio Savarese, Yingbo Zhou, Caiming Xiong
cs.AI
Abstract
Large language models (LLMs), such as o1 from OpenAI, have demonstrated
remarkable reasoning capabilities. o1 generates a long chain-of-thought
(LongCoT) before answering a question. LongCoT allows LLMs to analyze problems,
devise plans, reflect, and backtrack effectively. These actions empower LLMs to
solve complex problems. After the release of o1, many teams have attempted to
replicate its LongCoT and reasoning capabilities. In terms of methods, they
primarily rely on knowledge distillation with data from existing models with
LongCoT capacities (e.g., OpenAI-o1, Qwen-QwQ, DeepSeek-R1-Preview), leaving
significant uncertainty about how to systematically develop such reasoning
abilities. In terms of data domains, these works focus narrowly on math, while a
few others include coding, limiting their generalizability. This paper
introduces a novel approach to enable LLMs' LongCoT capacity without
distillation from o1-like models or expensive human annotations, where we
bootstrap LongCoT (BOLT) from a standard instruct model. BOLT involves three
stages: 1) LongCoT data bootstrapping with in-context learning on a standard
instruct model; 2) LongCoT supervised finetuning; 3) online training to further
refine LongCoT capacities. In BOLT, only a few in-context examples need to be
constructed during the bootstrapping stage; in our experiments, we created 10
examples, demonstrating the feasibility of this approach. We use
Llama-3.1-70B-Instruct to bootstrap LongCoT and apply our method to various
model scales (7B, 8B, 70B). We achieve impressive performance on a variety of
benchmarks, including Arena-Hard, MT-Bench, WildBench, ZebraLogic, and MATH500,
which evaluate diverse task-solving and reasoning capabilities.
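As a rough illustration of the first stage described in the abstract, the sketch below shows how a handful of hand-written LongCoT demonstrations could be used to elicit LongCoT traces from a standard instruct model (e.g., Llama-3.1-70B-Instruct) via in-context learning. The prompt format, the `chat` helper, and the filtering step are assumptions for illustration, not the paper's exact recipe.

```python
# Minimal sketch of BOLT stage 1 (LongCoT bootstrapping via in-context learning).
# Assumptions: a generic `chat(prompt) -> str` helper that wraps a standard
# instruct model; the exact prompt wording, demonstration format, and filtering
# criteria used in the paper may differ.

from typing import Callable, Dict, List

# A few hand-crafted demonstrations: each pairs a query with a long
# chain-of-thought (analysis, planning, reflection) and a final answer.
DEMOS: List[Dict[str, str]] = [
    {
        "query": "Is 391 prime?",
        "long_cot": (
            "Let me analyze the problem. I need to test divisibility by primes "
            "up to sqrt(391), which is about 19.8. Trying 17: 391 / 17 = 23, "
            "so 391 = 17 * 23. Let me double-check: 17 * 23 = 391. Confirmed."
        ),
        "answer": "No, 391 = 17 x 23, so it is not prime.",
    },
    # ... the paper constructs roughly 10 such demonstrations by hand.
]


def build_prompt(query: str) -> str:
    """Format the in-context demonstrations followed by the new query."""
    parts = []
    for d in DEMOS:
        parts.append(
            f"Question: {d['query']}\n"
            f"<long_cot>\n{d['long_cot']}\n</long_cot>\n"
            f"Answer: {d['answer']}\n"
        )
    parts.append(f"Question: {query}\n<long_cot>\n")
    return "\n".join(parts)


def bootstrap_longcot(
    queries: List[str],
    chat: Callable[[str], str],  # hypothetical model-call helper
) -> List[Dict[str, str]]:
    """Collect (query, LongCoT response) pairs for later supervised finetuning."""
    dataset = []
    for q in queries:
        completion = chat(build_prompt(q))
        # Keep only completions that follow the demonstrated format; a real
        # pipeline would also filter on response quality.
        if "</long_cot>" in completion:
            dataset.append({"query": q, "response": completion})
    return dataset
```

The resulting (query, LongCoT response) pairs would then feed the second stage (LongCoT supervised finetuning), with the third stage refining the behavior through online training.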