

LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!

February 11, 2025
Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
cs.AI

Abstract

Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements needed to elicit Long CoT remain poorly understood. In this work, we find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k Long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive with the proprietary o1-preview model's scores of 44.6% and 59.1%, respectively. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little effect on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers achieves only 3.2% lower accuracy than one trained on fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper accompanying our previously released Sky-T1-32B-Preview model. Code is available at https://github.com/NovaSky-AI/SkyThought.
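The content-versus-structure perturbations described above can be illustrated with a minimal sketch. The function names, the keyword list, and the toy reasoning trace below are invented for illustration; the paper applies such perturbations to its 17k Long CoT training samples rather than to a three-line example.

```python
import random

# Assumed keyword set for the "removing reasoning keywords" content perturbation;
# the paper's actual keyword list is not reproduced here.
REASONING_KEYWORDS = {"wait", "alternatively", "however", "therefore"}

def remove_keywords(steps):
    """Content perturbation: strip reasoning keywords from each step.
    The paper reports this kind of change has little effect on accuracy."""
    cleaned = []
    for step in steps:
        words = [w for w in step.split()
                 if w.lower().strip(",.!") not in REASONING_KEYWORDS]
        cleaned.append(" ".join(words))
    return cleaned

def shuffle_steps(steps, seed=0):
    """Structural perturbation: reorder the steps, breaking the logical
    flow of the chain. The paper reports this significantly degrades accuracy."""
    rng = random.Random(seed)
    out = list(steps)
    rng.shuffle(out)
    return out

def delete_steps(steps, fraction=0.5, seed=0):
    """Structural perturbation: drop a fraction of the reasoning steps,
    keeping the survivors in their original order."""
    rng = random.Random(seed)
    keep = max(1, int(len(steps) * (1 - fraction)))
    idx = sorted(rng.sample(range(len(steps)), keep))
    return [steps[i] for i in idx]

# Toy Long CoT trace (hypothetical).
steps = [
    "Therefore, set x + 2 = 5.",
    "Wait, check the sign of the constant term.",
    "However, x must be positive, so x = 3.",
]
print(remove_keywords(steps))  # content intact, keywords gone
print(shuffle_steps(steps))    # same steps, broken ordering
print(delete_steps(steps))     # fewer steps, gaps in the logic
```

The asymmetry the paper observes is that only the latter two transformations, which damage the chain's logical structure, hurt the fine-tuned model's downstream accuracy.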

