

Demystifying Long Chain-of-Thought Reasoning in LLMs

February 5, 2025
作者: Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, Xiang Yue
cs.AI

Abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
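To make finding (2) concrete, reward shaping here means augmenting the correctness reward with a length-dependent term so that CoT length grows in a controlled way rather than collapsing or exploding during RL. The sketch below is a toy illustration under assumed names and coefficients (`shaped_reward`, the 0.5 weight, and the piecewise scheme are illustrative, not the paper's exact formulation):

```python
def shaped_reward(is_correct: bool, cot_len: int, max_len: int = 4096) -> float:
    """Toy shaped reward for RL on chain-of-thought generation.

    Combines a binary correctness reward with a length-shaping term:
    correct answers earn a small bonus for staying within budget, while
    incorrect answers are penalized in proportion to how long they ran.
    This is a hypothetical scheme, not the paper's actual reward function.
    """
    base = 1.0 if is_correct else 0.0
    # Length ratio clipped to [0, 1]; anything past max_len counts as 1.
    ratio = min(cot_len, max_len) / max_len
    if is_correct:
        # Small bonus for shorter correct chains discourages runaway growth.
        return base + 0.5 * (1.0 - ratio)
    # Penalize long incorrect chains to keep exploration in check.
    return base - 0.5 * ratio
```

The key design point such shaping addresses is that a pure correctness reward gives the policy no gradient on length, so CoT length can drift unstably as training compute grows.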
