Demystifying Long Chain-of-Thought Reasoning in LLMs

Abstract

Scaling inference compute enhances reasoning in large language models (LLMs), with long chains-of-thought (CoTs) enabling strategies like backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the mechanics of long CoT reasoning, identifying the key factors that enable models to generate long CoT trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we present four main findings: (1) While SFT is not strictly necessary, it simplifies training and improves efficiency; (2) Reasoning capabilities tend to emerge with increased training compute, but their development is not guaranteed, making reward shaping crucial for stabilizing CoT length growth; (3) Scaling verifiable reward signals is critical for RL. We find that leveraging noisy, web-extracted solutions with filtering mechanisms shows strong potential, particularly for out-of-distribution (OOD) tasks such as STEM reasoning; and (4) Core abilities like error correction are inherently present in base models, but incentivizing these skills effectively for complex tasks via RL demands significant compute, and measuring their emergence requires a nuanced approach. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs. Our code is available at: https://github.com/eddycmu/demystify-long-cot.
