Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search

Abstract

Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains. Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities. This typically involves extensive sampling at inference time guided by an external LLM verifier, resulting in a two-player system. Despite the reliance on external guidance, the effectiveness of this system demonstrates the potential of a single LLM to tackle complex tasks. Thus, we pose a new research problem: can we internalize the searching capabilities to fundamentally enhance the reasoning abilities of a single LLM? This work explores an orthogonal direction focusing on post-training LLMs for autoregressive searching (i.e., an extended reasoning process with self-reflection and self-exploration of new strategies). To achieve this, we propose Chain-of-Action-Thought (COAT) reasoning and a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning. Our approach results in Satori, a 7B LLM trained on open-source models and data. Extensive empirical evaluations demonstrate that Satori achieves state-of-the-art performance on mathematical reasoning benchmarks while exhibiting strong generalization to out-of-domain tasks. Code, data, and models will be fully open-sourced.

