rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Abstract

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%, respectively. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
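The abstract states that the PPM is trained without naïve step-level score annotation but does not give the objective. The sketch below illustrates one way such preference-based training could look. It assumes a Bradley-Terry style pairwise ranking loss and assumes that each candidate step carries a rough quality estimate (called `q_value` here) obtained from search rollouts. The function names, the `k` parameter, and the pairing rule are all hypothetical and are not taken from the rStar-Math code.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Assumption: the reward model emits one scalar score per reasoning step.
    """
    margin = score_preferred - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def build_preference_pairs(candidates, k=2):
    """Turn noisy per-step estimates into (preferred, rejected) step pairs.

    candidates: list of (step_text, q_value) tuples for the same prefix.
    Hypothetical rule: pair the k highest-valued steps with the k lowest-valued
    ones, so only the ordering of the estimates is used, not their exact values.
    """
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    preferred, rejected = ranked[:k], ranked[-k:]
    return [(p[0], r[0]) for p in preferred for r in rejected if p[1] > r[1]]


if __name__ == "__main__":
    steps = [("a", 0.9), ("b", 0.1), ("c", -0.5), ("d", 0.7)]
    print(build_preference_pairs(steps, k=2))
    print(pairwise_preference_loss(2.0, 0.0))
```

The point of the design is that the loss depends only on which step of a pair is better, so imprecise absolute scores for individual steps matter less than they would under direct score regression.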
