rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
January 8, 2025
Authors: Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang
cs.AI
Abstract
We present rStar-Math to demonstrate that small language models (SLMs) can
rival or even surpass the math reasoning capability of OpenAI o1, without
distillation from superior models. rStar-Math achieves this by exercising "deep
thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM
performs test-time search guided by an SLM-based process reward model.
rStar-Math introduces three innovations to tackle the challenges in training
the two SLMs: (1) a novel code-augmented CoT data synthesis method, which
performs extensive MCTS rollouts to generate step-by-step verified reasoning
trajectories used to train the policy SLM; (2) a novel process reward model
training method that avoids naïve step-level score annotation, yielding a
more effective process preference model (PPM); (3) a self-evolution recipe in
which the policy SLM and PPM are built from scratch and iteratively evolved to
improve reasoning capabilities. Through 4 rounds of self-evolution with
millions of synthesized solutions for 747k math problems, rStar-Math boosts
SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it
improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to
86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad
(AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among
the top 20% of the brightest high school math students. Code and data will be
available at https://github.com/microsoft/rStar.
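The core loop described in the abstract — a policy model proposing candidate reasoning steps while a process reward model scores partial trajectories to guide test-time search — can be sketched with a simplified step-level beam search. This is only an illustrative stand-in, not the paper's MCTS: `policy_propose` and `ppm_score` are hypothetical toy functions replacing the trained policy SLM and process preference model.

```python
# Toy sketch of reward-guided step-level search (an assumption-laden
# simplification of rStar-Math's MCTS-based "deep thinking"). The real
# system uses a trained policy SLM and a process preference model (PPM);
# here both are replaced by trivial stand-in functions.

def policy_propose(state):
    """Stand-in policy: propose candidate next reasoning steps.

    In rStar-Math this would be the policy SLM sampling step candidates;
    here each 'step' is just a token appended to the trajectory.
    """
    return [state + [tok] for tok in ("a", "b", "c")]

def ppm_score(trajectory):
    """Stand-in process reward: score a partial trajectory.

    The real PPM is trained on step-level preferences; this toy scorer
    simply prefers trajectories containing more 'a' steps.
    """
    return trajectory.count("a")

def guided_search(depth=3, beam=2):
    """Beam search over reasoning steps, guided by the scorer.

    A crude proxy for MCTS rollouts: at each depth, expand every
    frontier state, rank all candidates by the reward model, and keep
    the top `beam` partial trajectories.
    """
    frontier = [[]]  # start from the empty trajectory
    for _ in range(depth):
        candidates = [s for state in frontier for s in policy_propose(state)]
        candidates.sort(key=ppm_score, reverse=True)  # stable sort
        frontier = candidates[:beam]
    return frontier[0]

best = guided_search()
# With this toy scorer, the highest-reward trajectory is ["a", "a", "a"].
```

The design point this illustrates is the separation of concerns in the abstract: the policy only needs to generate plausible next steps, while the reward model prunes the search — which is what lets a small policy model benefit from extra test-time compute.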