rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
January 8, 2025
Authors: Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, Mao Yang
cs.AI
Abstract
We present rStar-Math to demonstrate that small language models (SLMs) can
rival or even surpass the math reasoning capability of OpenAI o1, without
distillation from superior models. rStar-Math achieves this by exercising "deep
thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM
performs test-time search guided by an SLM-based process reward model.
rStar-Math introduces three innovations to tackle the challenges in training
the two SLMs: (1) a novel code-augmented CoT data synthesis method, which
performs extensive MCTS rollouts to generate step-by-step verified reasoning
trajectories used to train the policy SLM; (2) a novel process reward model
training method that avoids naïve step-level score annotation, yielding a
more effective process preference model (PPM); (3) a self-evolution recipe in
which the policy SLM and PPM are built from scratch and iteratively evolved to
improve reasoning capabilities. Through 4 rounds of self-evolution with
millions of synthesized solutions for 747k math problems, rStar-Math boosts
SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it
improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to
86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad
(AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among
the top 20% of the brightest high school math students. Code and data will be
available at https://github.com/microsoft/rStar.
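To make the "deep thinking" loop concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of MCTS-guided step-by-step reasoning: a policy proposes candidate next steps, a process reward model scores partial trajectories, and UCT balances exploration against exploitation during search. Here `toy_policy` and `toy_reward` are hypothetical stand-ins for the paper's policy SLM and process preference model (PPM).

```python
import math
import random

random.seed(0)

class Node:
    """One node in the search tree over partial reasoning trajectories."""
    def __init__(self, steps, parent=None):
        self.steps = steps          # partial reasoning trajectory (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # sum of backpropagated rewards

def toy_policy(steps):
    # Hypothetical policy: propose two candidate continuations of the trajectory.
    return [steps + [f"step{len(steps)}a"], steps + [f"step{len(steps)}b"]]

def toy_reward(steps):
    # Hypothetical process reward: prefers 'a' steps, standing in for a
    # learned step-level preference model (PPM) scoring partial solutions.
    return sum(1.0 for s in steps if s.endswith("a")) / max(len(steps), 1)

def uct(child, parent, c=1.4):
    # Upper Confidence bound for Trees; unvisited children are tried first.
    if child.visits == 0:
        return float("inf")
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return exploit + explore

def mcts(root_steps, max_depth=3, rollouts=64):
    root = Node(root_steps)
    for _ in range(rollouts):
        node = root
        # Selection: descend through expanded nodes via UCT.
        while node.children:
            node = max(node.children, key=lambda ch: uct(ch, node))
        # Expansion: add candidate next steps unless at max depth.
        if len(node.steps) < max_depth:
            node.children = [Node(s, parent=node) for s in toy_policy(node.steps)]
            node = random.choice(node.children)
        # Evaluation: score the partial trajectory with the reward model.
        reward = toy_reward(node.steps)
        # Backpropagation: update visit counts and values up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Decision: pick the most-visited first step, the usual MCTS rule.
    return max(root.children, key=lambda ch: ch.visits).steps[-1]

print(mcts([]))  # the higher-reward 'a' branch should dominate
```

In rStar-Math the rollouts additionally execute the code embedded in each CoT step to verify it, and the visit/value statistics gathered here are what produce the step-by-step verified trajectories used for training; this sketch only shows the bare search skeleton.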