rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Abstract

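We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data are available at https://github.com/microsoft/rStar.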

English

We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising "deep thinking" through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naive step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs' math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students. Code and data will be available at https://github.com/microsoft/rStar.
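
The abstract describes test-time search with MCTS guided by a reward model but does not give the search details. As a rough illustration only, the sketch below shows the textbook UCT selection and backpropagation steps of MCTS. The `Node` class, the exploration constant `c`, and the scalar `reward` are assumptions made for this example; they are not taken from rStar-Math, where the reward signal would come from the PPM.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate reasoning step in the search tree (hypothetical structure)."""
    step: str
    visits: int = 0
    total_reward: float = 0.0
    children: list[Node] = field(default_factory=list)

    def q_value(self) -> float:
        # Mean reward observed through this node; 0.0 if never visited.
        return self.total_reward / self.visits if self.visits else 0.0


def uct_score(parent: Node, child: Node, c: float = 1.4) -> float:
    """Standard UCT: exploitation term plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited children first
    explore = c * math.sqrt(math.log(parent.visits) / child.visits)
    return child.q_value() + explore


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct_score(parent, ch, c))


def backpropagate(path: list[Node], reward: float) -> None:
    """Credit a rollout's reward to every node on the path from the root."""
    for node in path:
        node.visits += 1
        node.total_reward += reward


# Usage: two candidate steps under one root.
root, a, b = Node("root"), Node("step A"), Node("step B")
root.children = [a, b]
backpropagate([root, a], reward=1.0)   # rollout through A scored well
first_pick = select_child(root)        # B is unvisited, so it is chosen
backpropagate([root, b], reward=0.0)   # rollout through B scored poorly
second_pick = select_child(root)       # A now has the higher UCT score
```

With equal visit counts the exploration bonus is the same for both children, so the choice falls to the mean reward, which is how a stronger step-level score steers the search toward more promising reasoning steps.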
