BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning

Abstract

Cutting-edge large language models (LLMs) demonstrate promising performance in solving complex math problems with a divide-and-conquer pipeline and the assistance of in-context learning (ICL) examples. However, their potential for improvement is limited by two critical problems in the ICL examples: granularity mismatch and the ensuing negative-effect noise. Specifically, LLMs handle the dividing process well but mostly fail through inaccurate reasoning within a few conquer steps, while ICL examples retrieved at question granularity sometimes lack steps relevant to a specific challenging reasoning step. Further, this mismatch may hinder correct reasoning because the retrieved example is irrelevant. To this end, we focus on improving the reasoning quality within each step and present BoostStep. BoostStep aligns the granularity of retrieval and reasoning at the step level, and provides highly related ICL examples for each reasoning step with a novel "first-try" strategy. BoostStep provides more relevant examples than the coarse question-grained strategy, steadily enhancing the model's reasoning quality within each step. BoostStep is a general and robust reasoning-enhancing method that not only improves standalone reasoning performance but also integrates seamlessly with Monte Carlo Tree Search (MCTS) methods to refine both candidate generation and decision-making. Quantitatively, it improves GPT-4o and Qwen2.5-Math-72B by 3.6% and 2.0% respectively on various mathematical benchmarks, and yields a 7.5% gain when combined with MCTS.
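The step-level retrieval idea can be illustrated with a small sketch. It is not the authors' implementation. It assumes one plausible reading of the "first-try" strategy: draft a step without any example, use the draft to look up a similar solved step, and redo the step with that example only if the match is close enough. The names `StepExample`, `retrieve_step`, `solve`, the token-overlap similarity, and the `threshold` value are all hypothetical, and `generate_step` stands in for an LLM call.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepExample:
    """One solved reasoning step in a hypothetical step-level example bank."""
    problem: str
    step: str


def _tokens(text: str) -> set:
    # Lowercased alphanumeric tokens; a stand-in for a real retriever's features.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(a: str, b: str) -> float:
    """Jaccard token overlap (an assumed, deliberately simple similarity)."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_step(draft: str, bank: List[StepExample],
                  threshold: float) -> Optional[StepExample]:
    """Return the most similar stored step, or None if nothing is close enough.

    Returning None models the idea that an irrelevant example is worse than
    no example, so low-similarity matches are dropped.
    """
    best = max(bank, key=lambda ex: similarity(draft, ex.step), default=None)
    if best is None or similarity(draft, best.step) < threshold:
        return None
    return best


# generate_step(question, previous_steps, example) -> next step, or None when done.
StepGenerator = Callable[[str, List[str], Optional[StepExample]], Optional[str]]


def solve(question: str, generate_step: StepGenerator, bank: List[StepExample],
          threshold: float = 0.3, max_steps: int = 8) -> List[str]:
    steps: List[str] = []
    for _ in range(max_steps):
        # First try: draft the next step with no in-context example.
        draft = generate_step(question, steps, None)
        if draft is None:
            break
        # Retrieve at step granularity, keyed on the draft rather than the question.
        example = retrieve_step(draft, bank, threshold)
        final = generate_step(question, steps, example) if example else draft
        # Keep the draft if the guided attempt produced nothing.
        steps.append(final if final is not None else draft)
    return steps
```

In this sketch the retrieval key is the drafted step rather than the whole question, which is what aligns the granularity of retrieval with the granularity of reasoning. A real system would replace the token overlap with a learned retriever and `generate_step` with actual model calls.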
