The Lessons of Developing Process Reward Models in Mathematical Reasoning

Abstract

Process Reward Models (PRMs) have emerged as a promising approach for process supervision in the mathematical reasoning of Large Language Models (LLMs), aiming to identify and mitigate intermediate errors in the reasoning process. However, developing effective PRMs faces significant challenges, particularly in data annotation and evaluation methodology. In this paper, through extensive experiments, we demonstrate that the commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate the correctness of the current step, which leads to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. (2) PRMs' tolerance of such responses leads to inflated BoN scores. (3) In existing PRMs, a significant proportion of minimum scores is concentrated on the final answer step, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Building on these mechanisms, we significantly improve both model performance and data efficiency in BoN evaluation and in the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives, and we provide practical guidelines for future research on building process supervision models.
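
As an illustration only, the consensus filtering idea mentioned above might be sketched as follows. The abstract does not state the agreement rule, so this sketch assumes one: a training sample is kept only when MC estimation and the LLM judge locate the first erroneous step at the same position (or both find no error). The function names, the dictionary fields, and the hard-label threshold of 0.0 are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


def first_error_from_mc(mc_scores: List[float], threshold: float = 0.0) -> Optional[int]:
    """Index of the first step whose MC estimate is at or below `threshold`.

    Assumption: mc_scores[i] is the fraction of sampled completions from
    step i that reach the correct final answer; a step counts as wrong
    when that fraction is <= threshold. Returns None if no step is flagged.
    """
    for i, score in enumerate(mc_scores):
        if score <= threshold:
            return i
    return None


def first_error_from_judge(judge_verdicts: List[bool]) -> Optional[int]:
    """Index of the first step the (hypothetical) LLM judge marks incorrect."""
    for i, is_correct in enumerate(judge_verdicts):
        if not is_correct:
            return i
    return None


def consensus_filter(samples: List[Dict]) -> List[Dict]:
    """Keep only samples where both annotators agree on the first error step."""
    kept = []
    for sample in samples:
        mc_pos = first_error_from_mc(sample["mc_scores"])
        judge_pos = first_error_from_judge(sample["judge_verdicts"])
        if mc_pos == judge_pos:
            kept.append(sample)
    return kept


if __name__ == "__main__":
    samples = [
        # Both annotators flag step 2: kept.
        {"id": "a", "mc_scores": [1.0, 0.5, 0.0], "judge_verdicts": [True, True, False]},
        # MC finds no error, the judge flags step 1: dropped.
        {"id": "b", "mc_scores": [1.0, 0.75, 0.25], "judge_verdicts": [True, False, False]},
        # Neither annotator finds an error: kept.
        {"id": "c", "mc_scores": [0.9, 0.8], "judge_verdicts": [True, True]},
    ]
    print([s["id"] for s in consensus_filter(samples)])  # ['a', 'c']
```

Sample "b" shows the kind of disagreement the filter is meant to remove: the completion model still reaches the correct answer after a step the judge considers flawed, so the MC label alone would mark that step as correct.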
