The Lessons of Developing Process Reward Models in Mathematical Reasoning

January 13, 2025
Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
cs.AI

Abstract

Process Reward Models (PRMs) have emerged as a promising approach for process supervision in the mathematical reasoning of Large Language Models (LLMs), aiming to identify and mitigate intermediate errors in the reasoning process. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) Unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objective of process verification. (2) The tolerance of PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs concentrate a significant proportion of their minimum scores on the final answer steps, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on these mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provide practical guidelines for future research in building process supervision models.
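To make the consensus-filtering mechanism concrete, below is a minimal sketch of how such a filter might be wired up, assuming the per-step MC estimates and LLM-as-a-judge verdicts have already been computed. The function name, input format, and 0.5 threshold are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the consensus-filtering idea from the abstract: a step label
# is retained for PRM training only when the Monte Carlo (MC) estimate and the
# LLM-as-a-judge verdict agree. The input formats and the 0.5 threshold are
# assumptions for illustration, not the authors' released code.

from typing import List, Optional, Tuple


def consensus_filter(
    mc_scores: List[float],        # per-step MC estimate: fraction of rollouts
                                   # from this step that reach the correct answer
    judge_verdicts: List[bool],    # per-step LLM-as-a-judge correctness verdict
    mc_threshold: float = 0.5,     # assumed cutoff for calling a step correct
) -> List[Optional[Tuple[int, bool]]]:
    """Return (step_index, is_correct) when both annotators agree, else None
    so the disagreeing step can be dropped from the training set."""
    filtered: List[Optional[Tuple[int, bool]]] = []
    for k, (score, judge_ok) in enumerate(zip(mc_scores, judge_verdicts)):
        mc_ok = score >= mc_threshold
        filtered.append((k, mc_ok) if mc_ok == judge_ok else None)
    return filtered


if __name__ == "__main__":
    # Toy example: four reasoning steps of one candidate solution.
    mc = [0.9, 0.7, 0.2, 0.1]
    judge = [True, False, False, False]
    # Steps 0, 2, and 3 agree and keep their labels; step 1 disagrees and is dropped.
    print(consensus_filter(mc, judge))
    # -> [(0, True), None, (2, False), (3, False)]
```

The design choice in this sketch is simply to discard steps on which the two annotators disagree, trading annotation volume for label quality, which is consistent with the data-efficiency gains the abstract reports for the combined approach.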

