The Lessons of Developing Process Reward Models in Mathematical Reasoning
January 13, 2025
Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
Process Reward Models (PRMs) emerge as a promising approach for process
supervision in mathematical reasoning of Large Language Models (LLMs), which
aim to identify and mitigate intermediate errors in the reasoning processes.
However, the development of effective PRMs faces significant challenges,
particularly in data annotation and evaluation methodologies. In this paper,
through extensive experiments, we demonstrate that commonly used Monte Carlo
(MC) estimation-based data synthesis for PRMs typically yields inferior
performance and generalization compared to LLM-as-a-judge and human annotation
methods. MC estimation relies on completion models to evaluate current-step
correctness, leading to inaccurate step verification. Furthermore, we identify
potential biases in conventional Best-of-N (BoN) evaluation strategies for
PRMs: (1) Unreliable policy models generate responses with correct answers
but flawed processes, leading to a misalignment between the evaluation criteria
of BoN and the PRM objectives of process verification. (2) The tolerance of
PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a
significant proportion of minimum scores concentrated on the final answer
steps, revealing a shift from process-based to outcome-based assessment in
BoN-optimized PRMs. To address these challenges, we develop a consensus filtering
mechanism that effectively integrates MC estimation with LLM-as-a-judge and
advocates a more comprehensive evaluation framework that combines
response-level and step-level metrics. Based on these mechanisms, we
significantly improve both model performance and data efficiency in the BoN
evaluation and the step-wise error identification task. Finally, we release a
new state-of-the-art PRM that outperforms existing open-source alternatives, and
we provide practical guidelines for future research in building process
supervision models.
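
To make the data-synthesis criticism concrete, the following is a minimal Python sketch of MC-estimation-based step labeling as commonly practiced, not the authors' implementation; the helpers sample_completion and extract_final_answer are hypothetical placeholders.

# Minimal sketch (not the authors' implementation) of Monte Carlo (MC)
# estimation for PRM step labels: a step is judged by rolling out
# completions from it and checking whether the gold answer is reached.
from typing import Callable, List

def extract_final_answer(text: str) -> str:
    # Toy extractor; real pipelines typically parse e.g. a \boxed{...} span.
    lines = text.strip().splitlines()
    return lines[-1].strip() if lines else ""

def mc_step_label(question: str,
                  steps_so_far: List[str],
                  sample_completion: Callable[[str], str],  # hypothetical completer
                  gold_answer: str,
                  num_rollouts: int = 8) -> int:
    # Label the latest step: 1 if any rollout from this prefix reaches the
    # gold final answer, else 0. As the abstract notes, the label depends on
    # the completion model: a strong completer can recover from a flawed
    # step, so incorrect steps may be mislabeled as correct.
    prefix = question + "\n" + "\n".join(steps_so_far)
    hits = sum(
        extract_final_answer(sample_completion(prefix)) == gold_answer
        for _ in range(num_rollouts)
    )
    return 1 if hits > 0 else 0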
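The consensus filtering described in the abstract can be sketched in the same spirit; the judge prompt and the llm callable below are assumptions for illustration, and the code simply keeps a step label only when MC estimation and the LLM judge agree.

from typing import Callable, List, Optional

def judge_step(question: str, steps_so_far: List[str],
               llm: Callable[[str], str]) -> int:
    # Ask a strong LLM to verify the latest step; 1 = correct, 0 = flawed.
    prompt = ("Question:\n" + question + "\n\nPartial solution:\n"
              + "\n".join(steps_so_far)
              + "\n\nIs the last step mathematically correct? Answer yes or no.")
    return 1 if llm(prompt).strip().lower().startswith("yes") else 0

def consensus_label(mc_label: int, judge_label: int) -> Optional[int]:
    # Keep a training label only when both annotation sources agree;
    # disagreements are filtered out, trading data volume for label quality.
    return mc_label if mc_label == judge_label else None

In this sketch, disagreement simply discards the example, which is one plausible reading of consensus filtering; the released PRM and its data pipeline should be consulted for the actual procedure.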