The Lessons of Developing Process Reward Models in Mathematical Reasoning
January 13, 2025
Authors: Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
cs.AI
Abstract
Process Reward Models (PRMs) emerge as a promising approach for process
supervision in mathematical reasoning of Large Language Models (LLMs), which
aim to identify and mitigate intermediate errors in the reasoning processes.
However, the development of effective PRMs faces significant challenges,
particularly in data annotation and evaluation methodologies. In this paper,
through extensive experiments, we demonstrate that commonly used Monte Carlo
(MC) estimation-based data synthesis for PRMs typically yields inferior
performance and generalization compared to LLM-as-a-judge and human annotation
methods. MC estimation relies on completion models to evaluate current-step
correctness, leading to inaccurate step verification. Furthermore, we identify
potential biases in conventional Best-of-N (BoN) evaluation strategies for
PRMs: (1) Unreliable policy models generate responses with correct answers
but flawed processes, leading to a misalignment between the evaluation criteria
of BoN and the PRM objectives of process verification. (2) The tolerance of
PRMs for such responses leads to inflated BoN scores. (3) Existing PRMs have a
significant proportion of minimum scores concentrated on the final answer
steps, revealing a shift from process-based to outcome-based assessment in
BoN-optimized PRMs. To address these challenges, we develop a consensus filtering
mechanism that effectively integrates MC estimation with LLM-as-a-judge and
advocates a more comprehensive evaluation framework that combines
response-level and step-level metrics. Based on these mechanisms, we
significantly improve both model performance and data efficiency in the BoN
evaluation and the step-wise error identification task. Finally, we release a
new state-of-the-art PRM that outperforms existing open-source alternatives, and
we provide practical guidelines for future research in building process
supervision models.
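
To make the data-synthesis criticism concrete, the following is a minimal Python sketch of MC-estimation-based step labeling as commonly practiced, not the authors' implementation; the helpers sample_completion and extract_final_answer are hypothetical placeholders.

# Minimal sketch (not the authors' implementation) of Monte Carlo (MC)
# estimation for PRM step labels: a step is judged by rolling out
# completions from it and checking whether the gold answer is reached.
from typing import Callable, List

def extract_final_answer(text: str) -> str:
    # Toy extractor; real pipelines typically parse e.g. a \boxed{...} span.
    lines = text.strip().splitlines()
    return lines[-1].strip() if lines else ""

def mc_step_label(question: str,
                  steps_so_far: List[str],
                  sample_completion: Callable[[str], str],  # hypothetical completer
                  gold_answer: str,
                  num_rollouts: int = 8) -> int:
    # Label the latest step: 1 if any rollout from this prefix reaches the
    # gold final answer, else 0. As the abstract notes, the label depends on
    # the completion model: a strong completer can recover from a flawed
    # step, so incorrect steps may be mislabeled as correct.
    prefix = question + "\n" + "\n".join(steps_so_far)
    hits = sum(
        extract_final_answer(sample_completion(prefix)) == gold_answer
        for _ in range(num_rollouts)
    )
    return 1 if hits > 0 else 0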
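The consensus filtering described in the abstract can be sketched in the same spirit; the judge prompt and the llm callable below are assumptions for illustration, and the code simply keeps a step label only when MC estimation and the LLM judge agree.

from typing import Callable, List, Optional

def judge_step(question: str, steps_so_far: List[str],
               llm: Callable[[str], str]) -> int:
    # Ask a strong LLM to verify the latest step; 1 = correct, 0 = flawed.
    prompt = ("Question:\n" + question + "\n\nPartial solution:\n"
              + "\n".join(steps_so_far)
              + "\n\nIs the last step mathematically correct? Answer yes or no.")
    return 1 if llm(prompt).strip().lower().startswith("yes") else 0

def consensus_label(mc_label: int, judge_label: int) -> Optional[int]:
    # Keep a training label only when both annotation sources agree;
    # disagreements are filtered out, trading data volume for label quality.
    return mc_label if mc_label == judge_label else None

In this sketch, disagreement simply discards the example, which is one plausible reading of consensus filtering; the released PRM and its data pipeline should be consulted for the actual procedure.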