GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning
April 1, 2025
Authors: Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou
cs.AI
Abstract
Recent advancements in Large Language Models (LLMs) have shown that it is
promising to utilize Process Reward Models (PRMs) as verifiers to enhance the
performance of LLMs. However, current PRMs face three key challenges: (1)
limited process supervision and generalization capabilities, (2) dependence on
scalar value prediction without leveraging the generative abilities of LLMs,
and (3) inability to scale the test-time compute of PRMs. In this work, we
introduce GenPRM, a generative process reward model that performs explicit
Chain-of-Thought (CoT) reasoning with code verification before providing
judgment for each reasoning step. To obtain high-quality process supervision
labels and rationale data, we propose Relative Progress Estimation (RPE) and a
rationale synthesis framework that incorporates code verification. Experimental
results on ProcessBench and several mathematical reasoning tasks show that
GenPRM significantly outperforms prior PRMs with only 23K training examples from
the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and
a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally,
GenPRM demonstrates strong abilities to serve as a critic model for policy
model refinement. This work establishes a new paradigm for process supervision
that bridges the gap between PRMs and critic models in LLMs. Our code, model,
and data will be available at https://ryanliu112.github.io/GenPRM.