GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

April 1, 2025

Authors: Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou

cs.AI

Abstract

Recent advances in Large Language Models (LLMs) have shown that using Process Reward Models (PRMs) as verifiers is a promising way to enhance LLM performance. However, current PRMs face three key challenges: (1) limited process supervision and generalization capabilities, (2) dependence on scalar value prediction without leveraging the generative abilities of LLMs, and (3) inability to scale the test-time compute of PRMs. In this work, we introduce GenPRM, a generative process reward model that performs explicit Chain-of-Thought (CoT) reasoning with code verification before providing a judgment for each reasoning step. To obtain high-quality process supervision labels and rationale data, we propose Relative Progress Estimation (RPE) and a rationale synthesis framework that incorporates code verification. Experimental results on ProcessBench and several mathematical reasoning tasks show that GenPRM significantly outperforms prior PRMs with only 23K training examples from the MATH dataset. Through test-time scaling, a 1.5B GenPRM outperforms GPT-4o, and a 7B GenPRM surpasses Qwen2.5-Math-PRM-72B on ProcessBench. Additionally, GenPRM demonstrates strong abilities to serve as a critic model for policy model refinement. This work establishes a new paradigm for process supervision that bridges the gap between PRMs and critic models in LLMs. Our code, model, and data will be available at https://ryanliu112.github.io/GenPRM.
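A generative verifier can spend more test-time compute by sampling several independent judgments for the same reasoning step and aggregating them. The abstract does not state which aggregation GenPRM uses, so the following is only a minimal sketch under that assumption. The function names, the `sample_judgment` callback, and the 0.5 threshold are hypothetical and are not taken from the authors' code.

```python
from typing import Callable, List, Optional


def aggregate_step_judgment(
    sample_judgment: Callable[[], bool], n_samples: int = 8
) -> float:
    """Score one reasoning step by vote fraction.

    `sample_judgment` is a hypothetical callback that runs one sampled
    verification (for example, one CoT-plus-code rationale) and returns
    True if that sample judges the step correct.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    votes = [sample_judgment() for _ in range(n_samples)]
    # Fraction of samples that voted "correct"; more samples means more
    # test-time compute spent on this step.
    return sum(votes) / n_samples


def first_error_step(
    step_scores: List[float], threshold: float = 0.5
) -> Optional[int]:
    """Return the index of the first step scored below `threshold`,
    or None if every step passes. The threshold is an assumed value."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None
```

In this sketch, raising `n_samples` trades extra verifier calls for a less noisy per-step score, which is one simple way to picture scaling a PRM's test-time compute.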
