Inference-Time Scaling for Generalist Reward Modeling
April 3, 2025
作者: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
cs.AI
Abstract
Reinforcement learning (RL) has been widely adopted in post-training for
large language models (LLMs) at scale. Recently, the incentivization of
reasoning capabilities in LLMs from RL indicates that proper learning
methods could enable effective inference-time scalability. A key challenge of
RL is to obtain accurate reward signals for LLMs in various domains beyond
verifiable questions or artificial rules. In this work, we investigate how to
improve reward modeling (RM) with more inference compute for general queries,
i.e. the inference-time scalability of generalist RM, and further,
how to improve the effectiveness of performance-compute scaling with proper
learning methods. For the RM approach, we adopt pointwise generative reward
modeling (GRM) to enable flexibility for different input types and potential
for inference-time scaling. For the learning method, we propose Self-Principled
Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs
through online RL, to generate principles adaptively and critiques accurately,
resulting in DeepSeek-GRM models. Furthermore, for effective
inference-time scaling, we use parallel sampling to expand compute usage, and
introduce a meta RM to guide the voting process for better scaling performance.
Empirically, we show that SPCT significantly improves the quality and
scalability of GRMs, outperforming existing methods and models in various RM
benchmarks without severe biases, and could achieve better performance compared
to training-time scaling. DeepSeek-GRM still faces challenges in some tasks,
which we believe can be addressed by future efforts in generalist reward
systems. The models will be released and open-sourced.
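To make the inference-time scaling recipe described above concrete, the following is a minimal sketch of parallel sampling with voting over pointwise rewards, optionally guided by a meta RM. It assumes the GRM has already produced k sets of pointwise scores (one per sampled principle-and-critique generation) and that a meta RM has assigned each sample a quality score; all identifiers here (vote, Rewards, samples, meta_scores, top_m) are illustrative placeholders rather than the paper's actual interface.

```python
# Minimal sketch of inference-time scaling for a pointwise generative RM:
# sample k principle+critique generations in parallel, extract pointwise
# scores per candidate response, and aggregate them by voting. Optionally,
# a meta RM filters the samples before voting. Names are hypothetical.

from collections import defaultdict
from typing import Dict, List, Optional

Rewards = Dict[str, float]  # response id -> pointwise score from one GRM sample


def vote(samples: List[Rewards],
         meta_scores: Optional[List[float]] = None,
         top_m: Optional[int] = None) -> Rewards:
    """Aggregate k sampled pointwise reward sets into one score per response.

    Plain voting sums scores across all samples; meta-RM-guided voting keeps
    only the top_m samples ranked by the meta RM's quality score before summing.
    """
    if meta_scores is not None and top_m is not None:
        # Keep the samples the meta RM judges most reliable.
        ranked = sorted(range(len(samples)),
                        key=lambda i: meta_scores[i], reverse=True)
        samples = [samples[i] for i in ranked[:top_m]]

    totals: Rewards = defaultdict(float)
    for rewards in samples:
        for resp_id, score in rewards.items():
            totals[resp_id] += score
    return dict(totals)


if __name__ == "__main__":
    # Three parallel GRM samples scoring two candidate responses (dummy values).
    samples = [{"A": 8.0, "B": 6.0}, {"A": 5.0, "B": 7.0}, {"A": 9.0, "B": 4.0}]
    meta_scores = [0.9, 0.2, 0.8]          # hypothetical meta-RM quality scores
    print(vote(samples))                   # plain voting over all samples
    print(vote(samples, meta_scores, 2))   # meta-RM-guided voting (top 2 kept)
```

The intent captured here is that summing pointwise scores across more parallel samples expands the effective reward signal with compute, while the meta RM discards low-quality generations before voting; the exact scoring format and filtering rule used by DeepSeek-GRM may differ from this sketch.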