Inference-Time Scaling for Generalist Reward Modeling
April 3, 2025
作者: Zijun Liu, Peiyi Wang, Runxin Xu, Shirong Ma, Chong Ruan, Peng Li, Yang Liu, Yu Wu
cs.AI
Abstract
Reinforcement learning (RL) has been widely adopted in post-training for
large language models (LLMs) at scale. Recently, the incentivization of
reasoning capabilities in LLMs from RL indicates that proper learning
methods could enable effective inference-time scalability. A key challenge of
RL is to obtain accurate reward signals for LLMs in various domains beyond
verifiable questions or artificial rules. In this work, we investigate how to
improve reward modeling (RM) with more inference compute for general queries,
i.e. the inference-time scalability of generalist RM, and further,
how to improve the effectiveness of performance-compute scaling with proper
learning methods. For the RM approach, we adopt pointwise generative reward
modeling (GRM) to enable flexibility for different input types and potential
for inference-time scaling. For the learning method, we propose Self-Principled
Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs
through online RL, to generate principles adaptively and critiques accurately,
resulting in DeepSeek-GRM models. Furthermore, for effective
inference-time scaling, we use parallel sampling to expand compute usage, and
introduce a meta RM to guide the voting process for better scaling performance.
Empirically, we show that SPCT significantly improves the quality and
scalability of GRMs, outperforming existing methods and models in various RM
benchmarks without severe biases, and could achieve better performance compared
to training-time scaling. DeepSeek-GRM still faces challenges in some tasks,
which we believe can be addressed by future efforts in generalist reward
systems. The models will be released and open-sourced.
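To make the inference-time scaling recipe described above concrete, the following is a minimal sketch of parallel sampling with voting over pointwise rewards, optionally guided by a meta RM. It assumes the GRM has already produced k sets of pointwise scores (one per sampled principle-and-critique generation) and that a meta RM has assigned each sample a quality score; all identifiers here (vote, Rewards, samples, meta_scores, top_m) are illustrative placeholders rather than the paper's actual interface.

```python
# Minimal sketch of inference-time scaling for a pointwise generative RM:
# sample k principle+critique generations in parallel, extract pointwise
# scores per candidate response, and aggregate them by voting. Optionally,
# a meta RM filters the samples before voting. Names are hypothetical.

from collections import defaultdict
from typing import Dict, List, Optional

Rewards = Dict[str, float]  # response id -> pointwise score from one GRM sample


def vote(samples: List[Rewards],
         meta_scores: Optional[List[float]] = None,
         top_m: Optional[int] = None) -> Rewards:
    """Aggregate k sampled pointwise reward sets into one score per response.

    Plain voting sums scores across all samples; meta-RM-guided voting keeps
    only the top_m samples ranked by the meta RM's quality score before summing.
    """
    if meta_scores is not None and top_m is not None:
        # Keep the samples the meta RM judges most reliable.
        ranked = sorted(range(len(samples)),
                        key=lambda i: meta_scores[i], reverse=True)
        samples = [samples[i] for i in ranked[:top_m]]

    totals: Rewards = defaultdict(float)
    for rewards in samples:
        for resp_id, score in rewards.items():
            totals[resp_id] += score
    return dict(totals)


if __name__ == "__main__":
    # Three parallel GRM samples scoring two candidate responses (dummy values).
    samples = [{"A": 8.0, "B": 6.0}, {"A": 5.0, "B": 7.0}, {"A": 9.0, "B": 4.0}]
    meta_scores = [0.9, 0.2, 0.8]          # hypothetical meta-RM quality scores
    print(vote(samples))                   # plain voting over all samples
    print(vote(samples, meta_scores, 2))   # meta-RM-guided voting (top 2 kept)
```

The intent captured here is that summing pointwise scores across more parallel samples expands the effective reward signal with compute, while the meta RM discards low-quality generations before voting; the exact scoring format and filtering rule used by DeepSeek-GRM may differ from this sketch.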