

Process Reinforcement through Implicit Rewards

February 3, 2025
Authors: Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, Ning Ding
cs.AI

Abstract

Dense process rewards have proven a more effective alternative to sparse outcome-level rewards in the inference-time scaling of large language models (LLMs), particularly in tasks requiring complex multi-step reasoning. Dense rewards are also an appealing choice for the reinforcement learning (RL) of LLMs, since their fine-grained feedback has the potential to address inherent issues of outcome rewards such as training efficiency and credit assignment, yet this potential remains largely unrealized. This can be primarily attributed to the challenges of training process reward models (PRMs) online, where collecting high-quality process labels is prohibitively expensive, making them particularly vulnerable to reward hacking. To address these challenges, we propose PRIME (Process Reinforcement through IMplicit rEwards), which enables online PRM updates using only policy rollouts and outcome labels through implicit process rewards. PRIME combines well with various advantage functions and forgoes the dedicated reward model training phase that existing approaches require, substantially reducing the development overhead. We demonstrate PRIME's effectiveness on competition math and coding. Starting from Qwen2.5-Math-7B-Base, PRIME achieves a 15.1% average improvement over the SFT model across several key reasoning benchmarks. Notably, our resulting model, Eurus-2-7B-PRIME, surpasses Qwen2.5-Math-7B-Instruct on seven reasoning benchmarks using only 10% of its training data.
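The abstract's central idea, dense process rewards obtained without step-level labels, can be illustrated with a minimal sketch. Assuming the log-ratio formulation of implicit rewards (per-token reward as a scaled log-likelihood ratio between the implicit PRM and a frozen reference model), the snippet below shows how such dense rewards could be computed from token log-probabilities; the function name `implicit_process_rewards`, the `beta` value, and the tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import torch

def implicit_process_rewards(prm_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             beta: float = 0.05) -> torch.Tensor:
    """Per-token process rewards as a scaled log-likelihood ratio.

    prm_logprobs / ref_logprobs: log-probabilities of the sampled response
    tokens under the implicit PRM and a frozen reference model, shape
    (seq_len,). beta and the exact aggregation are assumptions here.
    """
    return beta * (prm_logprobs - ref_logprobs)

# Hypothetical usage: dense per-token rewards for one rollout, to be
# combined with a sparse outcome reward (e.g. 1.0 if the final answer
# passes an automatic verifier) when estimating advantages.
prm_lp = torch.tensor([-1.2, -0.4, -2.0, -0.1])
ref_lp = torch.tensor([-1.5, -0.9, -1.8, -0.6])
dense_rewards = implicit_process_rewards(prm_lp, ref_lp)
print(dense_rewards)  # fine-grained credit for each generated token
```

Because the implicit PRM shares the language-model parameterization, it can, in principle, be updated online from the same rollouts and outcome labels used for policy training, which is what removes the separate reward-model training phase described in the abstract.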

