Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Abstract

Reward models (RMs) are crucial for both the training and the inference-time scaling of large language models (LLMs). However, existing reward models primarily focus on human preferences and neglect verifiable correctness signals, which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following. We conduct comprehensive experiments on existing reward model benchmarks and on inference-time best-of-n search over real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to training with conventional reward models. Our code is publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).
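To make the idea concrete, the following is a minimal sketch of how a preference reward might be combined with verifiable signals, then used for best-of-n selection and for building a (chosen, rejected) pair for DPO-style training. It is not the released RewardAgent implementation: the class `RewardAgentSketch`, the toy scorers, the `verifier_weight` parameter, and the weighted-sum aggregation are all assumptions made for illustration, and the abstract does not specify how the signals are actually aggregated.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A scorer maps (instruction, response) to a score in [0, 1].
Scorer = Callable[[str, str], float]


@dataclass
class RewardAgentSketch:
    """Hypothetical stand-in for a reward agent; not the released RewardAgent API."""
    preference: Scorer             # human-preference reward model
    verifiers: Sequence[Scorer]    # e.g. factuality and instruction-following checks
    verifier_weight: float = 1.0   # assumed weighting, chosen only for illustration

    def score(self, instruction: str, response: str) -> float:
        # Assumption: signals are aggregated by a simple weighted sum.
        total = self.preference(instruction, response)
        for verify in self.verifiers:
            total += self.verifier_weight * verify(instruction, response)
        return total


def best_of_n(agent: RewardAgentSketch, instruction: str,
              candidates: Sequence[str]) -> str:
    """Return the candidate with the highest combined reward."""
    return max(candidates, key=lambda r: agent.score(instruction, r))


def preference_pair(agent: RewardAgentSketch, instruction: str,
                    candidates: Sequence[str]) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-scoring candidates."""
    ranked = sorted(candidates, key=lambda r: agent.score(instruction, r))
    return ranked[-1], ranked[0]


def toy_preference(instruction: str, response: str) -> float:
    # Toy proxy for a learned RM: longer responses score higher, capped at 1.0.
    return min(len(response) / 100.0, 1.0)


def toy_instruction_following(instruction: str, response: str) -> float:
    # Toy verifiable constraint: the response must contain at most 10 words.
    return 1.0 if len(response.split()) <= 10 else 0.0
```

In this toy setup, a preference-only scorer would favor a long answer that violates a "10 words or fewer" constraint, while the combined score favors the answer that satisfies it. In practice the verifiers would be real factuality and instruction-following checkers rather than the placeholders above.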
