Expanding RL with Verifiable Rewards Across Diverse Domains
March 31, 2025
Authors: Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu
cs.AI
Abstract
Reinforcement learning (RL) with verifiable rewards (RLVR) has shown
promising results in mathematical reasoning and coding tasks where
well-structured reference answers are available. However, its applicability to
broader domains remains underexplored. In this work, we study the extension of
RLVR to more diverse domains such as medicine, chemistry, psychology, and
economics. We observe high agreement in binary judgments across different large
language models (LLMs) when objective reference answers exist, which challenges
the necessity of large-scale annotation for training domain-specific reward
models. To address the limitations of binary rewards when handling unstructured
reference answers, we further incorporate model-based soft scoring into RLVR to
improve its flexibility. Our experiments show that a distilled generative
reward model can serve as an effective cross-domain verifier, providing
reliable reward signals for RL without requiring domain-specific annotations.
By fine-tuning a base 7B model using various RL algorithms against our reward
model, we obtain policies that outperform state-of-the-art open-source aligned
LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large
margin, across domains in free-form answer settings. This also strengthens
RLVR's robustness and scalability, highlighting its potential for real-world
applications with noisy or weak labels.
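To make the model-based soft-scoring idea concrete, below is a minimal sketch of how a generative reward model could verify free-form answers and emit a reward in [0, 1] rather than a hard binary judgment. This is not the authors' released code: the judge checkpoint name, the prompt template, and the 0-to-10 score extraction are illustrative assumptions.

```python
# Sketch of a generative soft verifier for RLVR-style training.
# Model name, prompt format, and score scale are assumptions, not the paper's artifacts.
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_PROMPT = (
    "You are a verifier. Given a question, a reference answer, and a candidate "
    "answer, rate how well the candidate matches the reference on a scale from "
    "0 to 10. Respond with only the number.\n\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Score:"
)


class GenerativeSoftVerifier:
    """Scores free-form answers with a (distilled) generative judge model."""

    def __init__(self, model_name: str = "judge-model-7b"):  # hypothetical checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def soft_reward(self, question: str, reference: str, candidate: str) -> float:
        """Return a soft reward in [0, 1] instead of a hard 0/1 verification."""
        prompt = JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate
        )
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output = self.model.generate(**inputs, max_new_tokens=4)
        # Decode only the newly generated tokens and parse the numeric score.
        generated = output[0][inputs["input_ids"].shape[1]:]
        text = self.tokenizer.decode(generated, skip_special_tokens=True)
        match = re.search(r"\d+(\.\d+)?", text)
        score = float(match.group()) if match else 0.0
        return max(0.0, min(score / 10.0, 1.0))  # normalize to [0, 1]
```

In an RL fine-tuning loop, a function like `soft_reward` would supply the per-response reward signal for free-form answers, in place of exact-match or binary verification against a structured reference.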