Expanding RL with Verifiable Rewards Across Diverse Domains
March 31, 2025
Authors: Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu
cs.AI
Abstract
Reinforcement learning (RL) with verifiable rewards (RLVR) has shown
promising results in mathematical reasoning and coding tasks where
well-structured reference answers are available. However, its applicability to
broader domains remains underexplored. In this work, we study the extension of
RLVR to more diverse domains such as medicine, chemistry, psychology, and
economics. We observe high agreement in binary judgments across different large
language models (LLMs) when objective reference answers exist, which challenges
the necessity of large-scale annotation for training domain-specific reward
models. To address the limitations of binary rewards when handling unstructured
reference answers, we further incorporate model-based soft scoring into RLVR to
improve its flexibility. Our experiments show that a distilled generative
reward model can serve as an effective cross-domain verifier, providing
reliable reward signals for RL without requiring domain-specific annotations.
By fine-tuning a base 7B model using various RL algorithms against our reward
model, we obtain policies that outperform state-of-the-art open-source aligned
LLMs such as Qwen2.5-72B-Instruct and DeepSeek-R1-Distill-Qwen-32B by a large
margin, across domains in free-form answer settings. This also strengthens
RLVR's robustness and scalability, highlighting its potential for real-world
applications with noisy or weak labels.
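To make the model-based soft-scoring idea concrete, below is a minimal sketch of how a generative reward model could verify free-form answers and emit a reward in [0, 1] rather than a hard binary judgment. This is not the authors' released code: the judge checkpoint name, the prompt template, and the 0-to-10 score extraction are illustrative assumptions.

```python
# Sketch of a generative soft verifier for RLVR-style training.
# Model name, prompt format, and score scale are assumptions, not the paper's artifacts.
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_PROMPT = (
    "You are a verifier. Given a question, a reference answer, and a candidate "
    "answer, rate how well the candidate matches the reference on a scale from "
    "0 to 10. Respond with only the number.\n\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Score:"
)


class GenerativeSoftVerifier:
    """Scores free-form answers with a (distilled) generative judge model."""

    def __init__(self, model_name: str = "judge-model-7b"):  # hypothetical checkpoint
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)

    def soft_reward(self, question: str, reference: str, candidate: str) -> float:
        """Return a soft reward in [0, 1] instead of a hard 0/1 verification."""
        prompt = JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate
        )
        inputs = self.tokenizer(prompt, return_tensors="pt")
        output = self.model.generate(**inputs, max_new_tokens=4)
        # Decode only the newly generated tokens and parse the numeric score.
        generated = output[0][inputs["input_ids"].shape[1]:]
        text = self.tokenizer.decode(generated, skip_special_tokens=True)
        match = re.search(r"\d+(\.\d+)?", text)
        score = float(match.group()) if match else 0.0
        return max(0.0, min(score / 10.0, 1.0))  # normalize to [0, 1]
```

In an RL fine-tuning loop, a function like `soft_reward` would supply the per-response reward signal for free-form answers, in place of exact-match or binary verification against a structured reference.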