JudgeLRM: Large Reasoning Models as a Judge
March 31, 2025
Authors: Nuo Chen, Zhiyuan Hu, Qingyun Zou, Jiaying Wu, Qian Wang, Bryan Hooi, Bingsheng He
cs.AI
Abstract
The rise of Large Language Models (LLMs) as evaluators offers a scalable
alternative to human annotation, yet existing Supervised Fine-Tuning (SFT)
approaches for judges often fall short in domains requiring complex reasoning. In
this work, we investigate whether LLM judges truly benefit from enhanced
reasoning capabilities. Through a detailed analysis of reasoning requirements
across evaluation tasks, we reveal a negative correlation between SFT
performance gains and the proportion of reasoning-demanding samples -
highlighting the limitations of SFT in such scenarios. To address this, we
introduce JudgeLRM, a family of judgment-oriented LLMs trained using
reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM
models consistently outperform both SFT-tuned and state-of-the-art reasoning
models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms
DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks
requiring deep reasoning.
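The abstract describes rewards that score the judge's final verdict rather than imitating reference rationales. As a rough illustration, the sketch below shows one way a judge-wise, outcome-driven reward could be computed for a pairwise comparison; the `<answer>` tag convention, the `parse_judge_scores` helper, and the specific reward values are assumptions for exposition, not details taken from the paper.

```python
# Illustrative sketch only: the paper does not specify this exact reward,
# so the scheme below (agreement with a gold preference plus a small
# format bonus) is an assumption for exposition.

import re
from typing import Optional, Tuple


def parse_judge_scores(completion: str) -> Optional[Tuple[float, float]]:
    """Extract the two scores the judge assigns, e.g. '<answer>7 4</answer>'.

    The <answer> tag format is a hypothetical convention, not taken from
    the paper.
    """
    match = re.search(r"<answer>\s*(\d+)\s+(\d+)\s*</answer>", completion)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def outcome_reward(completion: str, gold_preference: int) -> float:
    """Judge-wise, outcome-driven reward (toy version).

    gold_preference: 0 if the first response is better, 1 if the second is.
    Returns 1.0 when the judge's ranking matches the gold preference,
    plus a small bonus for emitting a parseable answer at all.
    """
    scores = parse_judge_scores(completion)
    if scores is None:
        return -1.0  # unparseable output is penalized
    s1, s2 = scores
    predicted = 0 if s1 > s2 else 1
    format_bonus = 0.1
    return (1.0 if predicted == gold_preference else 0.0) + format_bonus


# Example: the judge prefers the second response, and the gold label agrees.
print(outcome_reward("<think>...</think><answer>3 8</answer>", gold_preference=1))
```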