

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

April 18, 2025
Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
cs.AI

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning capabilities of LLMs, particularly in mathematics and programming tasks. It is widely believed that RLVR enables LLMs to continuously self-improve, thus acquiring novel reasoning abilities that exceed the corresponding base models' capacity. In this study, however, we critically re-examine this assumption by measuring the pass@k metric at large values of k to explore the reasoning capability boundary of models across a wide range of model families and benchmarks. Surprisingly, RL does not, in fact, elicit fundamentally new reasoning patterns. While RL-trained models outperform their base models at small values of k (e.g., k=1), base models achieve comparable or even higher pass@k scores than their RL counterparts at large k. The reasoning paths generated by RL-trained models are already included in the base models' sampling distribution, suggesting that most reasoning abilities manifested in RL-trained models are already present in the base models. Further analysis shows that RL training boosts performance by biasing the model's output distribution toward paths that are more likely to yield rewards, thereby sampling correct responses more efficiently, but this also results in a narrower reasoning capability boundary compared to the base models. Similar results are observed in visual reasoning tasks trained with RLVR. Moreover, we find that distillation, unlike RLVR, can genuinely introduce new knowledge into the model. These findings underscore a critical limitation of RLVR in advancing LLM reasoning abilities, requiring us to fundamentally rethink the impact of RL training on reasoning LLMs and the need for a better paradigm. Project Page: https://limit-of-RLVR.github.io
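
The comparison above hinges on the pass@k metric: the probability that at least one of k sampled responses solves a problem. Below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021), computed from n samples per problem of which c are correct; the function name and the toy numbers are illustrative assumptions, not the authors' evaluation harness.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k (Chen et al., 2021): probability that at
    least one of k responses, drawn without replacement from n samples of
    which c are correct, solves the problem."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: success is guaranteed
    # 1 - C(n-c, k) / C(n, k), evaluated in a numerically stable product form
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Hypothetical single problem: a model that is rarely correct per sample
# still approaches pass@k = 1 once k is large, which is how large-k
# evaluation probes the reasoning boundary rather than single-shot accuracy.
n = 256
for k in (1, 16, 256):
    print(f"k={k:3d}  pass@k={pass_at_k(n, c=4, k=k):.3f}")
```

Per-problem pass@k is then averaged over a benchmark: at k=1 it reduces to ordinary sampling accuracy, while at large k it measures whether any sampled path is correct, which is the capability boundary the paper probes.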

