Learning to Reason under Off-Policy Guidance
April 21, 2025
Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
cs.AI
Abstract
Recent advances in large reasoning models (LRMs) demonstrate that
sophisticated behaviors such as multi-step reasoning and self-reflection can
emerge via reinforcement learning (RL) with simple rule-based rewards. However,
existing zero-RL approaches are inherently "on-policy", limiting learning to
a model's own outputs and failing to acquire reasoning abilities beyond its
initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY
guidance), a framework that augments zero-RL with off-policy reasoning traces.
LUFFY dynamically balances imitation and exploration by combining off-policy
demonstrations with on-policy rollouts during training. Notably, we propose
policy shaping via regularized importance sampling to avoid superficial and
rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an
average gain of over +7.0 points across six math benchmarks and an advantage of
over +6.2 points on out-of-distribution tasks. It also substantially surpasses
imitation-based supervised fine-tuning (SFT), particularly in generalization.
Analysis shows LUFFY not only imitates effectively but also explores beyond
demonstrations, offering a scalable path to train generalizable reasoning
models with off-policy guidance.
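
To make the abstract's "policy shaping via regularized importance sampling" concrete, the sketch below shows one way a mixed-policy loss could be assembled: on-policy rollout tokens keep the standard importance ratio, while off-policy demonstration tokens are reweighted with a regularized ratio so that low-probability demonstration tokens still receive usable gradient. This is a minimal sketch under stated assumptions, not the paper's implementation; the shaping function f(r) = r / (r + γ), the tensor names, and the probability-1 treatment of demonstration tokens are illustrative choices not taken from the abstract.

```python
# Minimal, illustrative sketch of mixed-policy training with a regularized
# importance ratio. NOT the paper's reference implementation; the shaping
# function f(r) = r / (r + gamma) and the handling of demonstration tokens
# are assumptions made for exposition.

import torch


def shaped_ratio(logp_new: torch.Tensor,
                 logp_behavior: torch.Tensor,
                 gamma: float = 0.1) -> torch.Tensor:
    """Regularized importance ratio f(r) = r / (r + gamma).

    Compared with the plain ratio r, this keeps a non-vanishing gradient on
    low-probability demonstration tokens, one way to discourage the
    "superficial and rigid imitation" the abstract warns about.
    """
    ratio = torch.exp(logp_new - logp_behavior)
    return ratio / (ratio + gamma)


def mixed_policy_loss(logp_new: torch.Tensor,       # log pi_theta(a_t | s_t), shape [T]
                      logp_behavior: torch.Tensor,  # log-prob under the policy that produced the token
                      advantages: torch.Tensor,     # per-token advantage (e.g. from a rule-based reward)
                      is_off_policy: torch.Tensor,  # bool mask: True where the token comes from a demo
                      gamma: float = 0.1) -> torch.Tensor:
    """Policy-gradient loss over a batch mixing on-policy rollouts and off-policy demos."""
    plain_ratio = torch.exp(logp_new - logp_behavior)         # standard ratio for rollout tokens
    reg_ratio = shaped_ratio(logp_new, logp_behavior, gamma)  # shaped ratio for demonstration tokens
    weight = torch.where(is_off_policy, reg_ratio, plain_ratio)
    # Maximize the weighted advantage, i.e. minimize its negation.
    return -(weight * advantages).mean()


if __name__ == "__main__":
    T = 8
    logp_new = -3.0 * torch.rand(T)        # toy current-policy log-probs
    logp_behavior = -3.0 * torch.rand(T)   # toy behavior-policy log-probs
    is_off_policy = torch.tensor([True] * 4 + [False] * 4)
    # If the demonstrating model's probabilities are unavailable, one option is
    # to treat demo tokens as if they had behavior probability 1 (log-prob 0).
    logp_behavior = torch.where(is_off_policy, torch.zeros(T), logp_behavior)
    advantages = torch.randn(T)
    print(mixed_policy_loss(logp_new, logp_behavior, advantages, is_off_policy).item())
```

The masking-based design keeps a single loss over the whole batch, so the balance between imitation (demonstration tokens) and exploration (rollout tokens) is controlled simply by how many of each are sampled per step; the shaping strength γ here is a free hyperparameter of the sketch.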