

Learning to Reason under Off-Policy Guidance

April 21, 2025
Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
cs.AI

Abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.
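The abstract names two mechanisms: mixing off-policy demonstrations with the model's own on-policy rollouts, and reshaping importance weights so that imitation of the demonstrations does not collapse into rigid copying. Below is a minimal PyTorch sketch of how such a mixed-policy objective could be wired together. It is an illustration under assumptions, not the paper's exact formulation: the function name mixed_policy_loss, the clipped PPO/GRPO-style surrogate on the on-policy branch, and the saturating transform prob / (prob + gamma) on the off-policy branch are all placeholders chosen to convey the idea.

```python
import torch

def mixed_policy_loss(logp_current, logp_behavior, advantages, off_policy_mask,
                      clip_eps=0.2, gamma=0.1):
    """Illustrative loss combining on-policy rollouts with off-policy traces.

    logp_current:    token log-probs under the policy being trained
    logp_behavior:   token log-probs under the policy that generated the data
                     (the old policy for rollouts; a stronger external model
                     for off-policy demonstrations)
    advantages:      advantage estimates broadcast over tokens
    off_policy_mask: 1.0 for tokens from off-policy demonstrations,
                     0.0 for tokens from the model's own rollouts
    """
    # Importance ratio between the current policy and the data-generating policy.
    ratio = torch.exp(logp_current - logp_behavior)

    # On-policy branch: standard clipped surrogate objective.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    on_policy_obj = torch.min(ratio * advantages, clipped * advantages)

    # Off-policy branch: a placeholder "policy shaping" of the importance
    # weight. The saturating transform keeps gradient flowing to demonstration
    # tokens the current policy still assigns low probability, rather than
    # letting a vanishing weight suppress them. (Assumed form, not the paper's.)
    prob_current = torch.exp(logp_current)
    shaped_weight = prob_current / (prob_current + gamma)
    off_policy_obj = shaped_weight * advantages

    # Route each token through the branch matching its origin.
    objective = torch.where(off_policy_mask.bool(), off_policy_obj, on_policy_obj)
    return -objective.mean()
```

The design intent sketched here is that hard-to-imitate demonstration tokens still contribute a useful gradient, which is what lets off-policy guidance teach new reasoning behaviors instead of being memorized superficially.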
