

Learning to Reason under Off-Policy Guidance

April 21, 2025
Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
cs.AI

Abstract

Recent advances in large reasoning models (LRMs) demonstrate that sophisticated behaviors such as multi-step reasoning and self-reflection can emerge via reinforcement learning (RL) with simple rule-based rewards. However, existing zero-RL approaches are inherently "on-policy", limiting learning to a model's own outputs and failing to acquire reasoning abilities beyond its initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY guidance), a framework that augments zero-RL with off-policy reasoning traces. LUFFY dynamically balances imitation and exploration by combining off-policy demonstrations with on-policy rollouts during training. Notably, we propose policy shaping via regularized importance sampling to avoid superficial and rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an average gain of over +7.0 points across six math benchmarks and an advantage of over +6.2 points on out-of-distribution tasks. It also substantially surpasses imitation-based supervised fine-tuning (SFT), particularly in generalization. Analysis shows LUFFY not only imitates effectively but also explores beyond demonstrations, offering a scalable path to train generalizable reasoning models with off-policy guidance.
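The abstract names two mechanisms: mixing off-policy demonstrations with the model's own on-policy rollouts, and reshaping importance weights so that imitation of the demonstrations does not collapse into rigid copying. Below is a minimal PyTorch sketch of how such a mixed-policy objective could be wired together. It is an illustration under assumptions, not the paper's exact formulation: the function name mixed_policy_loss, the clipped PPO/GRPO-style surrogate on the on-policy branch, and the saturating transform prob / (prob + gamma) on the off-policy branch are all placeholders chosen to convey the idea.

```python
import torch

def mixed_policy_loss(logp_current, logp_behavior, advantages, off_policy_mask,
                      clip_eps=0.2, gamma=0.1):
    """Illustrative loss combining on-policy rollouts with off-policy traces.

    logp_current:    token log-probs under the policy being trained
    logp_behavior:   token log-probs under the policy that generated the data
                     (the old policy for rollouts; a stronger external model
                     for off-policy demonstrations)
    advantages:      advantage estimates broadcast over tokens
    off_policy_mask: 1.0 for tokens from off-policy demonstrations,
                     0.0 for tokens from the model's own rollouts
    """
    # Importance ratio between the current policy and the data-generating policy.
    ratio = torch.exp(logp_current - logp_behavior)

    # On-policy branch: standard clipped surrogate objective.
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    on_policy_obj = torch.min(ratio * advantages, clipped * advantages)

    # Off-policy branch: a placeholder "policy shaping" of the importance
    # weight. The saturating transform keeps gradient flowing to demonstration
    # tokens the current policy still assigns low probability, rather than
    # letting a vanishing weight suppress them. (Assumed form, not the paper's.)
    prob_current = torch.exp(logp_current)
    shaped_weight = prob_current / (prob_current + gamma)
    off_policy_obj = shaped_weight * advantages

    # Route each token through the branch matching its origin.
    objective = torch.where(off_policy_mask.bool(), off_policy_obj, on_policy_obj)
    return -objective.mean()
```

The design intent sketched here is that hard-to-imitate demonstration tokens still contribute a useful gradient, which is what lets off-policy guidance teach new reasoning behaviors instead of being memorized superficially.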
