Learning to Reason under Off-Policy Guidance
April 21, 2025
Authors: Jianhao Yan, Yafu Li, Zican Hu, Zhi Wang, Ganqu Cui, Xiaoye Qu, Yu Cheng, Yue Zhang
cs.AI
Abstract
Recent advances in large reasoning models (LRMs) demonstrate that
sophisticated behaviors such as multi-step reasoning and self-reflection can
emerge via reinforcement learning (RL) with simple rule-based rewards. However,
existing zero-RL approaches are inherently "on-policy", limiting learning to
a model's own outputs and failing to acquire reasoning abilities beyond its
initial capabilities. We introduce LUFFY (Learning to reason Under oFF-policY
guidance), a framework that augments zero-RL with off-policy reasoning traces.
LUFFY dynamically balances imitation and exploration by combining off-policy
demonstrations with on-policy rollouts during training. Notably, we propose
policy shaping via regularized importance sampling to avoid superficial and
rigid imitation during mixed-policy training. Remarkably, LUFFY achieves an
over +7.0 average gain across six math benchmarks and an advantage of over +6.2
points in out-of-distribution tasks. It also substantially surpasses
imitation-based supervised fine-tuning (SFT), particularly in generalization.
Analysis shows LUFFY not only imitates effectively but also explores beyond
demonstrations, offering a scalable path to train generalizable reasoning
models with off-policy guidance.Summary
AI-Generated Summary
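The abstract describes mixing off-policy demonstration traces with on-policy rollouts and shaping the off-policy importance weights so that low-probability demonstration tokens still provide a useful learning signal rather than being rigidly imitated or ignored. The sketch below illustrates one way a regularized importance weight could enter a mixed-policy policy-gradient loss; the function names (`shaped_ratio`, `mixed_policy_loss`), the transform r/(r+gamma), and the parameter `gamma` are illustrative assumptions and not the paper's exact formulation.

```python
import torch

def shaped_ratio(logp_new, logp_old, gamma=0.1):
    """Hypothetical policy-shaping transform: squash the importance ratio
    with r / (r + gamma) so very unlikely off-policy (demonstration) tokens
    still contribute a non-vanishing gradient."""
    ratio = torch.exp(logp_new - logp_old)
    return ratio / (ratio + gamma)

def mixed_policy_loss(logp_new, logp_old, advantages, is_off_policy, gamma=0.1):
    """Toy mixed-policy policy-gradient loss (not the authors' implementation).

    logp_new      : token log-probs under the current policy
    logp_old      : token log-probs under the behavior policy that produced
                    the sample (rollout policy or demonstration source)
    advantages    : per-token advantages (broadcast from sequence rewards)
    is_off_policy : boolean mask marking off-policy demonstration tokens
    """
    plain_ratio = torch.exp(logp_new - logp_old)
    reg_ratio = shaped_ratio(logp_new, logp_old, gamma)
    # Apply the regularized ratio only on off-policy demonstration tokens;
    # on-policy rollout tokens keep the ordinary importance ratio.
    ratio = torch.where(is_off_policy, reg_ratio, plain_ratio)
    return -(ratio * advantages).mean()

# Toy usage with random token-level statistics.
torch.manual_seed(0)
logp_new = torch.log(torch.rand(8).clamp_min(1e-4))
logp_old = torch.log(torch.rand(8).clamp_min(1e-4))
adv = torch.randn(8)
mask = torch.tensor([True, True, False, False, True, False, True, False])
print(mixed_policy_loss(logp_new, logp_old, adv, mask).item())
```

The design intent, as suggested by the abstract, is that shaping the off-policy weights keeps gradients flowing from demonstration tokens the current policy assigns low probability to, which is what counters "superficial and rigid imitation" during mixed-policy training.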