SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
January 28, 2025
Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, Yi Ma
cs.AI
Abstract
Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.
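The abstract contrasts SFT's supervised objective with RL trained on an outcome-based reward. As a minimal, hypothetical sketch of what such a reward could look like for a GeneralPoints-style episode: the function name `outcome_reward`, the binary 0/1 scoring, and the fixed target of 24 are assumptions made for illustration, not the authors' implementation, which may differ (for example, by also penalizing formatting errors).

```python
# Hypothetical sketch, not the paper's code: a binary outcome-based reward for a
# GeneralPoints-style episode, where the model must combine the dealt card values
# into an arithmetic expression that evaluates to a target number (24 here).
import ast
import operator

# Allowed binary operators for the arithmetic expression.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mul: operator.mul, ast.Div: operator.truediv}

def _safe_eval(node: ast.AST) -> float:
    """Evaluate a parsed expression restricted to numbers and + - * /."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    raise ValueError("disallowed expression")

def outcome_reward(expression: str, cards: list[int], target: int = 24) -> float:
    """Reward 1.0 only if the expression uses exactly the dealt cards and hits the target."""
    try:
        tree = ast.parse(expression, mode="eval")
        value = _safe_eval(tree.body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # unparsable or illegal output earns no reward
    used = sorted(int(n.value) for n in ast.walk(tree.body)
                  if isinstance(n, ast.Constant))
    if used != sorted(cards):
        return 0.0  # each dealt card must appear exactly once
    return 1.0 if abs(value - target) < 1e-6 else 0.0

# Example: outcome_reward("(10 - 4) * (2 + 2)", [10, 4, 2, 2]) returns 1.0,
# while outcome_reward("10 + 4 + 2 + 2", [10, 4, 2, 2]) returns 0.0.
```

The point of the sketch is that the reward scores only the final outcome of the model's answer, in contrast to SFT, which imitates reference solutions token by token.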