SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models

April 10, 2025
Authors: Hardy Chen, Haoqin Tu, Fali Wang, Hui Liu, Xianfeng Tang, Xinya Du, Yuyin Zhou, Cihang Xie
cs.AI

Abstract

This work revisits the dominant supervised fine-tuning (SFT) then reinforcement learning (RL) paradigm for training Large Vision-Language Models (LVLMs), and reveals a key finding: SFT can significantly undermine subsequent RL by inducing "pseudo reasoning paths" imitated from expert models. While these paths may resemble the native reasoning paths of RL models, they often involve prolonged, hesitant, less informative steps, as well as incorrect reasoning. To systematically study this effect, we introduce VLAA-Thinking, a new multimodal dataset designed to support reasoning in LVLMs. Constructed via a six-step pipeline involving captioning, reasoning distillation, answer rewriting, and verification, VLAA-Thinking comprises high-quality, step-by-step visual reasoning traces for SFT, along with a more challenging RL split from the same data source. Using this dataset, we conduct extensive experiments comparing SFT, RL, and their combinations. Results show that while SFT helps models learn reasoning formats, it often locks aligned models into imitative, rigid reasoning modes that impede further learning. In contrast, building on Group Relative Policy Optimization (GRPO) with a novel mixed reward module integrating both perception and cognition signals, our RL approach fosters more genuine, adaptive reasoning behavior. Notably, our model VLAA-Thinker, based on Qwen2.5VL 3B, achieves top-1 performance on the Open LMM Reasoning Leaderboard (https://huggingface.co/spaces/opencompass/Open_LMM_Reasoning_Leaderboard) among 4B-scale LVLMs, surpassing the previous state of the art by 1.8%. We hope our findings provide valuable insights into developing reasoning-capable LVLMs and can inform future research in this area.
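The RL recipe described in the abstract pairs GRPO's group-relative advantages with a mixed reward that blends a perception signal (e.g., answer or grounding correctness) with a cognition signal (e.g., reasoning-format or consistency checks). The sketch below illustrates that idea only; the reward weights, helper names, and scoring functions are assumptions for illustration, not the authors' implementation, and the surrogate loss omits the PPO-style clipping and KL penalty a full GRPO trainer would use.

```python
# Hedged sketch (not the authors' code): a GRPO-style update step with a mixed
# reward combining perception and cognition signals. Weights and helper names
# (mixed_reward, perception_score, cognition_score) are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List
import math
import random


@dataclass
class Rollout:
    response: str    # sampled reasoning trace + final answer
    logprob: float   # sum of token log-probs under the current policy


def mixed_reward(response: str,
                 perception_score: Callable[[str], float],
                 cognition_score: Callable[[str], float],
                 w_percept: float = 0.5,
                 w_cognit: float = 0.5) -> float:
    """Blend a perception signal and a cognition signal into one scalar reward."""
    return w_percept * perception_score(response) + w_cognit * cognition_score(response)


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages: normalize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards)) or 1.0
    return [(r - mean) / std for r in rewards]


def grpo_surrogate_loss(group: List[Rollout], rewards: List[float]) -> float:
    """Simplified policy-gradient surrogate: -mean(advantage * logprob)."""
    advs = grpo_advantages(rewards)
    return -sum(a * r.logprob for a, r in zip(advs, group)) / len(group)


if __name__ == "__main__":
    # Toy usage: score a group of sampled responses for one image-question pair.
    group = [Rollout(f"response {i}", logprob=-random.uniform(5, 20)) for i in range(4)]
    rewards = [mixed_reward(r.response,
                            perception_score=lambda s: random.random(),  # stand-in scorer
                            cognition_score=lambda s: random.random())   # stand-in scorer
               for r in group]
    print("group rewards:", rewards)
    print("surrogate loss:", grpo_surrogate_loss(group, rewards))
```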
