VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
April 10, 2025
Authors: Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, Tiancheng Zhao
cs.AI
Abstract
Recently, DeepSeek R1 has shown that reinforcement learning (RL) can
substantially improve the reasoning capabilities of Large Language Models
(LLMs) through a simple yet effective design. The core of R1 lies in its
rule-based reward formulation, which leverages tasks with deterministic
ground-truth answers to enable precise and stable reward computation. In the
visual domain, we similarly observe that a wide range of visual understanding
tasks are inherently equipped with well-defined ground-truth annotations. This
property makes them naturally compatible with rule-based reward mechanisms.
Motivated by this observation, we investigate the extension of R1-style
reinforcement learning to Vision-Language Models (VLMs), aiming to enhance
their visual reasoning capabilities. To this end, we develop VLM-R1, a
dedicated framework designed to harness RL for improving VLMs' performance on
general vision-language tasks. Using this framework, we further explore the
feasibility of applying RL to the visual domain. Experimental results indicate that
the RL-based model not only delivers competitive performance on visual
understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in
generalization ability. Furthermore, we conduct comprehensive ablation studies
that uncover a series of noteworthy insights, including the presence of reward
hacking in object detection, the emergence of the "OD aha moment", the impact
of training data quality, and the scaling behavior of RL across different model
sizes. Through these analyses, we aim to deepen the understanding of how
reinforcement learning enhances the capabilities of vision-language models, and
we hope our findings and open-source contributions will support continued
progress in the vision-language RL community. Our code and model are available
at https://github.com/om-ai-lab/VLM-R1.
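As a concrete illustration of the rule-based reward idea described in the abstract, the sketch below shows what such rewards can look like for a grounding or object-detection style task: a format reward that checks for a <think>...</think><answer>...</answer> template, and an accuracy reward computed as the IoU between the parsed prediction and the deterministic ground-truth box. This is a minimal sketch under assumed conventions; the function names, parsing rules, and reward combination are illustrative and not taken from the VLM-R1 codebase.

```python
import re

def iou(box_a, box_b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def format_reward(completion):
    # 1.0 if the completion follows the <think>...</think><answer>...</answer> template, else 0.0.
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion, gt_box):
    # IoU between the box parsed from <answer> and the ground-truth box (assumed answer
    # format: a [x1, y1, x2, y2] list; this parsing convention is hypothetical).
    match = re.search(r"<answer>\s*\[([\d\.,\s]+)\]\s*</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    try:
        pred_box = [float(v) for v in match.group(1).split(",")]
    except ValueError:
        return 0.0
    if len(pred_box) != 4:
        return 0.0
    return iou(pred_box, gt_box)

# Example: score one rollout against a deterministic ground-truth annotation.
rollout = "<think>The dog is on the left.</think><answer>[12, 40, 118, 200]</answer>"
print(format_reward(rollout) + accuracy_reward(rollout, [10, 42, 120, 198]))
```

Because both terms are computed directly from verifiable annotations rather than a learned reward model, the resulting signal is deterministic and cheap to evaluate, which is the property the abstract highlights as making visual tasks naturally compatible with R1-style RL.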