Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
March 9, 2025
Authors: Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Yao Hu, Shaohui Lin
cs.AI
Abstract
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in Large Language Models (LLMs) purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of Multimodal Large Language Models (MLLMs). However, direct training with RL struggles to activate complex reasoning capabilities in MLLMs, such as questioning and reflection, due to the absence of substantial high-quality multimodal reasoning data. To
address this issue, we propose the reasoning MLLM, Vision-R1, to improve
multimodal reasoning capability. Specifically, we first construct a
high-quality multimodal Chain-of-Thought (CoT) dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering, obtaining a 200K multimodal CoT dataset, the Vision-R1-cold dataset, which serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after the cold start, we propose a Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with a hard-formatting result reward function to
gradually refine the model's ability to learn correct and complex reasoning
processes on a 10K multimodal math dataset. Comprehensive experiments show that our model achieves an average improvement of ~6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves 73.5% accuracy on the widely used MathVista benchmark, only 0.4% lower than the leading reasoning model, OpenAI O1. The datasets and code will be released at: https://github.com/Osilly/Vision-R1.
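
For context, the GRPO objective referenced above is not reproduced in the abstract. In its standard formulation (from the original GRPO proposal), PPO's learned value baseline is replaced by a group-relative one: for each question $q$, a group of $G$ responses $\{o_i\}$ is sampled from the old policy, and each response's advantage is its reward normalized within the group:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}_{q,\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\Big(\rho_i A_i,\ \operatorname{clip}\!\big(\rho_i,\,1-\varepsilon,\,1+\varepsilon\big)\,A_i\Big)-\beta\,\mathbb{D}_{\mathrm{KL}}\!\left(\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\right)\right],
$$

where $\rho_i=\dfrac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}$ and $A_i=\dfrac{r_i-\operatorname{mean}(\{r_1,\dots,r_G\})}{\operatorname{std}(\{r_1,\dots,r_G\})}$.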
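The "hard-formatting result reward" itself is only named in the abstract. A minimal sketch of what such an all-or-nothing reward might look like is below, assuming DeepSeek-R1-style `<think>`/`<answer>` tags; the tag names, the regex, and the `normalize` helper are assumptions for illustration, not the paper's implementation:

```python
import re

# Assumed output template: "<think>...</think> <answer>...</answer>".
# The tag names and the all-or-nothing scoring rule are assumptions made
# for illustration; the paper only names a "hard formatting result reward".
FORMAT_RE = re.compile(
    r"\A<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\Z",
    re.DOTALL,
)

def normalize(answer: str) -> str:
    """Crude answer normalization (hypothetical helper)."""
    return answer.strip().rstrip(".").lower()

def hard_format_result_reward(completion: str, ground_truth: str) -> float:
    """Return 1.0 only when the completion both follows the required
    format exactly and contains the correct final answer; else 0.0."""
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        return 0.0  # malformed output: no partial credit for formatting
    if normalize(match.group("answer")) != normalize(ground_truth):
        return 0.0  # well-formatted but wrong answer: still no reward
    return 1.0

# Rewards for a sampled group of rollouts would feed GRPO's
# group-normalized advantage computation.
group = [
    "<think>2 + 2 = 4</think> <answer>4</answer>",
    "The answer is 4.",  # correct answer, but violates the format
]
print([hard_format_result_reward(o, "4") for o in group])  # [1.0, 0.0]
```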
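Likewise, PTST is only named here. One way to read "progressive thinking suppression" is a staged cap on the model's reasoning length during RL that starts tight and is gradually relaxed; a toy sketch of such a schedule follows, with all stage counts and token budgets invented for illustration:

```python
def allowed_thinking_tokens(rl_stage: int) -> int:
    """Hypothetical PTST-style schedule: early RL stages hard-cap the
    length of the reasoning segment to suppress overthinking, and later
    stages progressively relax the cap. The stage count and token budgets
    are invented for illustration; the abstract does not specify them."""
    schedule = {1: 4096, 2: 8192, 3: 16384}
    return schedule.get(rl_stage, 16384)  # final stage: longest reasoning
```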