VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning
April 10, 2025
Authors: Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, Wenhu Chen
cs.AI
Abstract
Recently, slow-thinking systems like GPT-o1 and DeepSeek-R1 have demonstrated
great potential in solving challenging problems through explicit reflection.
They significantly outperform the best fast-thinking models, such as GPT-4o, on
various math and science benchmarks. However, their multimodal reasoning
capabilities remain on par with fast-thinking models. For instance, GPT-o1's
performance on benchmarks like MathVista, MathVerse, and MathVision is similar
to fast-thinking models. In this paper, we aim to enhance the slow-thinking
capabilities of vision-language models using reinforcement learning (without
relying on distillation) to advance the state of the art. First, we adapt the
GRPO algorithm with a novel technique called Selective Sample Replay (SSR) to
address the vanishing advantages problem (see the sketch after the abstract).
While this approach yields strong performance, the resulting RL-trained models
exhibit limited self-reflection or self-verification. To further encourage
slow-thinking, we introduce Forced Rethinking, which appends a textual
rethinking trigger to the end of initial rollouts in RL training, explicitly
enforcing a self-reflection reasoning step (also sketched below).
By combining these two techniques, our model, VL-Rethinker, advances
state-of-the-art scores on MathVista, MathVerse, and MathVision to achieve
80.3%, 61.8%, and 43.9% respectively. VL-Rethinker also achieves open-source
SoTA on multi-disciplinary benchmarks such as MMMU-Pro, EMMA, and MEGA-Bench,
narrowing the gap with GPT-o1.
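The vanishing-advantages problem follows directly from GRPO's group-normalized advantage, A_i = (r_i − mean(r_1..r_G)) / std(r_1..r_G): when every rollout in a group earns the same reward (all correct or all wrong), the std is zero and all advantages vanish, so the group contributes no gradient. Below is a minimal Python sketch of this computation together with a hypothetical SSR buffer. The abstract does not specify SSR's internals, so the non-zero-advantage filtering and the |advantage|-weighted resampling shown here are assumptions.

```python
import random


def group_advantages(rewards):
    """GRPO-style group-normalized advantages.

    If every rollout in a group earns the same reward, the std is zero,
    all advantages vanish, and the group yields no learning signal.
    """
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


class SelectiveSampleReplay:
    """Hypothetical SSR buffer: retain only rollouts with non-zero
    advantage and mix them back into later updates, so batches are not
    dominated by zero-advantage groups."""

    def __init__(self, capacity=4096):
        self.buffer = []  # list of (rollout, advantage) pairs
        self.capacity = capacity

    def add(self, rollouts, advantages):
        for rollout, adv in zip(rollouts, advantages):
            if adv != 0.0:  # discard zero-advantage samples
                self.buffer.append((rollout, adv))
        self.buffer = self.buffer[-self.capacity:]  # keep most recent

    def sample(self, k):
        if not self.buffer:
            return []
        # Resample in proportion to |advantage| (an assumption, not
        # stated in the abstract); stored weights are strictly positive.
        weights = [abs(adv) for _, adv in self.buffer]
        return random.choices(self.buffer, weights=weights, k=k)
```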
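Forced Rethinking, as described above, appends a textual trigger to an initial rollout and lets the model continue, so the trajectory optimized by RL contains an explicit self-reflection step. The sketch below assumes a generic `model.generate(text, ...)` interface and an illustrative trigger phrase; the paper's actual trigger wording is not given in the abstract.

```python
# Illustrative trigger phrase; the actual wording is an assumption.
RETHINK_TRIGGER = "\nWait, let me re-examine my reasoning."


def forced_rethinking_rollout(model, prompt, max_new_tokens=512):
    """Sample an initial rollout, force-append a rethinking trigger,
    then let the model continue so the trajectory contains an explicit
    self-reflection step."""
    # 1) Sample an initial answer as in ordinary RL rollouts.
    initial = model.generate(prompt, max_new_tokens=max_new_tokens)
    # 2) Append the textual rethinking trigger to the initial answer.
    extended = prompt + initial + RETHINK_TRIGGER
    # 3) Continue decoding: the model is forced into a rethinking step.
    rethink = model.generate(extended, max_new_tokens=max_new_tokens)
    # The full trajectory (initial + trigger + rethink) is what the
    # RL objective scores and optimizes.
    return initial + RETHINK_TRIGGER + rethink
```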