Improved Visual-Spatial Reasoning via R1-Zero-Like Training

April 1, 2025
Authors: Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, Zhijie Deng
cs.AI

Abstract

Increasing attention has been placed on improving the reasoning capacities of multi-modal large language models (MLLMs). As the cornerstone for AI agents that function in the physical realm, video-based visual-spatial intelligence (VSI) emerges as one of the most pivotal reasoning capabilities of MLLMs. This work conducts a first, in-depth study on improving the visual-spatial reasoning of MLLMs via R1-Zero-like training. Technically, we first identify that the visual-spatial reasoning capacities of small- to medium-sized Qwen2-VL models cannot be activated via Chain of Thought (CoT) prompts. We then incorporate GRPO training for improved visual-spatial reasoning, using the carefully curated VSI-100k dataset, following DeepSeek-R1-Zero. During the investigation, we identify the necessity to keep the KL penalty (even with a small value) in GRPO. With just 120 GPU hours, our vsGRPO-2B model, fine-tuned from Qwen2-VL-2B, can outperform the base model by 12.1% and surpass GPT-4o. Moreover, our vsGRPO-7B model, fine-tuned from Qwen2-VL-7B, achieves performance comparable to that of the best open-source model LLaVA-NeXT-Video-72B. Additionally, we compare vsGRPO to supervised fine-tuning and direct preference optimization baselines and observe strong performance superiority. The code and dataset will be available soon.
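
The abstract says a small KL penalty must be kept in GRPO. The sketch below illustrates what that term looks like in a generic GRPO-style objective. It is not the authors' code. The function names, the `beta` value, the use of the population standard deviation, and the on-policy simplification (importance ratio fixed at 1, so no clipping) are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    GRPO replaces a learned value baseline with the group mean; dividing
    by the group standard deviation rescales the signal. (Population std
    is an assumption here; implementations differ on this detail.)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_estimate(logp_policy, logp_ref):
    """Per-token KL estimate r - log(r) - 1 with r = pi_ref / pi_policy.

    This estimator is non-negative and equals zero when both models
    assign the same log-probability to the token.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


def penalized_objective(advantage, logp_policy, logp_ref, beta=1e-4):
    """Per-token objective to maximize: advantage minus the scaled KL term.

    beta is a hypothetical small coefficient; setting it to 0 would drop
    the penalty entirely.
    """
    return advantage - beta * kl_estimate(logp_policy, logp_ref)


if __name__ == "__main__":
    # Four sampled answers to one question, rewarded 1 if correct else 0.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
    # The policy has drifted from the reference on this token.
    print(penalized_objective(1.0, logp_policy=-1.0, logp_ref=-2.0))
```

With `beta` greater than zero, any drift of the policy away from the reference model reduces the objective, which is how the penalty keeps the fine-tuned model close to its starting point.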
