VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

April 9, 2025
Authors: Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
cs.AI

Abstract

Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
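The abstract describes Reinforcement Fine-Tuning with GRPO driven by rule-based rewards on spatio-temporal perception tasks such as temporal grounding. As a rough illustration of that idea only (not the authors' implementation), the sketch below computes a simple IoU-based temporal-grounding reward with a format bonus and turns a group of sampled responses into GRPO-style group-relative advantages; all function names and weights are hypothetical.

```python
# Hedged sketch of a rule-based reward plus group-relative advantages,
# in the spirit of GRPO for temporal grounding. Names and weights are
# illustrative assumptions, not the paper's actual implementation.
from typing import List, Tuple


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """IoU between a predicted and a ground-truth time span (in seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def rule_based_reward(pred_span: Tuple[float, float],
                      gt_span: Tuple[float, float],
                      format_ok: bool) -> float:
    """Accuracy term (temporal IoU) plus a small bonus for well-formed output."""
    return temporal_iou(pred_span, gt_span) + (0.1 if format_ok else 0.0)


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one sampled group (mean/std), GRPO-style."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + 1e-6) for r in rewards]


if __name__ == "__main__":
    # One prompt, a group of four sampled responses with parsed time spans.
    gt = (12.0, 20.5)
    preds = [(11.5, 21.0), (0.0, 5.0), (13.0, 19.0), (12.0, 30.0)]
    rewards = [rule_based_reward(p, gt, format_ok=True) for p in preds]
    print(group_relative_advantages(rewards))
```

In this setup the policy update would weight each sampled response by its group-relative advantage, so responses with better-than-average localization (and valid formatting) are reinforced without a learned reward model.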
