VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

April 9, 2025
作者: Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, Limin Wang
cs.AI

Abstract

Recent advancements in reinforcement learning have significantly advanced the reasoning capabilities of multimodal large language models (MLLMs). While approaches such as Group Relative Policy Optimization (GRPO) and rule-based reward mechanisms demonstrate promise in text and image domains, their application to video understanding remains limited. This paper presents a systematic exploration of Reinforcement Fine-Tuning (RFT) with GRPO for video MLLMs, aiming to enhance spatio-temporal perception while maintaining general capabilities. Our experiments reveal that RFT is highly data-efficient for task-specific improvements. Through multi-task RFT on spatio-temporal perception objectives with limited samples, we develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal perception tasks without sacrificing chat ability, while exhibiting emerging spatio-temporal reasoning abilities. Compared to Qwen2.5-VL-7B, VideoChat-R1 boosts performance several-fold in tasks like temporal grounding (+31.8) and object tracking (+31.2). Additionally, it significantly improves on general QA benchmarks such as VideoMME (+0.9), MVBench (+1.0), and Perception Test (+0.9). Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs. We hope our work offers valuable insights for future RL research in video MLLMs.
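To make the abstract's key ingredients concrete, below is a minimal, illustrative sketch (not the authors' released code) of the kind of rule-based reward used for temporal grounding together with GRPO's group-relative advantage normalization. The function names, the 0.1 format bonus, and the example intervals are assumptions for illustration only.

# Minimal sketch, assuming an IoU-style rule-based reward for temporal grounding
# and GRPO-style group normalization; names and weights are illustrative, not the paper's.

from statistics import mean, stdev

def temporal_iou(pred, gt):
    """IoU between a predicted and a ground-truth [start, end] interval (seconds)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rule_based_reward(pred_interval, gt_interval, well_formatted):
    """Reward = IoU with the annotated segment, plus a small bonus for a parseable answer (assumed weighting)."""
    return temporal_iou(pred_interval, gt_interval) + (0.1 if well_formatted else 0.0)

def group_relative_advantages(rewards):
    """GRPO scores each sampled response against its own group: (r - mean) / std."""
    mu, sigma = mean(rewards), stdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Hypothetical example: 4 sampled answers to one grounding query (ground truth 12.0-18.5 s).
gt = (12.0, 18.5)
preds = [((11.5, 19.0), True), ((0.0, 5.0), True), ((12.0, 30.0), False), ((13.0, 18.0), True)]
rewards = [rule_based_reward(p, gt, ok) for p, ok in preds]
print(group_relative_advantages(rewards))

Responses whose predicted segment overlaps the annotation more tightly receive a higher group-normalized advantage, which is what steers the policy during RFT without any learned reward model.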
