MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
January 6, 2025
Authors: Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang
cs.AI
Abstract
In recent years, vision language models (VLMs) have made significant
advancements in video understanding. However, a crucial capability,
fine-grained motion comprehension, remains under-explored in current
benchmarks. To address this gap, we propose MotionBench, a comprehensive
evaluation benchmark designed to assess the fine-grained motion comprehension
of video understanding models. MotionBench evaluates models' motion-level
perception through six primary categories of motion-oriented question types and
includes data collected from diverse sources, ensuring a broad representation
of real-world video content. Experimental results reveal that existing VLMs
perform poorly in understanding fine-grained motions. To enhance the ability of VLMs
to perceive fine-grained motion within the limited sequence length of the LLM, we
conduct extensive experiments reviewing VLM architectures optimized for video
feature compression and propose a novel and efficient Through-Encoder (TE)
Fusion method. Experiments show that higher frame rate inputs and TE Fusion
yield improvements in motion understanding, yet there is still substantial room
for enhancement. Our benchmark aims to guide and motivate the development of
more capable video understanding models, emphasizing the importance of
fine-grained motion comprehension. Project page: https://motion-bench.github.io
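
The trade-off between sequence length and frame rate mentioned in the abstract can be illustrated with a rough back-of-the-envelope sketch. It assumes a fixed visual token budget and a generic temporal compression that merges every k frames into one frame's worth of tokens. The function names and all numbers below are hypothetical and are not taken from the paper; this is not a description of TE Fusion itself.

```python
def max_frames(token_budget: int, tokens_per_frame: int, compression_ratio: int) -> int:
    """Frames that fit in the budget when every `compression_ratio` frames
    are compressed into a single frame's worth of tokens (an assumed model)."""
    if min(token_budget, tokens_per_frame, compression_ratio) <= 0:
        raise ValueError("all arguments must be positive")
    # Number of compressed groups the budget can hold.
    groups = token_budget // tokens_per_frame
    # Each group stands for `compression_ratio` original frames.
    return groups * compression_ratio


def effective_fps(num_frames: int, clip_seconds: float) -> float:
    """Sampling rate obtained when `num_frames` are spread over the clip."""
    return num_frames / clip_seconds


# Hypothetical setting: 4096 visual tokens, 256 tokens per frame, 8-second clip.
for k in (1, 2, 4):
    frames = max_frames(4096, 256, k)
    print(f"compression x{k}: {frames} frames, {effective_fps(frames, 8.0):.1f} fps")
```

Under these assumed numbers, a 4x compression raises the affordable input from 16 to 64 frames for the same token budget, which is the kind of headroom that makes higher frame rate inputs feasible.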