MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

Abstract

In recent years, vision language models (VLMs) have made significant advances in video understanding. However, a crucial capability, fine-grained motion comprehension, remains under-explored in current benchmarks. To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models. MotionBench evaluates models' motion-level perception through six primary categories of motion-oriented question types and includes data collected from diverse sources, ensuring a broad representation of real-world video content. Experimental results reveal that existing VLMs perform poorly at understanding fine-grained motion. To enhance the ability of VLMs to perceive fine-grained motion within the limited sequence length of an LLM, we conduct extensive experiments reviewing VLM architectures optimized for video feature compression and propose a novel and efficient Through-Encoder (TE) Fusion method. Experiments show that higher frame rate inputs and TE Fusion both improve motion understanding, yet substantial room for improvement remains. Our benchmark aims to guide and motivate the development of more capable video understanding models, emphasizing the importance of fine-grained motion comprehension. Project page: https://motion-bench.github.io
