E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

September 26, 2024
Authors: Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
cs.AI

Abstract

Recent advances in Video Large Language Models (Video-LLMs) have demonstrated their great potential in general-purpose video understanding. To verify the significance of these models, a number of benchmarks have been proposed to diagnose their capabilities in different scenarios. However, existing benchmarks evaluate models only through video-level question-answering and lack fine-grained event-level assessment and task diversity. To fill this gap, we introduce E.T. Bench (Event-Level & Time-Sensitive Video Understanding Benchmark), a large-scale, high-quality benchmark for open-ended event-level video understanding. Organized under a 3-level task taxonomy, E.T. Bench encompasses 7.3K samples under 12 tasks with 7K videos (251.4h total length) across 8 domains, providing comprehensive evaluations. We extensively evaluated 8 Image-LLMs and 12 Video-LLMs on our benchmark, and the results reveal that state-of-the-art models for coarse-level (video-level) understanding struggle to solve our fine-grained tasks, e.g., grounding events of interest within videos, largely due to short video context length, improper time representations, and a lack of multi-event training data. Focusing on these issues, we further propose a strong baseline model, E.T. Chat, together with an instruction-tuning dataset, E.T. Instruct 164K, tailored for fine-grained event-level understanding. Our simple but effective solution demonstrates superior performance in multiple scenarios.
