

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

October 30, 2024
Authors: Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
cs.AI

Abstract

Existing benchmarks often highlight the remarkable performance achieved by state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal context for video understanding. However, how well do the models truly perform visual temporal reasoning? Our study of existing benchmarks shows that this capability of MFMs is likely overestimated as many questions can be solved by using a single, few, or out-of-order frames. To systematically examine current visual temporal reasoning tasks, we propose three principles with corresponding metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame Information Disparity. Following these principles, we introduce TOMATO, Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks (i.e., action count, direction, rotation, shape & trend, velocity & frequency, and visual cues), applied to 1,417 videos, including 805 self-recorded and -generated videos, that encompass human-centric, real-world, and simulated scenarios. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model. Moreover, our in-depth analysis uncovers more fundamental limitations beyond this gap in current MFMs. While they can accurately recognize events in isolated frames, they fail to interpret these frames as a continuous sequence. We believe TOMATO will serve as a crucial testbed for evaluating the next-generation MFMs and as a call to the community to develop AI systems capable of comprehending human world dynamics through the video modality.
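To make the diagnostic idea concrete (a question only tests temporal reasoning if ordered, multi-frame input is actually needed to answer it), here is a minimal, hypothetical sketch of how one might compare a model's accuracy under full-clip, single-frame, and shuffled-frame inputs. This is not the paper's metric definitions; the `model.answer` interface and the `questions` data layout are assumed placeholders, and the comparison is only a schematic illustration of the Multi-Frame Gain and Frame Order Sensitivity principles.

```python
import random

def accuracy(model, questions, frame_selector):
    """Score a model on multiple-choice questions, feeding it the frames
    chosen by frame_selector (full ordered clip, single frame, or shuffled)."""
    correct = 0
    for q in questions:
        frames = frame_selector(q["frames"])          # assumed: list of video frames
        if model.answer(frames, q["question"]) == q["answer"]:  # assumed model API
            correct += 1
    return correct / len(questions)

# Hypothetical frame-selection policies for probing temporal reasoning.
full_clip = lambda frames: frames                             # all frames, in order
single_frame = lambda frames: [frames[len(frames) // 2]]      # one middle frame only
shuffled = lambda frames: random.sample(frames, len(frames))  # all frames, out of order

# If accuracy barely drops with a single or shuffled frame set, the question
# likely does not require genuine visual temporal reasoning.
```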
