TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models
October 30, 2024
Authors: Ziyao Shangguan, Chuhan Li, Yuxuan Ding, Yanan Zheng, Yilun Zhao, Tesca Fitzgerald, Arman Cohan
cs.AI
Abstract
Existing benchmarks often highlight the remarkable performance achieved by
state-of-the-art Multimodal Foundation Models (MFMs) in leveraging temporal
context for video understanding. However, how well do the models truly perform
visual temporal reasoning? Our study of existing benchmarks shows that this
capability of MFMs is likely overestimated, as many questions can be solved
using a single frame, a few frames, or out-of-order frames. To systematically examine current
visual temporal reasoning tasks, we propose three principles with corresponding
metrics: (1) Multi-Frame Gain, (2) Frame Order Sensitivity, and (3) Frame
Information Disparity. Following these principles, we introduce TOMATO,
Temporal Reasoning Multimodal Evaluation, a novel benchmark crafted to
rigorously assess MFMs' temporal reasoning capabilities in video understanding.
TOMATO comprises 1,484 carefully curated, human-annotated questions spanning
six tasks (i.e., action count, direction, rotation, shape & trend, velocity &
frequency, and visual cues), applied to 1,417 videos that encompass
human-centric, real-world, and simulated scenarios, including 805
self-recorded and self-generated videos. Our comprehensive evaluation reveals a human-model
performance gap of 57.3% with the best-performing model. Moreover, our in-depth
analysis uncovers more fundamental limitations beyond this gap in current MFMs.
While they can accurately recognize events in isolated frames, they fail to
interpret these frames as a continuous sequence. We believe TOMATO will serve
as a crucial testbed for evaluating next-generation MFMs and as a call to
the community to develop AI systems capable of comprehending human-world
dynamics through the video modality.
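
The three principle metrics are only named in the abstract. As a rough illustration, the following minimal Python sketch computes plausible versions of them from model accuracies measured under different frame conditions; the function names and the difference-based definitions are assumptions made here for illustration, not the paper's exact formulas.

```python
def multi_frame_gain(acc_multi: float, acc_single: float) -> float:
    # Assumed definition: accuracy improvement from seeing the full
    # frame sequence rather than a single frame.
    return acc_multi - acc_single


def frame_order_sensitivity(acc_ordered: float, acc_shuffled: float) -> float:
    # Assumed definition: accuracy drop when the frames are shuffled.
    # A large value means the question genuinely depends on temporal order.
    return acc_ordered - acc_shuffled


def frame_information_disparity(per_frame_accs: list[float]) -> float:
    # Assumed definition: spread of accuracy across individual frames.
    # A small spread suggests no single frame leaks the answer.
    return max(per_frame_accs) - min(per_frame_accs)


if __name__ == "__main__":
    # A question scoring near zero on all three metrics can be answered
    # without genuine temporal reasoning.
    print(multi_frame_gain(0.62, 0.58))                     # ~0.04: little gain
    print(frame_order_sensitivity(0.62, 0.60))              # ~0.02: order barely matters
    print(frame_information_disparity([0.55, 0.58, 0.60]))  # ~0.05: low disparity
```

Under these assumed definitions, a benchmark question that scores near zero on all three metrics can be solved from isolated or unordered frames, which is precisely the overestimation of temporal reasoning that the abstract describes in existing benchmarks.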