Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos
October 3, 2024
Authors: Jianrui Zhang, Mu Cai, Yong Jae Lee
cs.AI
Abstract
There has been growing sentiment recently that modern large multimodal models
(LMMs) have addressed most of the key challenges related to short video
comprehension. As a result, both academia and industry are gradually shifting
their attention towards the more complex challenges posed by understanding
long-form videos. However, is this really the case? Our studies indicate that
LMMs still lack many fundamental reasoning capabilities even when dealing with
short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation
benchmark encompassing 1000 short and natural video-caption pairs. We
demonstrate that existing LMMs severely struggle to distinguish temporal
differences between different actions and object transformations. For example,
the best model GPT-4o only obtains ~50% on our text and video scores, showing a
large gap compared to the human baseline of ~90%. All open-source multimodal
models and CLIP-based models perform much worse, producing mostly random chance
performance. Through this work, we shed light on the fact that temporal
reasoning in short videos is a problem yet to be fully solved. The dataset and
evaluation code are available at https://vinoground.github.io.