VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI
October 15, 2024
Authors: Sijie Cheng, Kechen Fang, Yangyang Yu, Sicheng Zhou, Bohao Li, Ye Tian, Tingguang Li, Lei Han, Yang Liu
cs.AI
Abstract
Recent advancements in Multi-modal Large Language Models (MLLMs) have opened
new avenues for applications in Embodied AI. Building on previous work,
EgoThink, we introduce VidEgoThink, a comprehensive benchmark for evaluating
egocentric video understanding capabilities. To bridge the gap between MLLMs
and low-level control in Embodied AI, we design four key interrelated tasks:
video question-answering, hierarchy planning, visual grounding and reward
modeling. To minimize manual annotation costs, we develop an automatic data
generation pipeline based on the Ego4D dataset, leveraging the prior knowledge
and multimodal capabilities of GPT-4o. Three human annotators then filter the
generated data to ensure diversity and quality, resulting in the VidEgoThink
benchmark. We conduct extensive experiments with three types of models:
API-based MLLMs, open-source image-based MLLMs, and open-source video-based
MLLMs. Experimental results indicate that all MLLMs, including GPT-4o, perform
poorly across all tasks related to egocentric video understanding. These
findings suggest that foundation models still require significant advancements
to be effectively applied to first-person scenarios in Embodied AI. In
conclusion, VidEgoThink reflects a research trend towards employing MLLMs for
egocentric vision, akin to human capabilities, enabling active observation and
interaction in complex real-world environments.