VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
October 24, 2024
Authors: Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida
cs.AI
Abstract
Videos are often used to learn or extract the necessary information to
complete tasks in ways different than what text and static imagery alone can
provide. However, many existing agent benchmarks neglect long-context video
understanding, instead focusing on text or static image inputs. To bridge this
gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the
capabilities of long-context multimodal agents for video understanding. VideoWA
consists of 2,021 web agent tasks based on manually crafted video tutorials,
which total almost four hours of content. For our benchmark, we define a
taxonomy of long-context video-based agent tasks with two main areas of focus:
skill retention and factual retention. While skill retention tasks evaluate
whether an agent can use a given human demonstration to complete a task
efficiently, factual retention tasks evaluate whether an agent can retrieve
instruction-relevant information from a video to complete a task. We find that
the best model achieves 13.3% success on factual retention tasks and 45.8% on
factual retention QA pairs, far below human performance at 73.9% and 79.3%,
respectively. On skill retention tasks, long-context models perform worse with
tutorials than without, exhibiting a 5% performance decrease in WebArena tasks
and a 10.3% decrease in VisualWebArena tasks. Our work highlights the need to
improve the agentic abilities of long-context multimodal models and provides a
testbed for future development with long-context video agents.