VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
October 24, 2024
Authors: Lawrence Jang, Yinheng Li, Charles Ding, Justin Lin, Paul Pu Liang, Dan Zhao, Rogerio Bonatti, Kazuhito Koishida
cs.AI
Abstract
Videos are often used to learn or extract the necessary information to
complete tasks in ways different than what text and static imagery alone can
provide. However, many existing agent benchmarks neglect long-context video
understanding, instead focusing on text or static image inputs. To bridge this
gap, we introduce VideoWebArena (VideoWA), a benchmark for evaluating the
capabilities of long-context multimodal agents for video understanding. VideoWA
consists of 2,021 web agent tasks based on manually crafted video tutorials,
which total almost four hours of content. For our benchmark, we define a
taxonomy of long-context video-based agent tasks with two main areas of focus:
skill retention and factual retention. While skill retention tasks evaluate
whether an agent can use a given human demonstration to complete a task
efficiently, factual retention tasks evaluate whether an agent can retrieve
instruction-relevant information from a video to complete a task. We find that
the best model achieves 13.3% success on factual retention tasks and 45.8% on
factual retention QA pairs, far below human performance at 73.9% and 79.3%,
respectively. On skill retention tasks, long-context models perform worse with
tutorials than without, exhibiting a 5% performance decrease in WebArena tasks
and a 10.3% decrease in VisualWebArena tasks. Our work highlights the need to
improve the agentic abilities of long-context multimodal models and provides a
testbed for future development with long-context video agents.