EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
February 13, 2025
Authors: Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang
cs.AI
Abstract
Leveraging Multi-modal Large Language Models (MLLMs) to create embodied
agents offers a promising avenue for tackling real-world tasks. While
language-centric embodied agents have garnered substantial attention,
MLLM-based embodied agents remain underexplored due to the lack of
comprehensive evaluation frameworks. To bridge this gap, we introduce
EmbodiedBench, an extensive benchmark designed to evaluate vision-driven
embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing
tasks across four environments, ranging from high-level semantic tasks (e.g.,
household) to low-level tasks involving atomic actions (e.g., navigation and
manipulation); and (2) six meticulously curated subsets evaluating essential
agent capabilities like commonsense reasoning, complex instruction
understanding, spatial awareness, visual perception, and long-term planning.
Through extensive experiments, we evaluated 13 leading proprietary and
open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel
at high-level tasks but struggle with low-level manipulation, with the best
model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a
multifaceted standardized evaluation platform that not only highlights existing
challenges but also offers valuable insights to advance MLLM-based embodied
agents. Our code is available at https://embodiedbench.github.io.