EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents

January 21, 2025
Authors: Zhili Cheng, Yuge Tu, Ran Li, Shiqi Dai, Jinyi Hu, Shengding Hu, Jiahao Li, Yang Shi, Tianyu Yu, Weize Chen, Lei Shi, Maosong Sun
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have advanced significantly, offering a promising future for embodied agents. Existing benchmarks for evaluating MLLMs primarily use static images or videos, which limits assessment to non-interactive scenarios. Meanwhile, existing embodied AI benchmarks are task-specific and insufficiently diverse, so they do not adequately evaluate the embodied capabilities of MLLMs. To address this, we propose EmbodiedEval, a comprehensive and interactive evaluation benchmark for MLLMs with embodied tasks. EmbodiedEval features 328 distinct tasks within 125 varied 3D scenes, each of which is rigorously selected and annotated. It covers a broad spectrum of existing embodied AI tasks with significantly enhanced diversity, all within a unified simulation and evaluation framework tailored for MLLMs. The tasks are organized into five categories, each assessing different capabilities of the agents: navigation, object interaction, social interaction, attribute question answering, and spatial question answering. We evaluate state-of-the-art MLLMs on EmbodiedEval and find that they fall significantly short of human-level performance on embodied tasks. Our analysis demonstrates the limitations of existing MLLMs in embodied capabilities and provides insights for their future development. We open-source all evaluation data and the simulation framework at https://github.com/thunlp/EmbodiedEval.
