
V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

April 8, 2025
Authors: Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
cs.AI

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluations shift from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate because they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning like long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs, as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies. Code is available at https://github.com/CSU-JPG/V-MAGE.
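
The abstract ranks models by Elo rating comparisons but does not say how those ratings are computed. As a rough illustration only, the sketch below shows the standard pairwise Elo update. The function names, the K-factor of 32, and the starting rating of 1500 are assumptions made for this example and are not taken from V-MAGE.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k (the K-factor) is an assumed value, not one reported by the paper.
    """
    e_a = expected_score(rating_a, rating_b)
    e_b = 1.0 - e_a
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - e_b)
    return new_a, new_b


# Two hypothetical models start at the same assumed rating; model A wins once.
print(update_elo(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

In a scheme like this, each comparison moves the same number of rating points from the loser to the winner, so the ordering of models depends only on the accumulated pairwise outcomes.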
