V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
April 8, 2025
Authors: Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
cs.AI
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have led to
significant improvements across various multimodal benchmarks. However, as
evaluations shift from static datasets to open-world, dynamic environments,
current game-based benchmarks remain inadequate because they lack
visual-centric tasks and fail to assess the diverse reasoning skills required
for real-world decision-making. To address this, we introduce Visual-centric
Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework
designed to assess visual reasoning capabilities of MLLMs. V-MAGE features five
diverse games with 30+ handcrafted levels, testing models on core visual skills
such as positioning, trajectory tracking, timing, and visual memory, alongside
higher-level reasoning like long-term planning and deliberation. We use V-MAGE
to evaluate leading MLLMs, revealing significant challenges in their visual
perception and reasoning. In all game environments, the top-performing MLLMs,
as determined by Elo rating comparisons, exhibit a substantial performance gap
compared to humans. Our findings highlight critical limitations, including
various types of perceptual errors made by the models, and suggest potential
avenues for improvement from an agent-centric perspective, such as refining
agent strategies and addressing perceptual inaccuracies. Code is available at
https://github.com/CSU-JPG/V-MAGE.
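
The abstract ranks models by Elo rating comparisons but does not spell out the procedure. As a rough illustration only, the sketch below applies the standard Elo update to pairwise comparisons of per-level scores. The function names, the starting rating of 1500, the K-factor of 32, and the rule that the higher level score "wins" are all assumptions made for this example, not details taken from the paper or its repository.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return updated ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    The K-factor of 32 is an assumed default, not a value from the paper.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_from_level_scores(
    scores: Dict[str, List[float]], initial: float = 1500.0, k: float = 32.0
) -> Dict[str, float]:
    """Hypothetical helper: derive Elo ratings from per-level game scores.

    Every pair of players is compared on every level; the player with the
    higher level score is treated as the winner of that comparison.
    """
    if not scores:
        return {}
    ratings = {name: initial for name in scores}
    names = sorted(scores)
    n_levels = len(scores[names[0]])
    for level in range(n_levels):
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                sa, sb = scores[a][level], scores[b][level]
                outcome = 1.0 if sa > sb else 0.0 if sa < sb else 0.5
                ratings[a], ratings[b] = update_elo(
                    ratings[a], ratings[b], outcome, k
                )
    return ratings


if __name__ == "__main__":
    # Made-up scores for illustration; these are not results from the paper.
    demo = {"human": [10.0, 9.0, 8.0], "model_a": [3.0, 4.0, 2.0]}
    for name, rating in sorted(rate_from_level_scores(demo).items()):
        print(f"{name}: {rating:.1f}")
```

Because each update adds to one rating exactly what it subtracts from the other, the total rating mass stays constant, so a persistent gap between a human baseline and a model shows up directly as a rating difference.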