V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have led to significant improvements across various multimodal benchmarks. However, as evaluations shift from static datasets to open-world, dynamic environments, current game-based benchmarks remain inadequate because they lack visual-centric tasks and fail to assess the diverse reasoning skills required for real-world decision-making. To address this, we introduce Visual-centric Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework designed to assess visual reasoning capabilities of MLLMs. V-MAGE features five diverse games with 30+ handcrafted levels, testing models on core visual skills such as positioning, trajectory tracking, timing, and visual memory, alongside higher-level reasoning like long-term planning and deliberation. We use V-MAGE to evaluate leading MLLMs, revealing significant challenges in their visual perception and reasoning. In all game environments, the top-performing MLLMs, as determined by Elo rating comparisons, exhibit a substantial performance gap compared to humans. Our findings highlight critical limitations, including various types of perceptual errors made by the models, and suggest potential avenues for improvement from an agent-centric perspective, such as refining agent strategies and addressing perceptual inaccuracies. Code is available at https://github.com/CSU-JPG/V-MAGE.
