V-MAGE: A Game Evaluation Framework for Assessing Visual-Centric Capabilities in Multimodal Large Language Models
April 8, 2025
Authors: Xiangxi Zheng, Linjie Li, Zhengyuan Yang, Ping Yu, Alex Jinpeng Wang, Rui Yan, Yuan Yao, Lijuan Wang
cs.AI
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have led to
significant improvements across various multimodal benchmarks. However, as
evaluations shift from static datasets to open-world, dynamic environments,
current game-based benchmarks remain inadequate because they lack
visual-centric tasks and fail to assess the diverse reasoning skills required
for real-world decision-making. To address this, we introduce Visual-centric
Multiple Abilities Game Evaluation (V-MAGE), a game-based evaluation framework
designed to assess visual reasoning capabilities of MLLMs. V-MAGE features five
diverse games with 30+ handcrafted levels, testing models on core visual skills
such as positioning, trajectory tracking, timing, and visual memory, alongside
higher-level reasoning like long-term planning and deliberation. We use V-MAGE
to evaluate leading MLLMs, revealing significant challenges in their visual
perception and reasoning. In all game environments, the top-performing MLLMs,
as determined by Elo rating comparisons, exhibit a substantial performance gap
compared to humans. Our findings highlight critical limitations, including
various types of perceptual errors made by the models, and suggest potential
avenues for improvement from an agent-centric perspective, such as refining
agent strategies and addressing perceptual inaccuracies. Code is available at
https://github.com/CSU-JPG/V-MAGE.
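
The abstract ranks models by Elo rating comparisons but does not spell out the procedure. As a rough illustration only, the sketch below shows the standard pairwise Elo update. The function names, the K-factor of 32, and the 400-point scale are assumptions taken from the conventional Elo formulation, not from the V-MAGE code or paper.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    exp_a = expected_score(rating_a, rating_b)
    exp_b = 1.0 - exp_a
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - exp_b)
    return new_a, new_b


# Hypothetical example: two equally rated agents, A wins one game level.
model_a, model_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

In a setup like this, each pair of agents (or an agent and a human baseline) would be compared on the same game level, and the ratings accumulated over all levels would give the final ranking.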