VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
April 21, 2025
Authors: Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
cs.AI
Abstract
Visual reasoning is a core component of human intelligence and a critical
capability for advanced multimodal models. Yet current reasoning evaluations of
multimodal large language models (MLLMs) often rely on text descriptions and
allow language-based reasoning shortcuts, failing to measure genuine
vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark
of 1,000 human-verified problems across six categories (e.g., quantitative
shifts, spatial relations, attribute comparisons). These diverse question types
probe the visual reasoning capabilities of MLLMs from multiple perspectives. We
evaluate leading MLLMs on this benchmark and
analyze their results to identify common failure modes. Most models score below
30% accuracy, only slightly above the 25% random baseline and far below the
51.4% achieved by humans, revealing significant gaps in visual reasoning.
Furthermore, we provide a supplementary training dataset and a
reinforcement-learning baseline to support further progress.
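As a concrete illustration of the headline numbers, below is a minimal sketch of how accuracy on a four-option multiple-choice benchmark could be scored against the 25% random baseline. The record fields (`image`, `question`, `answer`) and the `query_model` callable are hypothetical placeholders for whatever MLLM is being benchmarked, not the paper's actual evaluation harness.

```python
import random

# Answer options for a four-option multiple-choice item. The record
# schema below is a hypothetical placeholder, not VisuLogic's actual format.
OPTIONS = ["A", "B", "C", "D"]

def evaluate(items, query_model):
    """Return the accuracy of `query_model` over `items`.

    `query_model(image, question)` stands in for the MLLM call under
    test; it is expected to return one of OPTIONS.
    """
    correct = sum(
        query_model(item["image"], item["question"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

def random_baseline(items, trials=1000, seed=0):
    """Empirical random-guess accuracy; converges to 0.25 for four options."""
    rng = random.Random(seed)
    hits = sum(
        rng.choice(OPTIONS) == item["answer"]
        for _ in range(trials)
        for item in items
    )
    return hits / (trials * len(items))
```

Under this setup, random guessing converges to 25%, which is why sub-30% model scores, versus 51.4% for humans, indicate little headroom above chance.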