VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
April 21, 2025
Authors: Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
cs.AI
Abstract
Visual reasoning is a core component of human intelligence and a critical
capability for advanced multimodal models. Yet current reasoning evaluations of
multimodal large language models (MLLMs) often rely on text descriptions and
allow language-based reasoning shortcuts, failing to measure genuine
vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark
of 1,000 human-verified problems across six categories (e.g., quantitative
shifts, spatial relations, attribute comparisons). These diverse question types
assess the visual reasoning capabilities of MLLMs from multiple perspectives. We
evaluate leading MLLMs on this benchmark and
analyze their results to identify common failure modes. Most models score below
30% accuracy, only slightly above the 25% random baseline and far below the
51.4% achieved by humans, revealing significant gaps in visual reasoning.
Furthermore, we provide a supplementary training dataset and a
reinforcement-learning baseline to support further progress.
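
To make the cited numbers concrete, below is a minimal, hypothetical sketch of the accuracy comparison the abstract describes. It is not the authors' evaluation code: it assumes four-option (A-D) multiple-choice questions, which is what makes random guessing a 25% baseline, and simulates answers rather than loading the actual VisuLogic release.

```python
# Hypothetical sketch: accuracy on a four-option multiple-choice benchmark
# versus the 25% random-guess baseline. Data here is simulated, not VisuLogic.
import random

def evaluate(predictions: list[str], answers: list[str]) -> float:
    """Return the fraction of questions answered correctly."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Simulated ground truth for 1,000 questions with options A-D
# (matching the benchmark's size and assumed answer format).
answers = [random.choice("ABCD") for _ in range(1000)]

# A random-guessing "model" approximates the 25% baseline the abstract cites.
random_preds = [random.choice("ABCD") for _ in range(1000)]
print(f"Random baseline accuracy: {evaluate(random_preds, answers):.1%}")  # ~25%
```

Under this setup, a model scoring below 30% is barely distinguishable from guessing, which is the gap the abstract highlights against the 51.4% human result.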