VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
April 21, 2025
Authors: Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, Wenhai Wang, Jifeng Dai, Jinguo Zhu
cs.AI
Abstract
Visual reasoning is a core component of human intelligence and a critical
capability for advanced multimodal models. Yet current reasoning evaluations of
multimodal large language models (MLLMs) often rely on text descriptions and
allow language-based reasoning shortcuts, failing to measure genuine
vision-centric reasoning. To address this, we introduce VisuLogic: a benchmark
of 1,000 human-verified problems across six categories (e.g., quantitative
shifts, spatial relations, attribute comparisons). These diverse question types
probe the visual reasoning capabilities of MLLMs from multiple perspectives. We
evaluate leading MLLMs on this benchmark and
analyze their results to identify common failure modes. Most models score below
30% accuracy, only slightly above the 25% random baseline and far below the
51.4% achieved by humans, revealing significant gaps in visual reasoning.
Furthermore, we provide a supplementary training dataset and a
reinforcement-learning baseline to support further progress.
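As a concrete illustration of the headline numbers, below is a minimal sketch of how accuracy on a four-option multiple-choice benchmark could be scored against the 25% random baseline. The record fields (`image`, `question`, `answer`) and the `query_model` callable are hypothetical placeholders for whatever MLLM is being benchmarked, not the paper's actual evaluation harness.

```python
import random

# Answer options for a four-option multiple-choice item. The record
# schema below is a hypothetical placeholder, not VisuLogic's actual format.
OPTIONS = ["A", "B", "C", "D"]

def evaluate(items, query_model):
    """Return the accuracy of `query_model` over `items`.

    `query_model(image, question)` stands in for the MLLM call under
    test; it is expected to return one of OPTIONS.
    """
    correct = sum(
        query_model(item["image"], item["question"]) == item["answer"]
        for item in items
    )
    return correct / len(items)

def random_baseline(items, trials=1000, seed=0):
    """Empirical random-guess accuracy; converges to 0.25 for four options."""
    rng = random.Random(seed)
    hits = sum(
        rng.choice(OPTIONS) == item["answer"]
        for _ in range(trials)
        for item in items
    )
    return hits / (trials * len(items))
```

Under this setup, random guessing converges to 25%, which is why sub-30% model scores, versus 51.4% for humans, indicate little headroom above chance.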