

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

April 24, 2025
Authors: Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao
cs.AI

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have significantly enhanced their ability to integrate visual and linguistic information, achieving near-human proficiency in tasks like object recognition, captioning, and visual question answering. However, current benchmarks typically focus on knowledge-centric evaluations that assess domain-specific expertise, often neglecting the core ability to reason about fundamental mathematical elements and visual concepts. We identify a gap in evaluating elementary-level math problems, which rely on explicit visual dependencies, requiring models to discern, integrate, and reason across multiple images while incorporating commonsense knowledge, all of which are crucial for advancing toward broader AGI capabilities. To address this gap, we introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependencies. VCBENCH includes 1,720 problems across six cognitive domains, featuring 6,697 images (averaging 3.9 per question) to ensure multi-image reasoning. We evaluate 26 state-of-the-art LVLMs on VCBENCH, revealing substantial performance disparities, with even the top models unable to exceed 50% accuracy. Our findings highlight the ongoing challenges in visual-mathematical integration and suggest avenues for future LVLM advancements.
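For concreteness, below is a minimal sketch of how accuracy on such a benchmark might be computed. The dataset schema (`question`, `images`, `answer` fields), the `LVLM` interface, and exact-match scoring are illustrative assumptions for this sketch, not the paper's released evaluation code or API.

```python
# Minimal sketch of an exact-match evaluation loop for a VCBENCH-style
# benchmark. Dataset fields and the model interface are assumptions made
# for illustration; they are not taken from the paper.
from typing import List, Protocol


class LVLM(Protocol):
    """Any model wrapper exposing a generate(images, prompt) method."""

    def generate(self, images: List[str], prompt: str) -> str: ...


def evaluate(model: LVLM, problems: List[dict]) -> float:
    """Return exact-match accuracy over multi-image math problems."""
    correct = 0
    for p in problems:
        # All images for a question are passed together, since the
        # benchmark is built for multi-image reasoning.
        prediction = model.generate(images=p["images"], prompt=p["question"])
        if prediction.strip() == p["answer"].strip():
            correct += 1
    return correct / len(problems)
```

Passing every image for a question in a single call reflects the benchmark's multi-image requirement (3.9 images per question on average); a model that attends to only one image at a time would miss the explicit visual dependencies the benchmark is designed to test.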
