VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge
April 14, 2025
作者: Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, Xiang Yue
cs.AI
Abstract
Current multimodal benchmarks often conflate reasoning with domain-specific
knowledge, making it difficult to isolate and evaluate general reasoning
abilities in non-expert settings. To address this, we introduce VisualPuzzles,
a benchmark that targets visual reasoning while deliberately minimizing
reliance on specialized knowledge. VisualPuzzles consists of diverse questions
spanning five categories: algorithmic, analogical, deductive, inductive, and
spatial reasoning. One major source of our questions is manually translated
logical reasoning questions from the Chinese Civil Service Examination.
Experiments show that VisualPuzzles requires significantly less intensive
domain-specific knowledge and more complex reasoning compared to benchmarks
like MMMU, enabling us to better evaluate genuine multimodal reasoning.
Evaluations show that state-of-the-art multimodal large language models
consistently lag behind human performance on VisualPuzzles, and that strong
performance on knowledge-intensive benchmarks does not necessarily translate to
success on reasoning-focused, knowledge-light tasks. Additionally, reasoning
enhancements such as scaling up inference compute (with "thinking" modes) yield
inconsistent gains across models and task types, and we observe no clear
correlation between model size and performance. We also find that models
exhibit different reasoning and answering patterns on VisualPuzzles compared to
benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer
lens through which to evaluate reasoning capabilities beyond factual recall and
domain knowledge.