

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

April 14, 2025
作者: Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, Xiang Yue
cs.AI

Abstract

Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is logical reasoning questions manually translated from the Chinese Civil Service Examination. Experiments show that VisualPuzzles relies far less on domain-specific knowledge and demands more complex reasoning than benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also find that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with a heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

