
VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

April 14, 2025
作者: Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, Xiang Yue
cs.AI

Abstract

Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires substantially less domain-specific knowledge and more complex reasoning than benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with "thinking" modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also find that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with a heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.
