

Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

February 22, 2025
Authors: Qianqi Yan, Yue Fan, Hongquan Li, Shan Jiang, Yang Zhao, Xinze Guan, Ching-Chen Kuo, Xin Eric Wang
cs.AI

Abstract

Existing Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs, leaving open the question of whether they can handle inconsistencies in real-world, layout-rich content. To bridge this gap, we propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs' ability to detect and reason about semantic mismatches in artifacts such as webpages, presentation slides, and posters. MMIR comprises 534 challenging samples, each containing synthetically injected errors across five reasoning-heavy categories: Factual Contradiction, Identity Misattribution, Contextual Mismatch, Quantitative Discrepancy, and Temporal/Spatial Incoherence. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts while open-source models remain particularly vulnerable to inconsistency errors. Detailed error analyses further show that models excel in detecting inconsistencies confined to a single modality, particularly in text, but struggle with cross-modal conflicts and complex layouts. Probing experiments reveal that single-modality prompting, including Chain-of-Thought (CoT) and Set-of-Mark (SoM) methods, yields marginal gains, revealing a key bottleneck in cross-modal reasoning. Our findings highlight the need for advanced multimodal reasoning and point to future research on multimodal inconsistency.
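To make the detection task concrete, below is a minimal illustrative sketch in Python of how an inconsistency-detection query over a layout-rich artifact might be posed to an MLLM. The `ArtifactSample` schema, its field names, and the prompt wording are assumptions made for illustration only; they do not reflect MMIR's actual data format, error-injection pipeline, or evaluation protocol.

```python
from dataclasses import dataclass

# Hypothetical sketch: the benchmark's real schema is not specified here.
# The five reasoning-heavy error categories come from the abstract above.
CATEGORIES = [
    "Factual Contradiction",
    "Identity Misattribution",
    "Contextual Mismatch",
    "Quantitative Discrepancy",
    "Temporal/Spatial Incoherence",
]


@dataclass
class ArtifactSample:
    """One layout-rich artifact (e.g., webpage or slide) with an injected error."""
    image_path: str            # rendered screenshot of the artifact
    elements: list[str]        # text extracted from labeled layout elements
    injected_category: str     # ground-truth error category
    inconsistent_element: int  # index of the element that conflicts with the rest


def build_detection_prompt(sample: ArtifactSample) -> str:
    """Compose a text prompt asking an MLLM to localize the inconsistency.

    The screenshot itself would be attached separately through whatever
    multimodal interface the evaluated model exposes.
    """
    numbered = "\n".join(f"[{i}] {text}" for i, text in enumerate(sample.elements))
    return (
        "The attached screenshot shows a layout-rich artifact whose text elements "
        "are listed below. Exactly one element is semantically inconsistent with "
        "the rest of the artifact. Identify its index and explain the conflict.\n"
        + numbered
    )


if __name__ == "__main__":
    # Illustrative sample with a temporal inconsistency injected into element [1].
    sample = ArtifactSample(
        image_path="poster_042.png",
        elements=[
            "Conference date: May 3, 2024",
            "Join us on June 10!",
            "Venue: Hall B",
        ],
        injected_category="Temporal/Spatial Incoherence",
        inconsistent_element=1,
    )
    print(build_detection_prompt(sample))
```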

