MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

February 24, 2025
Authors: Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) have experienced rapid progress in visual recognition tasks in recent years. Given their potential integration into many critical applications, it is important to understand the limitations of their visual perception. In this work, we study whether MLLMs can perceive small visual details as effectively as large ones when answering questions about images. We observe that their performance is very sensitive to the size of the visual subject of the question, and further show that this effect is in fact causal by conducting an intervention study. Next, we study the attention patterns of MLLMs when answering visual questions, and intriguingly find that they consistently know where to look, even when they provide the wrong answer. Based on these findings, we then propose training-free visual intervention methods that leverage the internal knowledge of any MLLM itself, in the form of attention and gradient maps, to enhance its perception of small visual details. We evaluate our proposed methods on two widely-used MLLMs and seven visual question answering benchmarks and show that they can significantly improve MLLMs' accuracy without requiring any training. Our results elucidate the risk of applying MLLMs to visual recognition tasks concerning small details and indicate that visual intervention using the model's internal state is a promising direction to mitigate this risk.
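
The abstract does not say how an attention or gradient map is turned into a visual intervention. One plausible reading, offered here only as an illustrative sketch and not as the authors' method, is to crop the image around the region the model attends to most and show that crop to the model again. The function name `attention_to_crop_box`, the fixed window size, and the assumption that the map is a 2D grid of patch-level scores are all hypothetical.

```python
import numpy as np


def attention_to_crop_box(attn, image_size, window=(2, 2)):
    """Return the (left, top, right, bottom) pixel box of the patch window
    with the highest total attention.

    attn:       2D array (rows, cols) of patch-level scores (assumed layout).
    image_size: (width, height) of the image in pixels.
    window:     (rows, cols) size of the search window, in patches.
    """
    attn = np.asarray(attn, dtype=float)
    rows, cols = attn.shape
    win_r, win_c = window
    if not (0 < win_r <= rows and 0 < win_c <= cols):
        raise ValueError("window must fit inside the attention map")

    # Summed-area table with a zero border, so every window sum
    # is obtained from four lookups.
    sat = np.zeros((rows + 1, cols + 1))
    sat[1:, 1:] = attn.cumsum(axis=0).cumsum(axis=1)
    sums = (
        sat[win_r:, win_c:]
        - sat[:-win_r, win_c:]
        - sat[win_r:, :-win_c]
        + sat[:-win_r, :-win_c]
    )

    # Top-left patch index of the best-scoring window.
    r0, c0 = np.unravel_index(np.argmax(sums), sums.shape)

    # Convert patch indices to pixel coordinates.
    width, height = image_size
    patch_w = width / cols
    patch_h = height / rows
    left = int(round(c0 * patch_w))
    top = int(round(r0 * patch_h))
    right = int(round((c0 + win_c) * patch_w))
    bottom = int(round((r0 + win_r) * patch_h))
    return (left, top, right, bottom)
```

The returned tuple follows the (left, upper, right, lower) convention, so it could be passed to Pillow's `Image.crop` before re-querying the model. How the map itself is extracted from a given MLLM is model-specific and is not covered by this sketch.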
