Words or Vision: Do Vision-Language Models Have Blind Faith in Text?
March 4, 2025
Authors: Ailin Deng, Tri Cao, Zhirui Chen, Bryan Hooi
cs.AI
Abstract
Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks, but their handling of inconsistencies between modalities is underexplored. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. By introducing textual variations to four vision-centric tasks and evaluating ten Vision-Language Models (VLMs), we discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise, leading to significant performance drops under corrupted text and raising safety concerns. We analyze factors influencing this text bias, including instruction prompts, language model size, text relevance, token order, and the interplay between visual and textual certainty. While certain factors, such as scaling up the language model size, slightly mitigate text bias, others like token order can exacerbate it due to positional biases inherited from language models. To address this issue, we explore supervised fine-tuning with text augmentation and demonstrate its effectiveness in reducing text bias. Additionally, we provide a theoretical analysis suggesting that the blind faith in text phenomenon may stem from an imbalance of pure text and multi-modal data during training. Our findings highlight the need for balanced training and careful consideration of modality interactions in VLMs to enhance their robustness and reliability in handling multi-modal data inconsistencies.