Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model
January 21, 2025
Authors: Kazi Hasan Ibn Arif, Sajib Acharjee Dip, Khizar Hussain, Lang Zhang, Chris Thomas
cs.AI
Abstract
Large Vision Language Models (LVLMs) have demonstrated remarkable
capabilities in understanding and describing visual content, achieving
state-of-the-art performance across various vision-language tasks. However,
these models frequently exhibit hallucination behavior, where they generate
descriptions containing objects or details absent in the input image. Our work
investigates this phenomenon by analyzing attention patterns across transformer
layers and heads, revealing that hallucinations often stem from progressive
degradation of visual grounding in deeper layers. We propose a novel attention
modification approach that combines selective token emphasis and head-specific
modulation to maintain visual grounding throughout the generation process. Our
method introduces two key components: (1) a dual-stream token selection
mechanism that identifies and prioritizes both locally informative and
spatially significant visual tokens, and (2) an attention head-specific
modulation strategy that differentially amplifies visual information processing
based on measured visual sensitivity of individual attention heads. Through
extensive experimentation on the MSCOCO dataset, we demonstrate that our
approach reduces hallucination rates by up to 62.3% compared to baseline
models while maintaining comparable task performance. Our analysis reveals that
selectively modulating tokens across attention heads with varying levels of
visual sensitivity can significantly improve visual grounding without requiring
model retraining.
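To make the two components concrete, below is a minimal PyTorch sketch of how selective token emphasis and head-specific modulation could be wired into one decoding step. It is an illustration based only on the abstract, not the authors' implementation: the function names, the uniform spatial prior, the sensitivity threshold, and the logit-boost form of emphasis are all assumptions.

```python
# Illustrative sketch (not the paper's code): edit pre-softmax attention scores
# so that visually sensitive heads put more mass on a small set of selected
# visual tokens. All thresholds and priors below are hypothetical.
import torch


def head_visual_sensitivity(attn_probs: torch.Tensor, visual_mask: torch.Tensor) -> torch.Tensor:
    """Fraction of each head's attention mass that lands on visual tokens.

    attn_probs: (num_heads, q_len, kv_len) post-softmax attention.
    visual_mask: (kv_len,) bool, True where the key token is a visual token.
    """
    visual_mass = attn_probs[..., visual_mask].sum(dim=-1)  # (heads, q_len)
    return visual_mass.mean(dim=-1)                          # (heads,)


def select_visual_tokens(attn_probs: torch.Tensor, visual_mask: torch.Tensor, top_k: int = 8) -> torch.Tensor:
    """Dual-stream-style selection: score visual tokens by how much attention
    they receive (local informativeness) times a spatial prior (here a uniform
    placeholder; a real system might use saliency or region size)."""
    received = attn_probs[..., visual_mask].mean(dim=(0, 1))  # (num_visual,)
    spatial_prior = torch.ones_like(received)                  # placeholder prior
    score = received * spatial_prior
    top = torch.topk(score, k=min(top_k, score.numel())).indices
    visual_idx = visual_mask.nonzero(as_tuple=True)[0]
    return visual_idx[top]                                     # indices into the kv sequence


def modulate_scores(attn_scores: torch.Tensor, selected_idx: torch.Tensor,
                    sensitivity: torch.Tensor, sens_threshold: float = 0.2,
                    emphasis: float = 1.5) -> torch.Tensor:
    """Head-specific modulation: additively boost the logits of the selected
    visual tokens, but only in heads whose measured visual sensitivity
    exceeds the threshold."""
    boosted = attn_scores.clone()
    sensitive_heads = (sensitivity > sens_threshold).nonzero(as_tuple=True)[0]
    for h in sensitive_heads:
        boosted[h, :, selected_idx] += torch.log(torch.tensor(emphasis))
    return boosted


# Toy usage with random tensors standing in for one decoding step.
heads, q_len, kv_len, n_visual = 8, 1, 64, 32
scores = torch.randn(heads, q_len, kv_len)
visual_mask = torch.zeros(kv_len, dtype=torch.bool)
visual_mask[:n_visual] = True

probs = scores.softmax(dim=-1)
sens = head_visual_sensitivity(probs, visual_mask)
selected = select_visual_tokens(probs, visual_mask)
new_probs = modulate_scores(scores, selected, sens).softmax(dim=-1)  # replaces the original attention
```

Because the intervention only rescales attention logits at inference time, a sketch like this needs no gradient updates, which is consistent with the abstract's claim that the method works without model retraining.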