Mitigating Object Hallucination via Concentric Causal Attention

October 21, 2024
Authors: Yun Xing, Yiheng Li, Ivan Laptev, Shijian Lu
cs.AI

Abstract

Recent Large Vision Language Models (LVLMs) present remarkable zero-shot conversational and reasoning capabilities given multimodal queries. Nevertheless, they suffer from object hallucination, a phenomenon where LVLMs are prone to generating textual responses not factually aligned with image inputs. Our pilot study reveals that object hallucination is closely tied to Rotary Position Encoding (RoPE), a widely adopted positional dependency modeling design in existing LVLMs. Due to the long-term decay in RoPE, LVLMs tend to hallucinate more when relevant visual cues are distant from instruction tokens in the multimodal input sequence. Additionally, we observe a similar effect when reversing the sequential order of visual tokens during multimodal alignment. Our tests indicate that long-term decay in RoPE poses challenges to LVLMs in capturing visual-instruction interactions across long distances. We propose Concentric Causal Attention (CCA), a simple yet effective positional alignment strategy that mitigates the impact of RoPE long-term decay in LVLMs by naturally reducing the relative distance between visual and instruction tokens. With CCA, visual tokens can better interact with instruction tokens, thereby enhancing the model's perception capability and alleviating object hallucination. Without bells and whistles, our positional alignment method surpasses existing hallucination mitigation strategies by large margins on multiple object hallucination benchmarks.
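To make the positional idea concrete, below is a minimal Python sketch of one plausible reading of the concentric assignment: every visual token in an h × w grid receives a position index equal to its concentric ring depth (its distance from the grid border), so a raster scan spanning h·w distinct offsets collapses to roughly min(h, w)/2 ring indices. The helper name `concentric_position_ids` is a hypothetical label, not from the paper; the authors' actual scheme, including the modified causal attention mask that accompanies it, may differ in detail.

```python
import numpy as np

def concentric_position_ids(h: int, w: int) -> np.ndarray:
    """Hypothetical sketch: map each cell of an h x w visual-token grid
    to its concentric ring depth, i.e. its distance from the nearest
    grid border. Tokens on the same ring share one position index."""
    rows = np.arange(h).reshape(-1, 1)   # row indices, shape (h, 1)
    cols = np.arange(w).reshape(1, -1)   # column indices, shape (1, w)
    # Ring depth = distance to the nearest edge, taking the smaller of
    # the vertical and horizontal edge distances for each cell.
    return np.minimum(np.minimum(rows, h - 1 - rows),
                      np.minimum(cols, w - 1 - cols))

if __name__ == "__main__":
    print(concentric_position_ids(4, 4))
    # [[0 0 0 0]
    #  [0 1 1 0]
    #  [0 1 1 0]
    #  [0 0 0 0]]
```

Under this reading, a 24 × 24 visual grid occupies position indices 0-11 instead of 0-575, so instruction tokens appended after the image stay within a short RoPE relative distance of every visual token, which is the effect the abstract attributes to CCA.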
