DiffCLIP: Differential Attention Meets CLIP
March 9, 2025
Authors: Hasan Abed Al Kader Hammoud, Bernard Ghanem
cs.AI
Abstract
We propose DiffCLIP, a novel vision-language model that extends the
differential attention mechanism to CLIP architectures. Differential attention
was originally developed for large language models to amplify relevant context
while canceling out noisy information. In this work, we integrate this
mechanism into CLIP's dual encoder (image and text) framework. With minimal
additional parameters, DiffCLIP achieves superior performance on image-text
understanding tasks. Across zero-shot classification, retrieval, and robustness
benchmarks, DiffCLIP consistently outperforms baseline CLIP models. Notably,
these gains come with negligible computational overhead, demonstrating that
differential attention can significantly enhance multi-modal representations
without sacrificing efficiency. Code can be found at
https://github.com/hammoudhasan/DiffCLIP.
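As a point of reference, the sketch below illustrates the core idea of differential attention in PyTorch: two independent softmax attention maps are computed, and their difference, weighted by a learnable lambda, is used to aggregate the values, so the second map effectively cancels out attention noise. This is a simplified, single-head illustration of the mechanism the abstract refers to; the class and parameter names (e.g. DifferentialAttention, lambda_init) are hypothetical and are not taken from the DiffCLIP codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DifferentialAttention(nn.Module):
    """Single-head sketch of differential attention: values are aggregated
    with the difference of two softmax attention maps, weighted by a
    learnable lambda (illustrative only, not the DiffCLIP implementation)."""

    def __init__(self, dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two sets of query/key projections, one shared value projection.
        self.q_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.k_proj = nn.Linear(dim, 2 * dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)
        # Learnable weight on the second ("noise") attention map.
        self.lambda_param = nn.Parameter(torch.tensor(lambda_init))
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        attn1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        attn2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)

        # Differential attention: subtract the second map from the first,
        # so shared (noisy) attention mass cancels out.
        attn = attn1 - self.lambda_param * attn2
        return self.out_proj(attn @ v)


if __name__ == "__main__":
    tokens = torch.randn(2, 77, 512)  # e.g. a batch of CLIP-style text tokens
    print(DifferentialAttention(512)(tokens).shape)  # torch.Size([2, 77, 512])
```

In DiffCLIP, attention layers of this kind replace standard attention in both the image and text encoders; the extra parameters are limited to the second query/key projections and the lambda weights, which is why the reported overhead is small.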