Decoupled Global-Local Alignment for Improving Compositional Understanding
April 23, 2025
Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) has achieved success on
multiple downstream tasks by aligning image and text modalities. However, the
nature of global contrastive learning limits CLIP's ability to comprehend
compositional concepts, such as relations and attributes. Although recent
studies employ global hard negative samples to improve compositional
understanding, these methods significantly compromise the model's inherent
general capabilities by forcibly distancing textual negative samples from
images in the embedding space. To overcome this limitation, we introduce a
Decoupled Global-Local Alignment (DeGLA) framework that improves compositional
understanding while substantially mitigating losses in general capabilities. To
optimize the retention of the model's inherent capabilities, we incorporate a
self-distillation mechanism within the global alignment process, aligning the
learnable image-text encoder with a frozen teacher model derived from an
exponential moving average. This self-distillation constraint effectively
mitigates the catastrophic forgetting of pretrained knowledge
during fine-tuning. To improve compositional understanding, we first leverage
the in-context learning capability of Large Language Models (LLMs) to construct
about 2M high-quality negative captions across five types. Subsequently, we
propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC)
loss to enhance vision-language compositionality. Extensive experimental results
demonstrate the effectiveness of the DeGLA framework. Compared to previous
state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across
the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average
performance improvement of 13.0% on zero-shot classification tasks across
eleven datasets. Our code will be released at
https://github.com/xiaoxing2001/DeGLA.
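
The global-alignment branch described above keeps a learnable image-text encoder close to a frozen teacher maintained as an exponential moving average (EMA) of the student. Below is a minimal sketch of what such an EMA update and self-distillation penalty could look like; the function names, momentum value, and cosine-based loss are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_ema(teacher, student, momentum=0.999):
    """Update the frozen teacher as an exponential moving average of the student.
    The teacher is typically initialized as a frozen deep copy of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def self_distillation_loss(student_img, student_txt, teacher_img, teacher_txt):
    """Pull the student's image/text embeddings toward the EMA teacher's,
    which helps mitigate catastrophic forgetting of pretrained knowledge."""
    img_loss = 1.0 - F.cosine_similarity(student_img, teacher_img.detach(), dim=-1).mean()
    txt_loss = 1.0 - F.cosine_similarity(student_txt, teacher_txt.detach(), dim=-1).mean()
    return img_loss + txt_loss
```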
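For the local branch, the abstract mentions Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC) losses computed against LLM-generated hard-negative captions. The sketch below shows one plausible image-anchored form, contrasting each image's positive caption against K generated negatives; the exact IGC/TGC formulations in the paper may differ, and all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def image_grounded_contrast(image_emb, pos_text_emb, neg_text_embs, temperature=0.07):
    """image_emb: (B, D); pos_text_emb: (B, D); neg_text_embs: (B, K, D).
    For each image, contrast its positive caption against K hard-negative captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    pos_text_emb = F.normalize(pos_text_emb, dim=-1)
    neg_text_embs = F.normalize(neg_text_embs, dim=-1)

    pos_logit = (image_emb * pos_text_emb).sum(dim=-1, keepdim=True)    # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", image_emb, neg_text_embs)   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature    # (B, 1+K)

    # The positive caption sits at index 0 for every image in the batch.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

A text-grounded counterpart could be formed symmetrically by anchoring on the positive caption and contrasting candidate images, but that is likewise an assumption rather than the paper's stated definition.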