Decoupled Global-Local Alignment for Improving Compositional Understanding
April 23, 2025
Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
cs.AI
Abstract
Contrastive Language-Image Pre-training (CLIP) has achieved success on
multiple downstream tasks by aligning image and text modalities. However, the
nature of global contrastive learning limits CLIP's ability to comprehend
compositional concepts, such as relations and attributes. Although recent
studies employ global hard negative samples to improve compositional
understanding, these methods significantly compromise the model's inherent
general capabilities by forcibly distancing textual negative samples from
images in the embedding space. To overcome this limitation, we introduce a
Decoupled Global-Local Alignment (DeGLA) framework that improves compositional
understanding while substantially mitigating losses in general capabilities. To
optimize the retention of the model's inherent capabilities, we incorporate a
self-distillation mechanism within the global alignment process, aligning the
learnable image-text encoder with a frozen teacher model derived from an
exponential moving average. This self-distillation constraint effectively
mitigates the catastrophic forgetting of pretrained knowledge
during fine-tuning. To improve compositional understanding, we first leverage
the in-context learning capability of Large Language Models (LLMs) to construct
about 2M high-quality negative captions across five types. Subsequently, we
propose the Image-Grounded Contrast (IGC) loss and Text-Grounded Contrast (TGC)
loss to enhance vision-language compositionality. Extensive experimental results
demonstrate the effectiveness of the DeGLA framework. Compared to previous
state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across
the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average
performance improvement of 13.0% on zero-shot classification tasks across
eleven datasets. Our code will be released at
https://github.com/xiaoxing2001/DeGLA.
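
The global-alignment branch described above keeps a learnable image-text encoder close to a frozen teacher maintained as an exponential moving average (EMA) of the student. Below is a minimal sketch of what such an EMA update and self-distillation penalty could look like; the function names, momentum value, and cosine-based loss are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_ema(teacher, student, momentum=0.999):
    """Update the frozen teacher as an exponential moving average of the student.
    The teacher is typically initialized as a frozen deep copy of the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def self_distillation_loss(student_img, student_txt, teacher_img, teacher_txt):
    """Pull the student's image/text embeddings toward the EMA teacher's,
    which helps mitigate catastrophic forgetting of pretrained knowledge."""
    img_loss = 1.0 - F.cosine_similarity(student_img, teacher_img.detach(), dim=-1).mean()
    txt_loss = 1.0 - F.cosine_similarity(student_txt, teacher_txt.detach(), dim=-1).mean()
    return img_loss + txt_loss
```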
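For the local branch, the abstract mentions Image-Grounded Contrast (IGC) and Text-Grounded Contrast (TGC) losses computed against LLM-generated hard-negative captions. The sketch below shows one plausible image-anchored form, contrasting each image's positive caption against K generated negatives; the exact IGC/TGC formulations in the paper may differ, and all names here are hypothetical.

```python
import torch
import torch.nn.functional as F

def image_grounded_contrast(image_emb, pos_text_emb, neg_text_embs, temperature=0.07):
    """image_emb: (B, D); pos_text_emb: (B, D); neg_text_embs: (B, K, D).
    For each image, contrast its positive caption against K hard-negative captions."""
    image_emb = F.normalize(image_emb, dim=-1)
    pos_text_emb = F.normalize(pos_text_emb, dim=-1)
    neg_text_embs = F.normalize(neg_text_embs, dim=-1)

    pos_logit = (image_emb * pos_text_emb).sum(dim=-1, keepdim=True)    # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", image_emb, neg_text_embs)   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature    # (B, 1+K)

    # The positive caption sits at index 0 for every image in the batch.
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```

A text-grounded counterpart could be formed symmetrically by anchoring on the positive caption and contrasting candidate images, but that is likewise an assumption rather than the paper's stated definition.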