

Decoupled Global-Local Alignment for Improving Compositional Understanding

April 23, 2025
Authors: Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, Yupei Wang
cs.AI

Abstract

Contrastive Language-Image Pre-training (CLIP) has achieved success on multiple downstream tasks by aligning the image and text modalities. However, the nature of global contrastive learning limits CLIP's ability to comprehend compositional concepts such as relations and attributes. Although recent studies employ global hard negative samples to improve compositional understanding, these methods significantly compromise the model's inherent general capabilities by forcibly distancing textual negative samples from images in the embedding space. To overcome this limitation, we introduce a Decoupled Global-Local Alignment (DeGLA) framework that improves compositional understanding while substantially mitigating losses in general capabilities. To optimize the retention of the model's inherent capabilities, we incorporate a self-distillation mechanism within the global alignment process, aligning the learnable image-text encoder with a frozen teacher model derived from an exponential moving average. This self-distillation constraint effectively mitigates the catastrophic forgetting of pretrained knowledge during fine-tuning. To improve compositional understanding, we first leverage the in-context learning capability of Large Language Models (LLMs) to construct about 2M high-quality negative captions across five types. Subsequently, we propose the Image-Grounded Contrast (IGC) loss and the Text-Grounded Contrast (TGC) loss to enhance vision-language compositionality. Extensive experimental results demonstrate the effectiveness of the DeGLA framework. Compared to previous state-of-the-art methods, DeGLA achieves an average enhancement of 3.5% across the VALSE, SugarCrepe, and ARO benchmarks. Concurrently, it obtains an average performance improvement of 13.0% on zero-shot classification tasks across eleven datasets. Our code will be released at https://github.com/xiaoxing2001/DeGLA.
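
The abstract describes two mechanisms: an EMA-based self-distillation constraint that preserves pretrained knowledge during fine-tuning, and contrastive losses over generated hard-negative captions. The sketch below is a minimal, hypothetical illustration of both ideas in PyTorch; it is not the authors' implementation. The function names (`update_ema_teacher`, `self_distillation_loss`, `image_grounded_contrast`), the single-negative-per-image form of the image-grounded contrast, and the KL-based distillation objective are all assumptions made for exposition; the TGC loss would be the analogous text-grounded counterpart.

```python
import copy
import torch
import torch.nn.functional as F


@torch.no_grad()
def update_ema_teacher(student, teacher, ema_decay=0.999):
    # Exponential-moving-average update: the frozen teacher slowly tracks the student.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(ema_decay).add_(p_s.detach(), alpha=1.0 - ema_decay)


def self_distillation_loss(student_img, student_txt, teacher_img, teacher_txt, temperature=0.07):
    # One plausible form of the global self-distillation constraint: match the
    # student's image-to-text similarity distribution to the EMA teacher's.
    s_logits = student_img @ student_txt.t() / temperature
    with torch.no_grad():
        t_probs = (teacher_img @ teacher_txt.t() / temperature).softmax(dim=-1)
    return F.kl_div(s_logits.log_softmax(dim=-1), t_probs, reduction="batchmean")


def image_grounded_contrast(image_emb, pos_txt_emb, neg_txt_emb, temperature=0.07):
    # Simplified stand-in for the IGC loss: each image must score its matching caption
    # above its generated hard-negative caption (embeddings assumed L2-normalized).
    pos_sim = (image_emb * pos_txt_emb).sum(dim=-1, keepdim=True)  # (B, 1)
    neg_sim = (image_emb * neg_txt_emb).sum(dim=-1, keepdim=True)  # (B, 1)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature    # (B, 2)
    targets = torch.zeros(logits.size(0), dtype=torch.long)        # positive caption is index 0
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    # Toy usage with random, normalized embeddings.
    B, D = 4, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    pos = F.normalize(torch.randn(B, D), dim=-1)
    neg = F.normalize(torch.randn(B, D), dim=-1)
    print("IGC (sketch):", image_grounded_contrast(img, pos, neg).item())

    student = torch.nn.Linear(D, D)
    teacher = copy.deepcopy(student).requires_grad_(False)
    update_ema_teacher(student, teacher)
```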
