Discriminative Fine-tuning of LVLMs
December 5, 2024
Authors: Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Brais Martinez
cs.AI
Abstract
Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the
de facto approach for discriminative vision-language representation learning.
However, these models have limited language understanding, often exhibiting a
"bag of words" behavior. At the same time, Large Vision-Language Models
(LVLMs), which combine vision encoders with LLMs, have been shown capable of
detailed vision-language reasoning, yet their autoregressive nature renders
them less suitable for discriminative tasks.
In this work, we propose to combine "the best of both worlds": a new training
approach for discriminative fine-tuning of LVLMs that results in strong
discriminative and compositional capabilities. Essentially, our approach
converts a generative LVLM into a discriminative one, unlocking its capability
for powerful image-text discrimination combined with enhanced language
understanding.
Our contributions include: (1) A carefully designed training/optimization
framework that utilizes image-text pairs of variable length and granularity for
training the model with both contrastive and next-token prediction losses. This
is accompanied by ablation studies that justify the necessity of our
framework's components. (2) A parameter-efficient adaptation method using a
combination of soft prompting and LoRA adapters. (3) Significant improvements
over state-of-the-art CLIP-like models of similar size, including standard
image-text retrieval benchmarks and notable gains in compositionality.Summary
AI-Generated Summary
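To make the first contribution concrete, the sketch below shows one plausible way to combine a CLIP-style symmetric contrastive loss with a next-token prediction loss in PyTorch. This is a minimal illustration, not the authors' implementation: the function name `combined_loss` and the mixing weight `alpha` are assumptions, and the real framework additionally varies caption length and granularity across pairs.

```python
import torch
import torch.nn.functional as F

def combined_loss(img_emb, txt_emb, logits, target_ids,
                  temperature=0.07, alpha=0.5):
    """Hypothetical joint objective (names/weights are illustrative).

    img_emb, txt_emb: (B, D) pooled image/text embeddings from the LVLM
    logits:           (B, T, V) next-token logits over the caption
    target_ids:       (B, T) ground-truth caption token ids
    """
    # CLIP-style symmetric contrastive loss: matching pairs lie on
    # the diagonal of the batch similarity matrix.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sims = img_emb @ txt_emb.t() / temperature          # (B, B)
    labels = torch.arange(sims.size(0), device=sims.device)
    contrastive = (F.cross_entropy(sims, labels) +
                   F.cross_entropy(sims.t(), labels)) / 2

    # Standard next-token prediction (language-modeling) loss,
    # retained so the model keeps its generative language ability.
    lm = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         target_ids.reshape(-1))

    return alpha * contrastive + (1 - alpha) * lm
```

Training on both losses at once is what lets a generative LVLM acquire discriminative image-text alignment without discarding its language understanding.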
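The second contribution, parameter-efficient adaptation via soft prompting plus LoRA, can be sketched as follows. Again this is a generic illustration under assumed hyperparameters (`r`, `alpha`, `n_tokens`), not the paper's exact configuration: LoRA adds a trainable low-rank update to frozen linear layers, while soft prompting prepends learnable vectors to the input embeddings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A Gaussian-init and B zero-init,
    so the wrapped layer initially behaves exactly like the base layer."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the token embeddings."""
    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.01)

    def forward(self, token_embeds):          # (B, T, D)
        b = token_embeds.size(0)
        return torch.cat([self.prompt.expand(b, -1, -1),
                          token_embeds], dim=1)  # (B, n_tokens + T, D)
```

Only the soft-prompt vectors and the low-rank `A`/`B` matrices receive gradients, so the adapted model trains a small fraction of the LVLM's parameters.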