
Discriminative Fine-tuning of LVLMs

December 5, 2024
作者: Yassine Ouali, Adrian Bulat, Alexandros Xenos, Anestis Zaganidis, Ioannis Maniadis Metaxas, Georgios Tzimiropoulos, Brais Martinez
cs.AI

Abstract

Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting a "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity for training the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including on standard image-text retrieval benchmarks, along with notable gains in compositionality.
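To make the two core ingredients concrete, the sketch below shows (a) a generic LoRA adapter layer of the kind used for parameter-efficient adaptation, and (b) a joint objective combining a symmetric InfoNCE contrastive loss with a next-token-prediction cross-entropy loss. This is a minimal illustration of the general techniques named in the abstract, not the paper's actual implementation: the class and argument names (`LoRALinear`, `joint_loss`, the weighting `alpha`, `temperature`) and the 50/50 loss weighting are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update
    (a generic LoRA sketch, not the paper's exact adapter)."""
    def __init__(self, in_features, out_features, rank=4, alpha=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        for p in self.base.parameters():
            p.requires_grad_(False)          # only A and B are trained
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))  # zero-init: no change at step 0
        self.scaling = alpha / rank

    def forward(self, x):
        # Base output plus scaled low-rank update (B @ A) applied to x.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scaling

def joint_loss(img_emb, txt_emb, lm_logits, target_ids,
               temperature=0.07, alpha=0.5):
    """Symmetric InfoNCE over pooled image/text embeddings, plus a
    next-token-prediction cross-entropy term. The relative weighting
    and tensor names here are illustrative assumptions."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / temperature        # (B, B); matched pairs on the diagonal
    labels = torch.arange(sim.size(0))
    contrastive = 0.5 * (F.cross_entropy(sim, labels)
                         + F.cross_entropy(sim.t(), labels))
    # Shift so the logits at position t predict the token at t + 1.
    ntp = F.cross_entropy(lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
                          target_ids[:, 1:].reshape(-1))
    return alpha * contrastive + (1.0 - alpha) * ntp
```

In a setup like this, only the LoRA matrices (and any soft-prompt embeddings prepended to the input sequence) receive gradients, while the pretrained LVLM weights stay frozen, which is what makes the adaptation parameter-efficient.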

