Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
April 24, 2025
Authors: Tiancheng Gu, Kaicheng Yang, Ziyong Feng, Xingjun Wang, Yanzhao Zhang, Dingkun Long, Yingda Chen, Weidong Cai, Jiankang Deng
cs.AI
Abstract
The Contrastive Language-Image Pre-training (CLIP) framework has become a
widely used approach for multimodal representation learning, particularly in
image-text retrieval and clustering. However, its efficacy is constrained by
three key limitations: (1) text token truncation, (2) isolated image-text
encoding, and (3) deficient compositionality due to bag-of-words behavior.
While recent Multimodal Large Language Models (MLLMs) have demonstrated
significant advances in generalized vision-language understanding, their
potential for learning transferable multimodal representations remains
underexplored. In this work, we present UniME (Universal Multimodal Embedding),
a novel two-stage framework that leverages MLLMs to learn discriminative
representations for diverse downstream tasks. In the first stage, we perform
textual discriminative knowledge distillation from a powerful LLM-based teacher
model to enhance the embedding capability of the MLLM's language component. In
the second stage, we introduce hard negative enhanced instruction tuning to
further advance discriminative representation learning. Specifically, we
initially mitigate false negative contamination and then sample multiple hard
negatives per instance within each batch, forcing the model to focus on
challenging samples. This approach not only improves discriminative power but
also enhances instruction-following ability in downstream tasks. We conduct
extensive experiments on the MMEB benchmark and multiple retrieval tasks,
including short and long caption retrieval and compositional retrieval. Results
demonstrate that UniME achieves consistent performance improvement across all
tasks, exhibiting superior discriminative and compositional capabilities.
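The two training stages described in the abstract lend themselves to compact sketches. The first snippet is a minimal, hypothetical illustration of Stage 1 (textual discriminative knowledge distillation): the MLLM's language component is trained to match the in-batch text-similarity distribution of a frozen LLM-based teacher. The KL objective, temperature, and function names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def text_distillation_loss(student_emb: torch.Tensor,
                           teacher_emb: torch.Tensor,
                           temperature: float = 0.05) -> torch.Tensor:
    """Match the student's text-to-text similarity structure to the teacher's (sketch)."""
    s = F.normalize(student_emb, dim=-1)   # (B, D_s) student text embeddings
    t = F.normalize(teacher_emb, dim=-1)   # (B, D_t) frozen teacher text embeddings
    log_p_student = F.log_softmax(s @ s.t() / temperature, dim=-1)
    p_teacher = F.softmax(t @ t.t() / temperature, dim=-1)
    # KL divergence pulls the student's in-batch similarity distribution toward the teacher's.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```

The second snippet sketches Stage 2 (hard negative enhanced instruction tuning) as the abstract describes it: within each batch, candidates that look like false negatives are discarded, only the hardest remaining negatives are kept per query, and an InfoNCE-style loss is computed over the positive plus those hard negatives. The false-negative heuristic, top-k count, and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def hard_negative_contrastive_loss(query_emb: torch.Tensor,
                                   cand_emb: torch.Tensor,
                                   num_hard_negatives: int = 8,
                                   temperature: float = 0.05) -> torch.Tensor:
    """Contrastive loss restricted to the hardest in-batch negatives (sketch)."""
    q = F.normalize(query_emb, dim=-1)           # (B, D) query embeddings
    c = F.normalize(cand_emb, dim=-1)            # (B, D) candidates; positives lie on the diagonal
    sim = q @ c.t()                              # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                # (B, 1) similarity to the matched positive

    # Heuristic false-negative filter: a "negative" at least as similar as the
    # positive is likely a true match, so it is excluded from the negative pool.
    invalid = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device) | (sim >= pos)
    neg_sim = sim.masked_fill(invalid, float("-inf"))

    # Keep only the top-k hardest (most similar) remaining negatives per query.
    k = min(num_hard_negatives, sim.size(0) - 1)
    hard_negs, _ = neg_sim.topk(k, dim=1)        # (B, k)

    # InfoNCE over the positive and its hard negatives; index 0 is the positive.
    logits = torch.cat([pos, hard_negs], dim=1) / temperature
    labels = torch.zeros(sim.size(0), dtype=torch.long, device=sim.device)
    return F.cross_entropy(logits, labels)
```

Both sketches assume one in-batch positive per query and L2-normalized embeddings; the actual UniME objectives, hyperparameters, and false-negative criterion may differ.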