

LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning

March 4, 2025
Authors: Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su
cs.AI

Abstract

Universal multimodal embedding models play a critical role in tasks such as interleaved image-text retrieval, multimodal RAG, and multimodal clustering. However, our empirical results indicate that existing LMM-based embedding models trained with the standard InfoNCE loss exhibit a high degree of overlap in the similarity distributions of positive and negative pairs, making it challenging to distinguish hard negative pairs effectively. To address this issue, we propose a simple yet effective framework that dynamically improves the embedding model's representation learning for negative pairs based on their discriminative difficulty. Within this framework, we train a series of models, named LLaVE, and evaluate them on the MMEB benchmark, which covers 4 meta-tasks and 36 datasets. Experimental results show that LLaVE establishes stronger baselines that achieve state-of-the-art (SOTA) performance while demonstrating strong scalability and efficiency. Specifically, LLaVE-2B surpasses the previous SOTA 7B models, while LLaVE-7B achieves a further performance improvement of 6.2 points. Although LLaVE is trained on image-text data, it can generalize to text-video retrieval tasks in a zero-shot manner and achieve strong performance, demonstrating its remarkable potential for transfer to other embedding tasks.
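
The abstract describes hardness-weighted contrastive learning only at a high level and does not give the exact loss formulation. As a rough, non-authoritative illustration of the general idea, the sketch below shows one way to up-weight hard negatives inside an InfoNCE-style loss; the function name, the exponential weighting scheme, and the `hardness_alpha` hyperparameter are assumptions made for illustration, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def hardness_weighted_info_nce(query_emb, target_emb, temperature=0.05, hardness_alpha=1.0):
    """Illustrative hardness-weighted InfoNCE loss (sketch, not the LLaVE formulation).

    query_emb, target_emb: (batch, dim) embeddings where (query_emb[i], target_emb[i])
    is the positive pair and all other in-batch targets serve as negatives.
    """
    # Normalize so the dot product is cosine similarity.
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the positive pairs.
    sim = q @ t.T / temperature

    # Up-weight harder negatives: the more similar a negative target is to the
    # query, the larger its weight in the softmax denominator.
    # (This exponential weighting is an assumption for illustration.)
    with torch.no_grad():
        weights = torch.exp(hardness_alpha * sim)
        weights.fill_diagonal_(1.0)  # leave the positive pairs unweighted

    # Adding log(weights) to the logits is equivalent to multiplying exp(sim) by weights.
    logits = sim + torch.log(weights)
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

Under this kind of weighting, negatives whose similarity to the query is already high contribute more to the denominator, so the gradient focuses on separating exactly the hard negative pairs that the abstract identifies as overlapping with positives under standard InfoNCE.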
