ABC: Achieving Better Control of Multimodal Embeddings using VLMs
March 1, 2025
Authors: Benjamin Schneider, Florian Kerschbaum, Wenhu Chen
cs.AI
Abstract
Visual embedding models excel at zero-shot tasks like visual retrieval and
classification. However, these models cannot be used for tasks that contain
ambiguity or require user instruction. These tasks necessitate a multimodal
embedding model, which outputs embeddings that combine visual and natural
language input. Existing CLIP-based approaches embed images and text
independently, and fuse the result. We find that this results in weak
interactions between modalities, and poor user control over the representation.
We introduce ABC, an open-source multimodal embedding model that uses a
vision-language model backbone to deeply integrate image features with natural
language instructions. ABC achieves best-for-size performance on MSCOCO
image-to-text retrieval and is the top performing model on classification and
VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly
unified vision-language representation, ABC can use natural language to solve
subtle and potentially ambiguous visual retrieval problems. To evaluate this
capability, we design CtrlBench, a benchmark that requires interleaving textual
instructions with image content for correct retrieval. ABC advances the state
of multimodal embeddings by offering high-quality representations and flexible
natural language control. Our model and datasets are available at our project
page.
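The contrast the abstract draws between CLIP-style late fusion and a VLM-backbone embedder can be made concrete with a short sketch. The snippet below is illustrative only: the CLIP half uses the real Hugging Face `transformers` CLIP API, while `AbcEmbedder` and its `embed` method are hypothetical placeholders standing in for ABC's actual interface, which is not described in this abstract.

```python
# Illustrative sketch: CLIP-style late fusion vs. joint image+instruction embedding.
# The CLIP portion uses the real Hugging Face `transformers` API; `AbcEmbedder`
# below is a hypothetical stand-in for a VLM-backbone embedder such as ABC.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

image = Image.open("street_scene.jpg")  # example image (assumed to exist locally)
instruction = "Focus on the vehicle parked on the left."

# --- CLIP-style late fusion: image and text are embedded independently ---
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
inputs = proc(text=[instruction], images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Fusion happens only after both encoders have finished (here, by summing the
# L2-normalized embeddings), so the instruction cannot influence how the image
# itself is encoded -- the weak cross-modal interaction the paper points out.
fused = torch.nn.functional.normalize(img_emb, dim=-1) + \
        torch.nn.functional.normalize(txt_emb, dim=-1)
fused = torch.nn.functional.normalize(fused, dim=-1)

# --- VLM-backbone embedding (hypothetical interface) ---
# A model like ABC would instead process the image and the instruction in a
# single forward pass, letting the instruction steer which visual features end
# up in the embedding. `AbcEmbedder` is not a real class; it only marks where
# such a model would slot in.
# abc = AbcEmbedder.from_pretrained("...")
# joint_emb = abc.embed(image=image, instruction=instruction)
```

The design point the sketch highlights is where fusion happens: after both encoders (CLIP) versus inside a single vision-language forward pass (ABC), which is what enables instruction-controlled retrieval on tasks like those in CtrlBench.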