AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
February 3, 2025
Authors: Ahmed Masry, Juan A. Rodriguez, Tianyu Zhang, Suyuchen Wang, Chao Wang, Aarash Feizi, Akshay Kalkunte Suresh, Abhay Puri, Xiangru Jian, Pierre-André Noël, Sathwik Tejaswi Madhusudhan, Marco Pedersoli, Bang Liu, Nicolas Chapados, Yoshua Bengio, Enamul Hoque, Christopher Pal, Issam H. Laradji, David Vazquez, Perouz Taslakian, Spandana Gella, Sai Rajeswar
cs.AI
Abstract
Aligning visual features with language embeddings is a key challenge in
vision-language models (VLMs). The performance of such models hinges on having
a good connector that maps visual features generated by a vision encoder to a
shared embedding space with the LLM while preserving semantic similarity.
Existing connectors, such as multilayer perceptrons (MLPs), often produce
out-of-distribution or noisy inputs, leading to misalignment between the
modalities. In this work, we propose a novel vision-text alignment method,
AlignVLM, that maps visual features to a weighted average of LLM text
embeddings. Our approach leverages the linguistic priors encoded by the LLM to
ensure that visual features are mapped to regions of the space that the LLM can
effectively interpret. AlignVLM is particularly effective for document
understanding tasks, where scanned document images must be accurately mapped to
their textual content. Our extensive experiments show that AlignVLM achieves
state-of-the-art performance compared to prior alignment methods. We provide
further analysis demonstrating improved vision-text feature alignment and
robustness to noise.
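The abstract describes the core idea of the connector: instead of projecting visual features into an unconstrained space, each visual feature is mapped to a weighted average of the LLM's text embeddings, so the connector's output always lies inside the region the LLM already interprets well. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation, and names such as `AlignConnector`, `vision_dim`, and the softmax-over-vocabulary weighting are illustrative assumptions inferred from the description.

```python
# Minimal sketch of the vision-text connector idea described in the abstract:
# project each visual feature onto the LLM vocabulary, normalize with a softmax,
# and output a convex combination (weighted average) of the LLM's text embeddings.
# Module and argument names are hypothetical, not the paper's actual code.
import torch
import torch.nn as nn


class AlignConnector(nn.Module):
    def __init__(self, vision_dim: int, llm_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = llm_embeddings.shape
        # Frozen copy of the LLM's input embedding matrix (vocab_size x llm_dim).
        self.register_buffer("text_embeddings", llm_embeddings)
        # Scores each visual feature against every vocabulary token.
        self.to_vocab = nn.Linear(vision_dim, vocab_size)

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        weights = torch.softmax(self.to_vocab(visual_feats), dim=-1)
        # Convex combination of text embeddings keeps the output inside the
        # distribution of embeddings the LLM was trained on.
        return weights @ self.text_embeddings  # (batch, num_patches, llm_dim)


if __name__ == "__main__":
    # Toy shapes for a quick smoke test (real vocabularies are much larger).
    llm_emb = torch.randn(1000, 256)           # pretend LLM embedding table
    connector = AlignConnector(vision_dim=64, llm_embeddings=llm_emb)
    feats = torch.randn(2, 16, 64)             # pretend vision-encoder patch features
    print(connector(feats).shape)              # torch.Size([2, 16, 256])
```

Because the output is a convex combination of existing text embeddings, the connector cannot produce the out-of-distribution or noisy inputs the abstract attributes to plain MLP connectors, which is the intuition behind the reported robustness to noise.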