
Contrastive Localized Language-Image Pre-Training

October 3, 2024
Authors: Hong-You Chen, Zhengfeng Lai, Haotian Zhang, Xinze Wang, Marcin Eichner, Keen You, Meng Cao, Bowen Zhang, Yinfei Yang, Zhe Gan
cs.AI

Abstract

Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled, noisy text annotations at the image level. Nevertheless, such criteria may become insufficient for downstream tasks that need fine-grained vision representations, especially when region-level understanding is demanded by MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, whereby the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
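The two core ideas in the abstract — deriving a region embedding from an image embedding given a spatial hint, and aligning it with region text via a CLIP-style contrastive loss — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the mean-pooling over a box prompt and the symmetric InfoNCE loss below are standard stand-ins chosen for clarity, and all function names and shapes are assumptions.

```python
import numpy as np

def region_embed(feature_map, box):
    """Turn a spatial feature map into one region embedding, given a box prompt.

    feature_map: (H, W, D) grid of per-patch features from the image encoder.
    box: (y0, x0, y1, x1) in grid coordinates, end-exclusive.
    Here the "promptable" transform is simple mean pooling over the box,
    followed by L2 normalization; CLOC learns this transform instead.
    """
    y0, x0, y1, x1 = box
    patches = feature_map[y0:y1, x0:x1].reshape(-1, feature_map.shape[-1])
    v = patches.mean(axis=0)
    return v / np.linalg.norm(v)

def region_text_contrastive_loss(region_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over matched (region, text) pairs, CLIP-style.

    region_embs, text_embs: (N, D), L2-normalized; pair i matches pair i.
    """
    logits = region_embs @ text_embs.T / temperature  # (N, N) similarity
    n = logits.shape[0]

    def nll(l):
        # Numerically stable log-softmax; take the diagonal (matched pairs).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[np.arange(n), np.arange(n)].mean()

    # Average region-to-text and text-to-region directions.
    return 0.5 * (nll(logits) + nll(logits.T))
```

In this sketch, a well-trained model would place each region embedding closest to its own caption embedding, driving the diagonal of the logit matrix up and the loss toward zero; mismatched region-text pairs act as in-batch negatives, exactly as image-text pairs do in standard CLIP.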
