RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
February 18, 2025
Authors: Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
cs.AI
Abstract
After pre-training on extensive image-text pairs, Contrastive Language-Image
Pre-training (CLIP) demonstrates promising performance on a wide variety of
benchmarks. However, a substantial volume of non-paired data, such as
multimodal interleaved documents, remains underutilized for vision-language
representation learning. To fully leverage these unpaired documents, we first
establish a Real-World Data Extraction pipeline to extract
high-quality images and texts. Then we design a hierarchical retrieval method
to efficiently associate each image with multiple semantically relevant
realistic texts. To further enhance fine-grained visual information, we propose
an image semantic augmented generation module for synthetic text production.
Furthermore, we employ a semantic balance sampling strategy to improve dataset
diversity, enabling better learning of long-tail concepts. Based on these
innovations, we construct RealSyn, a dataset combining realistic and synthetic
texts, available in three scales: 15M, 30M, and 100M. Extensive experiments
demonstrate that RealSyn effectively advances vision-language representation
learning and exhibits strong scalability. Models pre-trained on RealSyn achieve
state-of-the-art performance on multiple downstream tasks. To facilitate future
research, the RealSyn dataset and pre-trained model weights are released at
https://github.com/deepglint/RealSyn.