RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm
February 18, 2025
Authors: Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
cs.AI
Abstract
After pre-training on extensive image-text pairs, Contrastive Language-Image
Pre-training (CLIP) demonstrates promising performance on a wide variety of
benchmarks. However, a substantial volume of non-paired data, such as
multimodal interleaved documents, remains underutilized for vision-language
representation learning. To fully leverage these unpaired documents, we
first establish a Real-World Data Extraction pipeline to extract
high-quality images and texts. Then we design a hierarchical retrieval method
to efficiently associate each image with multiple semantically relevant
realistic texts. To further enhance fine-grained visual information, we propose
an image semantic augmented generation module for synthetic text production.
Furthermore, we employ a semantic balance sampling strategy to improve dataset
diversity, enabling better learning of long-tail concepts. Based on these
innovations, we construct RealSyn, a dataset combining realistic and synthetic
texts, available in three scales: 15M, 30M, and 100M. Extensive experiments
demonstrate that RealSyn effectively advances vision-language representation
learning and exhibits strong scalability. Models pre-trained on RealSyn achieve
state-of-the-art performance on multiple downstream tasks. To facilitate future
research, the RealSyn dataset and pre-trained model weights are released at
https://github.com/deepglint/RealSyn.
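
The abstract does not spell out how the hierarchical retrieval step works. One plausible reading is a coarse-to-fine search: first pick the text cluster closest to the image embedding, then rank only the texts inside that cluster. The sketch below illustrates that idea with toy 2-D vectors. The function names, the cluster layout, and the use of cosine similarity are all assumptions made for illustration and are not taken from the RealSyn code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hierarchical_retrieve(image_vec, centroids, clusters, top_k=3):
    """Two-stage retrieval (hypothetical helper, not the authors' API).

    centroids: one precomputed vector per text cluster.
    clusters:  list of lists of (text, vector) pairs, aligned with centroids.
    """
    # Stage 1: coarse search over the cluster centroids only.
    best = max(range(len(centroids)),
               key=lambda i: cosine(image_vec, centroids[i]))
    # Stage 2: fine search restricted to the texts of the chosen cluster.
    ranked = sorted(clusters[best],
                    key=lambda pair: cosine(image_vec, pair[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Toy data: two clusters with hand-written embeddings (assumed values).
centroids = [[1.0, 0.0], [0.0, 1.0]]
clusters = [
    [("a dog on grass", [0.9, 0.1]),
     ("a puppy running", [0.8, 0.3]),
     ("a cat", [0.6, 0.5])],
    [("a city skyline", [0.1, 0.9]),
     ("tall buildings", [0.2, 0.8])],
]

result = hierarchical_retrieve([1.0, 0.05], centroids, clusters, top_k=2)
print(result)
```

Restricting the fine search to one cluster is what would make such a scheme cheaper than comparing each image against every sentence, at the cost of missing relevant texts that fall in neighbouring clusters.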