MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
December 19, 2024
Authors: Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong
cs.AI
Abstract
Despite the rapidly growing demand for multimodal retrieval, progress in this
field remains severely constrained by a lack of training data. In this paper,
we introduce MegaPairs, a novel data synthesis method that leverages vision
language models (VLMs) and open-domain images, together with a massive
synthetic dataset generated from this method. Our empirical analysis shows that
MegaPairs generates high-quality data, enabling the multimodal retriever to
significantly outperform the baseline model trained on 70× more data
from existing datasets. Moreover, since MegaPairs solely relies on general
image corpora and open-source VLMs, it can be easily scaled up, enabling
continuous improvements in retrieval performance. In this stage, we produced
more than 26 million training instances and trained several models of varying
sizes using this data. These new models achieve state-of-the-art zero-shot
performance across 4 popular composed image retrieval (CIR) benchmarks and the
highest overall performance on the 36 datasets provided by MMEB. They also
demonstrate notable performance improvements with additional downstream
fine-tuning. Our produced dataset, well-trained models, and data synthesis
pipeline will be made publicly available to facilitate the future development
of this field.
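The abstract only outlines the synthesis recipe at a high level: mine correlated image pairs from an open-domain corpus and have an open-source VLM write the instruction that links them, yielding (query image, instruction, target image) training triplets. The sketch below illustrates that idea under stated assumptions; `embed_images`, `vlm_describe_relation`, and `Triplet` are illustrative placeholders, not the paper's actual pipeline, and the pair-mining and prompting details here are simplified guesses.

```python
# Hypothetical sketch of a MegaPairs-style synthesis loop (names, prompts, and
# pair-mining heuristics are illustrative; the paper's pipeline may differ).
from dataclasses import dataclass

import numpy as np


@dataclass
class Triplet:
    query_image: str   # path of the query image
    instruction: str   # VLM-written text relating query to target
    target_image: str  # path of the image the instruction should retrieve


def embed_images(paths: list[str]) -> np.ndarray:
    """Placeholder: return L2-normalized image embeddings (e.g. from a CLIP-like encoder)."""
    rng = np.random.default_rng(0)                  # stand-in for a real image encoder
    vecs = rng.normal(size=(len(paths), 512))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)


def vlm_describe_relation(query_path: str, target_path: str) -> str:
    """Placeholder: prompt an open-source VLM to write an open-ended
    instruction that would turn the query image into the target image."""
    return f"Show a scene like {query_path}, modified to match {target_path}."


def synthesize_triplets(image_paths: list[str], k: int = 3) -> list[Triplet]:
    """Mine visually correlated image pairs by embedding similarity, then have
    a VLM write the relational instruction for each mined pair."""
    emb = embed_images(image_paths)
    sims = emb @ emb.T                              # cosine similarity (rows are normalized)
    np.fill_diagonal(sims, -np.inf)                 # exclude self-pairs
    triplets = []
    for i, query in enumerate(image_paths):
        for j in np.argsort(-sims[i])[:k]:          # top-k most related images as targets
            target = image_paths[j]
            triplets.append(Triplet(query, vlm_describe_relation(query, target), target))
    return triplets


if __name__ == "__main__":
    corpus = [f"img_{n}.jpg" for n in range(10)]    # stand-in for an open-domain image corpus
    print(len(synthesize_triplets(corpus)), "synthetic training triplets")
```

Because the recipe only needs a general image corpus, an off-the-shelf embedding model for pair mining, and an open-source VLM for instruction writing, the loop above scales by simply enlarging the corpus, which is consistent with the abstract's claim that the dataset can be grown continuously.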