MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval
December 19, 2024
Authors: Junjie Zhou, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, Defu Lian, Yongping Xiong
cs.AI
Abstract
Despite the rapidly growing demand for multimodal retrieval, progress in this
field remains severely constrained by a lack of training data. In this paper,
we introduce MegaPairs, a novel data synthesis method that leverages vision
language models (VLMs) and open-domain images, together with a massive
synthetic dataset generated from this method. Our empirical analysis shows that
MegaPairs generates high-quality data, enabling the multimodal retriever to
significantly outperform the baseline model trained on 70× more data
from existing datasets. Moreover, since MegaPairs solely relies on general
image corpora and open-source VLMs, it can be easily scaled up, enabling
continuous improvements in retrieval performance. In this stage, we produced
more than 26 million training instances and trained several models of varying
sizes using this data. These new models achieve state-of-the-art zero-shot
performance across 4 popular composed image retrieval (CIR) benchmarks and the
highest overall performance on the 36 datasets provided by MMEB. They also
demonstrate notable performance improvements with additional downstream
fine-tuning. Our produced dataset, well-trained models, and data synthesis
pipeline will be made publicly available to facilitate the future development
of this field.
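To make the data synthesis idea concrete, below is a minimal, hypothetical sketch of the general recipe the abstract describes: mine correlated image pairs from an open-domain corpus and have a VLM annotate each pair with a textual instruction, yielding (query image, instruction, target image) training triplets. The embedder and the VLM call are placeholder stubs, and the pair-mining and prompting details are assumptions, not the paper's actual pipeline.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Triplet:
    query_image: str   # path or ID of the query image
    instruction: str   # VLM-generated text describing how the target relates to the query
    target_image: str  # path or ID of the target image

def embed_images(image_ids, dim=64, seed=0):
    """Placeholder embedder: stands in for a real vision encoder (e.g. a CLIP-style model)."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(len(image_ids), dim))

def describe_relation(query_id, target_id):
    """Placeholder for prompting an open-source VLM to verbalize the relation between two images."""
    return f"find an image related to {query_id} that matches the content of {target_id}"

def synthesize_triplets(image_ids, neighbors_per_query=2):
    """Pair each image with its nearest neighbors and annotate each pair with an instruction."""
    embs = embed_images(image_ids)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # cosine similarity via normalized dot product
    sims = embs @ embs.T
    triplets = []
    for i, query_id in enumerate(image_ids):
        ranked = np.argsort(-sims[i])                    # most similar first; index 0 is the image itself
        for j in ranked[1:neighbors_per_query + 1]:
            instruction = describe_relation(query_id, image_ids[j])
            triplets.append(Triplet(query_id, instruction, image_ids[j]))
    return triplets

if __name__ == "__main__":
    corpus = [f"img_{i:05d}.jpg" for i in range(8)]      # stands in for an open-domain image corpus
    for t in synthesize_triplets(corpus)[:3]:
        print(t)
```

In practice the neighbor-mining step would use real visual and semantic similarity models rather than random embeddings, and the instruction generator would be an actual open-source VLM; this sketch only illustrates the triplet-construction structure that such a pipeline produces.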