MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Abstract

Despite the rapidly growing demand for multimodal retrieval, progress in this field remains severely constrained by a lack of training data. In this paper, we introduce MegaPairs, a novel data synthesis method that leverages vision language models (VLMs) and open-domain images, together with a massive synthetic dataset generated with this method. Our empirical analysis shows that MegaPairs generates high-quality data, enabling a multimodal retriever trained on it to significantly outperform a baseline model trained on 70 times more data from existing datasets. Moreover, because MegaPairs relies solely on general image corpora and open-source VLMs, it can be scaled up easily, enabling continuous improvements in retrieval performance. At this stage, we have produced more than 26 million training instances and trained several models of varying sizes on this data. These new models achieve state-of-the-art zero-shot performance across 4 popular composed image retrieval (CIR) benchmarks and the highest overall performance on the 36 datasets provided by MMEB. They also show notable performance improvements with additional downstream fine-tuning. Our dataset, trained models, and data synthesis pipeline will be made publicly available to facilitate future development in this field.
