mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
February 12, 2025
Authors: Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
cs.AI
Abstract
Multimodal embedding models have gained significant attention for their
ability to map data from different modalities, such as text and images, into a
unified representation space. However, the limited labeled multimodal data
often hinders embedding performance. Recent approaches have leveraged data
synthesis to address this problem, yet the quality of synthetic data remains a
critical bottleneck. In this work, we identify three criteria for high-quality
synthetic multimodal data. First, broad scope ensures that the generated data
covers diverse tasks and modalities, making it applicable to various downstream
scenarios. Second, robust cross-modal alignment makes different modalities
semantically consistent. Third, high fidelity ensures that the synthetic data
maintains realistic details to enhance its reliability. Guided by these
principles, we synthesize datasets that: (1) cover a wide range of tasks,
modality combinations, and languages, (2) are generated via a deep thinking
process within a single pass of a multimodal large language model, and (3)
incorporate real-world images with accurate and relevant texts, ensuring
fidelity through self-evaluation and refinement. Leveraging these high-quality
synthetic and labeled datasets, we train a multimodal multilingual E5 model
mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art
performance on the MMEB Benchmark and superior multilingual performance on the
XTD benchmark. Our code, datasets, and models are released at
https://github.com/haon-chen/mmE5.
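As a minimal illustrative sketch (not the authors' code), the core idea of a unified representation space can be shown with placeholder embedding vectors: a model like mmE5 maps both texts and images into the same vector space, so cross-modal retrieval reduces to nearest-neighbor search by cosine similarity. The vectors and file names below are hypothetical.

```python
import numpy as np

# Hypothetical embeddings in a shared space (dim=4 for illustration).
# A multimodal embedding model such as mmE5 would produce vectors like
# these from a text query and from candidate images.
text_emb = np.array([0.9, 0.1, 0.0, 0.4])   # e.g. "a photo of a cat"
image_embs = {
    "cat.jpg": np.array([0.8, 0.2, 0.1, 0.5]),
    "car.jpg": np.array([0.0, 0.9, 0.8, 0.1]),
}

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Cross-modal retrieval: rank images by similarity to the text query.
ranked = sorted(image_embs, key=lambda k: cosine(text_emb, image_embs[k]),
                reverse=True)
print(ranked[0])  # best-matching image for the text query
```

Because both modalities live in one space, the same ranking procedure works for text-to-image, image-to-text, or mixed-modality queries.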