
MIEB: Massive Image Embedding Benchmark

April 14, 2025
作者: Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
cs.AI

Abstract

Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models, finding that no single method dominates across all task categories. We reveal hidden capabilities of advanced vision models, such as their accurate visual representation of texts, as well as their still-limited capabilities in interleaved encoding and in matching images with texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
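The abstract contrasts tasks such as clustering with retrieving relevant images given a piece of text. As a rough illustration of how a retrieval-style benchmark task scores an embedding model, here is a minimal, self-contained sketch: rank candidate image embeddings by cosine similarity to each text query embedding and report recall@1. The toy vectors and helper names below are illustrative assumptions, not MIEB's actual harness; a real run would use the mteb library and embeddings from an actual image-text model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recall_at_1(text_embs, image_embs):
    """Fraction of queries whose top-ranked image is the correct one.

    text_embs[i] is the query whose ground-truth match is image_embs[i].
    """
    hits = 0
    for i, query in enumerate(text_embs):
        best = max(range(len(image_embs)),
                   key=lambda j: cosine(query, image_embs[j]))
        hits += (best == i)
    return hits / len(text_embs)

# Toy aligned pairs: each text embedding lies closest to its own image.
texts = [[1.0, 0.1], [0.1, 1.0]]
images = [[0.9, 0.2], [0.2, 0.9]]
print(recall_at_1(texts, images))  # 1.0
```

The same ranked-similarity machinery underlies other task families in the benchmark (e.g. clustering or classification over the same embeddings), which is why a single embedding model can be compared across categories.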

