
MIEB: Massive Image Embedding Benchmark

April 14, 2025
作者: Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff
cs.AI

Abstract

Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models on MIEB, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models, such as their accurate visual representation of texts, as well as their still-limited capabilities in interleaved encodings and in matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.

