MMTEB: Massive Multilingual Text Embedding Benchmark
February 19, 2025
作者: Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Çağatan, Akash Kundu, Martin Bernstorff, Shitao Xiao, Akshita Sukhlecha, Bhavish Pahwa, Rafał Poświata, Kranthi Kiran GV, Shawon Ashraf, Daniel Auras, Björn Plüster, Jan Philipp Harries, Loïc Magne, Isabelle Mohr, Mariya Hendriksen, Dawei Zhu, Hippolyte Gisserot-Boukhlef, Tom Aarsen, Jan Kostkan, Konrad Wojtasik, Taemin Lee, Marek Šuppa, Crystina Zhang, Roberta Rocca, Mohammed Hamdy, Andrianos Michail, John Yang, Manuel Faysse, Aleksei Vatolin, Nandan Thakur, Manan Dey, Dipam Vasani, Pranjal Chitale, Simone Tedeschi, Nguyen Tai, Artem Snegirev, Michael Günther, Mengzhou Xia, Weijia Shi, Xing Han Lù, Jordan Clive, Gayatri Krishnakumar, Anna Maksimova, Silvan Wehrli, Maria Tikhonova, Henil Panchal, Aleksandr Abramov, Malte Ostendorff, Zheng Liu, Simon Clematide, Lester James Miranda, Alena Fenogenova, Guangyu Song, Ruqiya Bin Safi, Wen-Ding Li, Alessia Borghini, Federico Cassano, Hongjin Su, Jimmy Lin, Howard Yen, Lasse Hansen, Sara Hooker, Chenghao Xiao, Vaibhav Adlakha, Orion Weller, Siva Reddy, Niklas Muennighoff
cs.AI
Abstract
Text embeddings are typically evaluated on a limited set of tasks, which are
constrained by language, domain, and task diversity. To address these
limitations and provide a more comprehensive evaluation, we introduce the
Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale,
community-driven expansion of MTEB, covering over 500 quality-controlled
evaluation tasks across 250+ languages. MMTEB includes a diverse set of
challenging, novel tasks such as instruction following, long-document
retrieval, and code retrieval, representing the largest multilingual collection
of evaluation tasks for embedding models to date. Using this collection, we
develop several highly multilingual benchmarks, which we use to evaluate a
representative set of models. We find that while large language models (LLMs)
with billions of parameters can achieve state-of-the-art performance on certain
language subsets and task categories, the best-performing publicly available
model is multilingual-e5-large-instruct with only 560 million parameters. To
facilitate accessibility and reduce computational cost, we introduce a novel
downsampling method based on inter-task correlation, ensuring a diverse
selection while preserving relative model rankings. Furthermore, we optimize
tasks such as retrieval by sampling hard negatives, creating smaller but
effective splits. These optimizations allow us to introduce benchmarks that
drastically reduce computational demands. For instance, our newly introduced
zero-shot English benchmark maintains a ranking order similar to the full-scale
version but at a fraction of the computational cost.
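
The kind of evaluation described above can be run with the open-source mteb package. The snippet below is a minimal sketch using the classic MTEB Python API and a SentenceTransformer-compatible encoder; the task names are illustrative stand-ins rather than a full MMTEB benchmark selection, and the output path is a hypothetical example.

```python
# Minimal sketch: evaluating an embedding model with the `mteb` package.
# The task names are illustrative; MMTEB benchmarks bundle 500+ such tasks.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# The abstract reports multilingual-e5-large-instruct (560M parameters) as the
# best-performing publicly available model; here it simply serves as an example encoder.
model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
evaluation.run(model, output_folder="results/multilingual-e5-large-instruct")
```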
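
The correlation-based downsampling is only described at a high level in the abstract. Purely as an illustration of the idea (not the paper's actual algorithm), one could score each task by how strongly its per-model scores correlate with the rest of the selection and greedily drop the most redundant task until a budget is met:

```python
# Illustrative sketch (not the paper's exact procedure) of downsampling a task
# set via inter-task correlation: a task whose scores are well predicted by the
# remaining tasks adds little ranking information and is dropped first.
import numpy as np
from scipy.stats import spearmanr

def greedy_downsample(scores: np.ndarray, task_names: list[str], keep: int) -> list[str]:
    """scores: (n_models, n_tasks) matrix of per-task benchmark scores."""
    kept = list(range(scores.shape[1]))
    while len(kept) > keep:
        redundancies = []
        for t in kept:
            others = [o for o in kept if o != t]
            # How well is task t already "explained" by the mean of the other kept tasks?
            rest_mean = scores[:, others].mean(axis=1)
            rho, _ = spearmanr(scores[:, t], rest_mean)
            redundancies.append((rho, t))
        # Drop the most redundant task, keeping the remaining selection diverse.
        _, drop = max(redundancies)
        kept.remove(drop)
    return [task_names[t] for t in kept]
```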