LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models

January 1, 2025
Authors: Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
cs.AI

Abstract

Recent advancements in large language models (LLMs) based embedding models have established new state-of-the-art benchmarks for text embedding tasks, particularly in dense vector-based retrieval. However, these models predominantly focus on English, leaving multilingual embedding capabilities largely unexplored. To address this limitation, we present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. These components are seamlessly integrated through a minimal set of trainable parameters that act as a connector, effectively transferring the multilingual encoder's language understanding capabilities to the specialized embedding model. Additionally, to comprehensively evaluate multilingual embedding performance, we introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages. Extensive experimental results demonstrate that LUSIFER significantly enhances the multilingual performance across various embedding tasks, particularly for medium and low-resource languages, without requiring explicit multilingual training data.
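The architecture described above, a frozen multilingual encoder whose representations are passed through a minimal set of trainable connector parameters into an LLM-based embedding model, can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the hidden sizes, the single linear-projection connector, and the mean-pooling stand-in for the LLM's embedding step are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hidden sizes (illustrative, not taken from the paper).
D_ENC = 768    # multilingual encoder hidden size (e.g. an XLM-R-sized model)
D_LLM = 4096   # LLM-based embedding model hidden size
BATCH, SEQ = 2, 16

# Output of the frozen multilingual encoder: contextual token
# representations for a (possibly non-English) input. Random stand-in here.
encoder_states = rng.standard_normal((BATCH, SEQ, D_ENC))

# Minimal trainable connector: a single linear projection mapping the
# encoder's language-universal space into the LLM's input space.
# In this sketch, only W_connector and b_connector would be trained.
W_connector = rng.standard_normal((D_ENC, D_LLM)) * 0.02
b_connector = np.zeros(D_LLM)

llm_inputs = encoder_states @ W_connector + b_connector  # (BATCH, SEQ, D_LLM)

# The specialized LLM-based embedding model would consume `llm_inputs`;
# mean-pooling over tokens stands in for its final embedding step here.
embeddings = llm_inputs.mean(axis=1)  # (BATCH, D_LLM)
print(embeddings.shape)
```

Because only the connector is trained (on English data), the multilingual encoder's language understanding transfers to the embedding model zero-shot, without multilingual supervision.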

