LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
January 1, 2025
Authors: Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
cs.AI
Abstract
Recent advancements in large language model (LLM)-based embedding models
have established new state-of-the-art benchmarks for text embedding tasks,
particularly in dense vector-based retrieval. However, these models
predominantly focus on English, leaving multilingual embedding capabilities
largely unexplored. To address this limitation, we present LUSIFER, a novel
zero-shot approach that adapts LLM-based embedding models for multilingual
tasks without requiring multilingual supervision. LUSIFER's architecture
combines a multilingual encoder, serving as a language-universal learner, with
an LLM-based embedding model optimized for embedding-specific tasks. These
components are seamlessly integrated through a minimal set of trainable
parameters that act as a connector, effectively transferring the multilingual
encoder's language understanding capabilities to the specialized embedding
model. Additionally, to comprehensively evaluate multilingual embedding
performance, we introduce a new benchmark encompassing 5 primary embedding
tasks, 123 diverse datasets, and coverage across 14 languages. Extensive
experimental results demonstrate that LUSIFER significantly enhances
multilingual performance across various embedding tasks, particularly for
medium and low-resource languages, without requiring explicit multilingual
training data.
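The abstract describes the architecture only at a high level. The sketch below illustrates the general idea: a frozen multilingual encoder and a frozen LLM-based embedding model bridged by a small set of trainable connector parameters. The linear-projection connector, the toy Transformer stand-ins, and all dimensions are illustrative assumptions, not LUSIFER's actual implementation.

```python
# Minimal PyTorch sketch of the bridged architecture described in the abstract.
# The connector design and all sizes here are hypothetical illustrations.
import torch
import torch.nn as nn


class Connector(nn.Module):
    """Hypothetical connector: projects multilingual encoder states into the
    LLM's input space; in this sketch it is the only trainable component."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, enc_states: torch.Tensor) -> torch.Tensor:
        return self.proj(enc_states)


class LusiferSketch(nn.Module):
    def __init__(self, multilingual_encoder: nn.Module, llm_embedder: nn.Module,
                 enc_dim: int, llm_dim: int):
        super().__init__()
        self.encoder = multilingual_encoder        # language-universal learner (frozen here)
        self.connector = Connector(enc_dim, llm_dim)
        self.llm = llm_embedder                    # embedding-specialized LLM (frozen here)
        for module in (self.encoder, self.llm):
            for p in module.parameters():
                p.requires_grad_(False)

    def forward(self, token_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        enc_out = self.encoder(token_states)       # (batch, seq, enc_dim)
        bridged = self.connector(enc_out)          # (batch, seq, llm_dim)
        llm_out = self.llm(bridged)                # (batch, seq, llm_dim)
        # Mean-pool over non-padding tokens to obtain one embedding per text.
        mask = attention_mask.unsqueeze(-1).to(llm_out.dtype)
        return (llm_out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-6)


if __name__ == "__main__":
    enc_dim, llm_dim = 768, 1024  # illustrative sizes only
    # Toy stand-ins; the paper pairs a pretrained multilingual encoder with a
    # pretrained LLM-based embedding model instead.
    toy_encoder = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True), num_layers=2)
    toy_llm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True), num_layers=2)

    model = LusiferSketch(toy_encoder, toy_llm, enc_dim, llm_dim)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters (connector only): {trainable}")

    token_states = torch.randn(2, 16, enc_dim)  # pretend multilingual token representations
    attention_mask = torch.ones(2, 16)
    print(model(token_states, attention_mask).shape)  # torch.Size([2, 1024])
```

Because only the connector carries gradients in this sketch, the trainable-parameter count printed above stays small relative to the two frozen backbones, which mirrors the abstract's claim of integration "through a minimal set of trainable parameters."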