LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models
January 1, 2025
Authors: Hieu Man, Nghia Trung Ngo, Viet Dac Lai, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen
cs.AI
Abstract
Recent advancements in large language model (LLM)-based embedding models
have established new state-of-the-art benchmarks for text embedding tasks,
particularly in dense vector-based retrieval. However, these models
predominantly focus on English, leaving multilingual embedding capabilities
largely unexplored. To address this limitation, we present LUSIFER, a novel
zero-shot approach that adapts LLM-based embedding models for multilingual
tasks without requiring multilingual supervision. LUSIFER's architecture
combines a multilingual encoder, serving as a language-universal learner, with
an LLM-based embedding model optimized for embedding-specific tasks. These
components are seamlessly integrated through a minimal set of trainable
parameters that act as a connector, effectively transferring the multilingual
encoder's language understanding capabilities to the specialized embedding
model. Additionally, to comprehensively evaluate multilingual embedding
performance, we introduce a new benchmark encompassing 5 primary embedding
tasks, 123 diverse datasets, and coverage across 14 languages. Extensive
experimental results demonstrate that LUSIFER significantly enhances the
multilingual performance across various embedding tasks, particularly for
medium and low-resource languages, without requiring explicit multilingual
training data.
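
The data flow described in the abstract (multilingual encoder, then a small trainable connector, then the LLM-based embedding model) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the abstract does not specify the connector's form, any dimensions, the pooling strategy, or whether the two larger components are kept fixed during training. The sketch assumes a single linear projection as the connector, treats both larger components as fixed stand-in functions, and uses mean pooling. All names (`frozen_multilingual_encoder`, `LinearConnector`, `frozen_llm_embedder`, `ENC_DIM`, `LLM_DIM`) are hypothetical.

```python
import random

# Hypothetical dimensions; the abstract does not state any sizes.
ENC_DIM, LLM_DIM = 4, 6


def frozen_multilingual_encoder(tokens):
    """Stand-in for a multilingual encoder: one ENC_DIM vector per token."""
    vectors = []
    for tok in tokens:
        rng = random.Random(tok)  # deterministic per token, purely illustrative
        vectors.append([rng.uniform(-1, 1) for _ in range(ENC_DIM)])
    return vectors


class LinearConnector:
    """The only trainable part of this sketch: projects ENC_DIM -> LLM_DIM."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.weight = [
            [rng.uniform(-0.1, 0.1) for _ in range(ENC_DIM)]
            for _ in range(LLM_DIM)
        ]
        self.bias = [0.0] * LLM_DIM

    def __call__(self, vec):
        # Plain matrix-vector product plus bias.
        return [
            sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(self.weight, self.bias)
        ]

    def num_parameters(self):
        return LLM_DIM * ENC_DIM + LLM_DIM


def frozen_llm_embedder(soft_tokens):
    """Stand-in for the LLM-based embedding model: mean-pools its inputs."""
    n = len(soft_tokens)
    return [sum(column) / n for column in zip(*soft_tokens)]


def embed(tokens, connector):
    """Encoder -> connector -> embedding model, as outlined in the abstract."""
    encoded = frozen_multilingual_encoder(tokens)
    projected = [connector(vec) for vec in encoded]
    return frozen_llm_embedder(projected)


connector = LinearConnector()
vector = embed(["xin", "chào"], connector)
print(len(vector), connector.num_parameters())
```

In this toy setup only the connector's weights and bias would be updated during training, which mirrors the abstract's "minimal set of trainable parameters" in spirit but not in any specific detail.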