jina-embeddings-v3: Multilingual Embeddings With Task LoRA
September 16, 2024
Authors: Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao
cs.AI
Abstract
We introduce jina-embeddings-v3, a novel text embedding model with 570
million parameters that achieves state-of-the-art performance on multilingual
data and long-context retrieval tasks, supporting context lengths of up to
8192 tokens. The model includes a set of task-specific Low-Rank Adaptation
(LoRA) adapters to generate high-quality embeddings for query-document
retrieval, clustering, classification, and text matching. Additionally,
Matryoshka Representation Learning is integrated into the training process,
allowing flexible truncation of embedding dimensions without compromising
performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3
outperforms the latest proprietary embeddings from OpenAI and Cohere on
English tasks, while achieving superior performance compared to
multilingual-e5-large-instruct across all multilingual tasks.
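Matryoshka Representation Learning trains the model so that prefixes of the full embedding remain useful on their own; at inference time, truncating a dimension is just slicing off the leading components and re-normalizing. A minimal illustrative sketch of that downstream step (the vector here is random, standing in for a real model output, and the dimensions 1024 and 256 are illustrative):

```python
# Sketch of Matryoshka-style embedding truncation: keep a prefix of the
# vector and re-normalize it to unit length for cosine-similarity use.
# The "embedding" below is random noise, not actual jina-embeddings-v3 output.
import numpy as np

def truncate(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    head = embedding[:dim]
    return head / np.linalg.norm(head)

rng = np.random.default_rng(0)

# Stand-in for a full-dimensional embedding from the model.
full = rng.normal(size=1024)
full /= np.linalg.norm(full)

# Truncated embedding: shorter, still unit-length, ready for cosine scoring.
short = truncate(full, 256)
print(short.shape)            # (256,)
print(np.linalg.norm(short))  # ≈ 1.0
```

Because both the full and truncated vectors are unit-length, cosine similarity between two truncated embeddings is just their dot product, which is what makes the flexible-dimension trade-off between storage cost and accuracy cheap to exploit.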