
jina-embeddings-v3: Multilingual Embeddings With Task LoRA

September 16, 2024
Authors: Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Nan Wang, Han Xiao
cs.AI

Abstract

We introduce jina-embeddings-v3, a novel text embedding model with 570 million parameters that achieves state-of-the-art performance on multilingual data and long-context retrieval tasks, supporting context lengths of up to 8192 tokens. The model includes a set of task-specific Low-Rank Adaptation (LoRA) adapters to generate high-quality embeddings for query-document retrieval, clustering, classification, and text matching. Additionally, Matryoshka Representation Learning is integrated into the training process, allowing flexible truncation of embedding dimensions without compromising performance. Evaluation on the MTEB benchmark shows that jina-embeddings-v3 outperforms the latest proprietary embeddings from OpenAI and Cohere on English tasks, while achieving superior performance compared to multilingual-e5-large-instruct across all multilingual tasks.
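The Matryoshka truncation mentioned above works because an MRL-trained model packs the most discriminative information into the leading dimensions, so keeping only the first k components still yields a usable embedding. A minimal sketch of how a consumer of such a model might truncate and re-normalize vectors before cosine-similarity search (the function name and the toy vector are illustrative, not part of the paper):

```python
import numpy as np

def truncate_embedding(emb, dim):
    """Keep the first `dim` dimensions of a Matryoshka-style embedding
    and re-normalize, so cosine similarity stays well-defined."""
    v = np.asarray(emb, dtype=np.float64)[:dim]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Toy example: a hypothetical 8-dim embedding truncated to 4 dims.
e = np.array([0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01])
short = truncate_embedding(e, 4)
```

With unit-normalized vectors, cosine similarity reduces to a dot product, which is why the re-normalization step matters after truncation.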


PDF · 326 · November 16, 2024