Making Text Embedders Few-Shot Learners
September 24, 2024
Authors: Chaofan Li, MingHao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Yingxia Shao, Defu Lian, Zheng Liu
cs.AI
Abstract
Large language models (LLMs) with decoder-only architectures demonstrate
remarkable in-context learning (ICL) capabilities. This feature enables them to
effectively handle both familiar and novel tasks by utilizing examples provided
within their input context. Recognizing the potential of this capability, we
propose leveraging the ICL feature in LLMs to enhance the process of text
embedding generation. To this end, we introduce a novel model bge-en-icl, which
employs few-shot examples to produce high-quality text embeddings. Our approach
integrates task-related examples directly into the query side, resulting in
significant improvements across various tasks. Additionally, we have
investigated how to effectively utilize LLMs as embedding models, including
various attention mechanisms, pooling methods, etc. Our findings suggest that
retaining the original framework often yields the best results, underscoring
that simplicity is best. Experimental results on the MTEB and AIR-Bench
benchmarks demonstrate that our approach sets new state-of-the-art (SOTA)
performance. Our model, code and dataset are freely available at
https://github.com/FlagOpen/FlagEmbedding.
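The abstract describes integrating task-related few-shot examples directly into the query side of an LLM-based embedder. Below is a minimal, illustrative sketch of that idea using Hugging Face transformers with last-token pooling. The model identifier, prompt template, and pooling choice are assumptions for demonstration only and do not reproduce the paper's exact setup; see the linked repository for the official implementation.

```python
# Illustrative sketch (not the official bge-en-icl API): prepend a task
# instruction and few-shot demonstrations to the query, then embed with a
# decoder-only LLM using last-token pooling. All names below are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "BAAI/bge-en-icl"  # assumed checkpoint name; replace as needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # keeps the last-token index simple below
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def build_icl_query(task: str, examples: list[tuple[str, str]], query: str) -> str:
    """Prepend the task description and (query, response) demonstrations to the query."""
    parts = [f"Instruct: {task}"]
    for ex_query, ex_response in examples:
        parts.append(f"Query: {ex_query}\nResponse: {ex_response}")
    parts.append(f"Query: {query}")
    return "\n\n".join(parts)


@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # [batch, seq_len, dim]
    # Last-token pooling: hidden state of the final non-padded token.
    last_idx = batch["attention_mask"].sum(dim=1) - 1
    emb = hidden[torch.arange(hidden.size(0)), last_idx]
    return torch.nn.functional.normalize(emb, p=2, dim=-1)


task = "Given a web search query, retrieve relevant passages that answer the query."
examples = [
    ("what is in-context learning?",
     "In-context learning lets an LLM adapt to a task from examples in its prompt."),
]
query_text = build_icl_query(task, examples, "how do LLMs produce text embeddings?")
query_emb = embed([query_text])
passage_emb = embed(["Decoder-only LLMs can be adapted into strong embedding models."])
print((query_emb @ passage_emb.T).item())  # cosine similarity of normalized vectors
```

Only the query side carries the instruction and demonstrations in this sketch; passages are embedded as plain text, mirroring the abstract's description of query-side example integration.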