Contextual Document Embeddings
October 3, 2024
Authors: John X. Morris, Alexander M. Rush
cs.AI
Abstract
Dense document embeddings are central to neural retrieval. The dominant
paradigm is to train and construct embeddings by running encoders directly on
individual documents. In this work, we argue that these embeddings, while
effective, are implicitly out-of-context for targeted use cases of retrieval,
and that a contextualized document embedding should take into account both the
document and neighboring documents in context - analogous to contextualized
word embeddings. We propose two complementary methods for contextualized
document embeddings: first, an alternative contrastive learning objective that
explicitly incorporates the document neighbors into the intra-batch contextual
loss; second, a new contextual architecture that explicitly encodes neighbor
document information into the encoded representation. Results show that both
methods achieve better performance than biencoders in several settings, with
differences especially pronounced out-of-domain. We achieve state-of-the-art
results on the MTEB benchmark with no hard negative mining, score distillation,
dataset-specific instructions, intra-GPU example-sharing, or extremely large
batch sizes. Our method can be applied to improve performance on any
contrastive learning dataset and any biencoder.
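To make the two ideas concrete, the sketch below is a minimal, hypothetical PyTorch-style illustration, not the authors' implementation: a first-stage encoder summarizes neighboring documents and conditions the document embedding on that summary, and an in-batch contrastive loss over neighbor-constructed batches approximates the contextual objective. The class and function names, the mean-pooling of neighbors, and the linear mixing layer are all illustrative assumptions.

```python
# Hypothetical sketch of a contextualized document embedding.
# Assumes `base_encoder` is any text encoder mapping token ids of shape
# (batch, seq_len) to embeddings of shape (batch, embed_dim).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextualEncoder(nn.Module):
    def __init__(self, base_encoder: nn.Module, embed_dim: int):
        super().__init__()
        self.base_encoder = base_encoder
        # Combine the document's own embedding with a neighborhood summary.
        self.mix = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, doc_tokens: torch.Tensor, neighbor_tokens: torch.Tensor) -> torch.Tensor:
        # doc_tokens: (batch, seq_len); neighbor_tokens: (batch, n_neighbors, seq_len)
        doc_emb = self.base_encoder(doc_tokens)                      # (batch, d)
        b, k, s = neighbor_tokens.shape
        nbr_emb = self.base_encoder(neighbor_tokens.reshape(b * k, s)).reshape(b, k, -1)
        ctx = nbr_emb.mean(dim=1)                                    # neighborhood summary, (batch, d)
        # Document embedding conditioned on its corpus context.
        return F.normalize(self.mix(torch.cat([doc_emb, ctx], dim=-1)), dim=-1)


def contextual_contrastive_loss(query_emb: torch.Tensor,
                                doc_emb: torch.Tensor,
                                temperature: float = 0.05) -> torch.Tensor:
    # Standard in-batch InfoNCE loss: if each batch is assembled from
    # neighboring documents, the in-batch negatives are exactly the document
    # neighbors, which is one way to realize the contextual objective.
    logits = query_emb @ doc_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```

How the neighbors are chosen (for example, by clustering or retrieving related documents from the target corpus) is a design choice the abstract does not specify; the sketch only shows where such context would enter the loss and the architecture.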