DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities

October 10, 2024
Authors: Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, Iain Mackie, Jeff Dalton, Andrew Yates
cs.AI

Abstract

Learned Sparse Retrieval (LSR) models use vocabularies from pre-trained transformers, which often split entities into nonsensical fragments. Splitting entities can reduce retrieval accuracy and limit the model's ability to incorporate up-to-date world knowledge not included in the training data. In this work, we enhance the LSR vocabulary with Wikipedia concepts and entities, enabling the model to resolve ambiguities more effectively and stay current with evolving knowledge. Central to our approach is a Dynamic Vocabulary (DyVo) head, which leverages existing entity embeddings and an entity retrieval component that identifies entities relevant to a query or document. We use the DyVo head to generate entity weights, which are then merged with word piece weights to create joint representations for efficient indexing and retrieval using an inverted index. In experiments across three entity-rich document ranking datasets, the resulting DyVo model substantially outperforms state-of-the-art baselines.
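
The merging step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`merge_weights`, `build_inverted_index`, `search`), the `ENT:` prefix used to keep entity terms apart from word pieces, and all weights below are hypothetical, hand-picked values. In the actual model the weights would come from the transformer and the DyVo head. The sketch only shows how entity weights and word piece weights could share one sparse representation that an inverted index can score with a dot product.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SparseVec = Dict[str, float]


def merge_weights(wordpiece: SparseVec, entity: SparseVec,
                  entity_prefix: str = "ENT:") -> SparseVec:
    """Combine word piece and entity weights into one sparse vector.

    The prefix (an assumption of this sketch) keeps entity terms in a
    separate namespace so they cannot collide with word pieces.
    """
    joint = {term: w for term, w in wordpiece.items() if w > 0}
    for name, w in entity.items():
        if w > 0:
            joint[entity_prefix + name] = w
    return joint


def build_inverted_index(docs: Dict[str, SparseVec]) -> Dict[str, List[Tuple[str, float]]]:
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index


def search(index: Dict[str, List[Tuple[str, float]]], query: SparseVec) -> Dict[str, float]:
    """Score documents by the dot product over terms shared with the query."""
    scores: Dict[str, float] = defaultdict(float)
    for term, q_w in query.items():
        for doc_id, d_w in index.get(term, []):
            scores[doc_id] += q_w * d_w
    return dict(scores)


# Hand-picked toy weights: both documents contain the word piece "jaguar",
# but only one is linked to the entity the query is about.
docs = {
    "doc_animal": merge_weights({"jaguar": 1.0, "cat": 0.5}, {"Jaguar_(animal)": 1.5}),
    "doc_car": merge_weights({"jaguar": 1.0, "engine": 0.5}, {"Jaguar_Cars": 1.5}),
}
index = build_inverted_index(docs)
query = merge_weights({"jaguar": 1.0, "speed": 0.5}, {"Jaguar_(animal)": 2.0})
scores = search(index, query)
```

With word pieces alone the two documents would tie on "jaguar"; in this toy setup the shared entity term is what separates them, which is the kind of disambiguation the abstract attributes to the entity vocabulary.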