GeAR: Generation Augmented Retrieval

Abstract

Document retrieval techniques form the foundation for the development of large-scale information systems. The prevailing methodology is to construct a bi-encoder and compute the semantic similarity between the query and each document. However, a single scalar similarity score conveys little information and makes the retrieval results hard to interpret. In addition, this computation mainly emphasizes global semantics and ignores the fine-grained semantic relationship between the query and the complex text in the document. In this paper, we propose a new method called Generation Augmented Retrieval (GeAR) that incorporates well-designed fusion and decoding modules. These modules enable GeAR to generate the relevant text from a document based on the fused representation of the query and the document, so that it learns to "focus on" fine-grained information. Moreover, when used as a retriever, GeAR adds no computational burden compared with a bi-encoder. To support the training of the new framework, we introduce a pipeline that efficiently synthesizes high-quality data using large language models. GeAR exhibits competitive retrieval and localization performance across diverse scenarios and datasets. In addition, the qualitative analysis and the text generated by GeAR provide novel insights into the interpretation of retrieval results. The code, data, and models will be released after completing technical review to facilitate future research.
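As an illustration of the bi-encoder scoring that the abstract contrasts GeAR with, the following is a minimal sketch rather than the authors' implementation. The `encode` function is a toy bag-of-words stand-in for a neural encoder, and every name in it (`encode`, `cosine`, `retrieve`, the sample documents) is hypothetical. It shows that the query and each document are encoded independently and that the only output is one scalar per document, with no indication of which part of the document matched.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Toy stand-in for a neural encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, docs: list[str]) -> list[tuple[float, int]]:
    """Rank documents by scalar similarity; returns (score, doc_index) pairs."""
    query_vec = encode(query)
    # Document vectors do not depend on the query, so they could be precomputed.
    doc_vecs = [encode(doc) for doc in docs]
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    return sorted(scores, reverse=True)


docs = [
    "the eiffel tower is in paris . it was completed in 1889 .",
    "the great wall of china is visible from many hills .",
]
ranking = retrieve("when was the eiffel tower completed", docs)
for score, index in ranking:
    print(f"doc {index}: similarity {score:.3f}")
```

The ranking places the first document on top, but the score alone does not reveal that the second sentence of that document is the one answering the query. That missing fine-grained signal is what the abstract says GeAR's fusion and decoding modules are meant to recover.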
