GeAR: Generation Augmented Retrieval
January 6, 2025
Authors: Haoyu Liu, Shaohan Huang, Jianfeng Liu, Yuefeng Zhan, Hao Sun, Weiwei Deng, Feng Sun, Furu Wei, Qi Zhang
cs.AI
Abstract
Document retrieval techniques form the foundation for the development of
large-scale information systems. The prevailing methodology is to construct a
bi-encoder and compute the semantic similarity. However, such a scalar similarity
conveys too little information and impedes our comprehension of the
retrieval results. In addition, this computational process mainly emphasizes
the global semantics and ignores the fine-grained semantic relationship between
the query and the complex text in the document. In this paper, we propose a new
method called Generation Augmented Retrieval (GeAR) that incorporates
well-designed fusion and decoding modules.
This enables GeAR to generate the relevant text from documents based on the
fused representation of the query and the document, thus learning to "focus on"
the fine-grained information. Also, when used as a retriever, GeAR adds no
computational burden over bi-encoders. To support the training of the new
framework, we introduce a pipeline that efficiently synthesizes high-quality
data using large language models. GeAR exhibits competitive retrieval
and localization performance across diverse scenarios and datasets. Moreover,
the qualitative analysis and the results generated by GeAR provide novel
insights into the interpretation of retrieval results. The code, data, and
models will be released after completing technical review to facilitate future
research.
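
As an illustration of the bi-encoder baseline the abstract contrasts against (not of GeAR itself), the following minimal sketch encodes the query and each document independently and reduces every pair to a single scalar similarity. The `encode` function is a hypothetical bag-of-words stand-in for a trained neural encoder, and the names `encode`, `cosine`, and `retrieve` are assumptions made for this example only.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Toy stand-in for a neural encoder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query: str, documents: list[str]) -> list[tuple[float, str]]:
    """Score each document against the query independently, then rank.

    Each (query, document) pair collapses to one number, which is the
    scalar similarity the abstract describes as hard to interpret.
    """
    q = encode(query)
    scored = [(cosine(q, encode(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


docs = ["the cat sat on the mat", "stock markets fell sharply"]
ranked = retrieve("where did the cat sit", docs)
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")
```

The ranking says which document matches best, but the score alone does not say which part of the document is responsible for the match. That is the gap the abstract says GeAR's fusion and decoding modules are meant to address.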