Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence
March 6, 2025
Authors: Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, Nanyun Peng
cs.AI
Abstract
Dense retrieval models are commonly used in Information Retrieval (IR)
applications, such as Retrieval-Augmented Generation (RAG). Since they often
serve as the first step in these systems, their robustness is critical to avoid
failures. In this work, by repurposing a relation extraction dataset (e.g.
Re-DocRED), we design controlled experiments to quantify the impact of
heuristic biases, such as favoring shorter documents, in retrievers like
Dragon+ and Contriever. Our findings reveal significant vulnerabilities:
retrievers often rely on superficial patterns like over-prioritizing document
beginnings, shorter documents, repeated entities, and literal matches.
Additionally, they tend to overlook whether the document contains the query's
answer, lacking deep semantic understanding. Notably, when multiple biases
combine, models exhibit catastrophic performance degradation, selecting the
answer-containing document in less than 3% of cases over a biased document
without the answer. Furthermore, we show that these biases have direct
consequences for downstream applications like RAG, where retrieval-preferred
documents can mislead LLMs, resulting in a 34% performance drop than not
providing any documents at all.Summary
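The pairwise setup described above can be sketched as follows. This is not the authors' code: the scoring function below is a toy stand-in for a dense bi-encoder such as Contriever (term overlap with a length penalty, deliberately mimicking the literal-match and shortness biases the abstract reports), and the example query and documents are invented for illustration. The metric is the fraction of query pairs in which the answer-containing document outranks the biased distractor.

```python
import re

def tokens(text):
    """Crude whitespace/punctuation tokenizer for the toy scorer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def toy_score(query, doc):
    """Toy stand-in for a dense-retriever similarity score (NOT Contriever):
    rewards literal query-term overlap and penalizes document length,
    reproducing the literal-match and shortness biases described above."""
    q, d = tokens(query), tokens(doc)
    overlap = sum(1 for t in q if t in d)
    return overlap / (len(d) ** 0.5)  # length penalty -> shortness bias

def answer_win_rate(pairs):
    """Fraction of (answer_doc_score, biased_doc_score) pairs in which the
    answer-containing document outranks the biased distractor."""
    pairs = list(pairs)
    return sum(1 for a, b in pairs if a > b) / len(pairs)

# Hypothetical example: the answer document actually contains the answer
# (Warsaw); the distractor is short, repetitive, and literally matches the
# query, but answers nothing.
query = "Where was Marie Curie born?"
answer_doc = ("Marie Curie, born in Warsaw in 1867, later moved to Paris, "
              "where she carried out her research on radioactivity.")
biased_doc = "Marie Curie born? Marie Curie."

rate = answer_win_rate([(toy_score(query, answer_doc),
                         toy_score(query, biased_doc))])
print(rate)  # 0.0 -> the biased distractor outranks the answer document
```

Run over a full benchmark of such pairs (the paper repurposes Re-DocRED for this), a win rate below 3% is what the abstract reports when several biases are stacked in the distractor.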