
Collapse of Dense Retrievers: Short, Early, and Literal Biases Outranking Factual Evidence

March 6, 2025
Authors: Mohsen Fayyaz, Ali Modarressi, Hinrich Schuetze, Nanyun Peng
cs.AI

Abstract

Dense retrieval models are commonly used in Information Retrieval (IR) applications, such as Retrieval-Augmented Generation (RAG). Since they often serve as the first step in these systems, their robustness is critical to avoid failures. In this work, by repurposing a relation extraction dataset (Re-DocRED), we design controlled experiments to quantify the impact of heuristic biases, such as favoring shorter documents, in retrievers like Dragon+ and Contriever. Our findings reveal significant vulnerabilities: retrievers often rely on superficial patterns such as over-prioritizing document beginnings, shorter documents, repeated entities, and literal matches. Additionally, they tend to overlook whether the document contains the query's answer, lacking deep semantic understanding. Notably, when multiple biases combine, models exhibit catastrophic performance degradation, selecting the answer-containing document over a biased document without the answer in fewer than 3% of cases. Furthermore, we show that these biases have direct consequences for downstream applications like RAG, where retrieval-preferred documents can mislead LLMs, resulting in a 34% performance drop compared to not providing any documents at all.
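The abstract describes pairwise comparisons in which a retriever must rank an answer-containing document against a biased distractor (shorter, with a literal match on the query entity). The sketch below illustrates that kind of comparison, assuming the publicly available facebook/contriever checkpoint and the mean-pooling scoring recommended on its model card; the query and the two documents are invented for illustration and are not drawn from Re-DocRED or the paper's experiments.

```python
# Minimal sketch (not the paper's code) of scoring a query against two candidate
# documents with Contriever: one that contains the answer late in a longer
# passage, and one short distractor with a literal match but no answer.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
model = AutoModel.from_pretrained("facebook/contriever")
model.eval()

def mean_pooling(token_embeddings, attention_mask):
    # Contriever embeddings are obtained by mean pooling over non-padding tokens.
    mask = attention_mask.unsqueeze(-1).bool()
    token_embeddings = token_embeddings.masked_fill(~mask, 0.0)
    return token_embeddings.sum(dim=1) / attention_mask.sum(dim=1, keepdim=True)

query = "Where was Marie Curie born?"  # illustrative query
# Longer document; the answer ("Warsaw") appears only at the end.
doc_with_answer = (
    "Marie Curie conducted pioneering research on radioactivity and was the "
    "first person to win Nobel Prizes in two scientific fields. She spent most "
    "of her career in Paris, but she was born in Warsaw."
)
# Short distractor: repeats the query entity literally but contains no answer.
doc_biased = "Marie Curie? Marie Curie was a famous physicist and chemist."

texts = [query, doc_with_answer, doc_biased]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embeddings = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

# Dot-product relevance scores; a biased retriever may rank the short,
# literal-match document above the one that actually answers the query.
score_answer = (embeddings[0] @ embeddings[1]).item()
score_biased = (embeddings[0] @ embeddings[2]).item()
print(f"answer-containing doc: {score_answer:.4f}")
print(f"biased doc (no answer): {score_biased:.4f}")
```

Repeating this comparison over many controlled document pairs, and varying one property at a time (answer position, length, entity repetition, literal overlap), is the general recipe for isolating each heuristic bias.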
