Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents
March 11, 2025
Authors: Haoyu Wang, Sunhao Dai, Haiyuan Zhao, Liang Pang, Xiao Zhang, Gang Wang, Zhenhua Dong, Jun Xu, Ji-Rong Wen
cs.AI
Abstract
Previous studies have found that PLM-based retrieval models exhibit a
preference for LLM-generated content, assigning higher relevance scores to
these documents even when their semantic quality is comparable to human-written
ones. This phenomenon, known as source bias, threatens the sustainable
development of the information access ecosystem. However, the underlying causes
of source bias remain unexplored. In this paper, we explain the process of
information retrieval with a causal graph and discover that PLM-based
retrievers learn perplexity features for relevance estimation, causing source
bias by ranking the documents with low perplexity higher. Theoretical analysis
further reveals that the phenomenon stems from the positive correlation between
the gradients of the loss functions in the language modeling and retrieval
tasks. Based on the analysis, a causal-inspired inference-time debiasing method
is proposed, called Causal Diagnosis and Correction (CDC). CDC first diagnoses
the bias effect of the perplexity and then separates the bias effect from the
overall estimated relevance score. Experimental results across three domains
demonstrate the superior debiasing effectiveness of CDC, emphasizing the
validity of our proposed explanatory framework. Source code is available at
https://github.com/WhyDwelledOnAi/Perplexity-Trap.
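
The diagnose-then-correct idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' estimator: it assumes, purely for illustration, that the bias effect of perplexity on the relevance score is linear, that it can be estimated by an ordinary least-squares slope on a small diagnostic sample, and that correction means subtracting that effect at inference time. The function names `diagnose_bias` and `correct_scores` and the sample numbers are hypothetical.

```python
from statistics import mean


def diagnose_bias(scores, perplexities):
    """Estimate a linear bias coefficient: the least-squares slope of
    relevance score on document perplexity over a diagnostic sample.
    (Assumption: a plain regression stands in for the diagnosis step.)"""
    s_bar = mean(scores)
    p_bar = mean(perplexities)
    cov = sum((p - p_bar) * (s - s_bar) for p, s in zip(perplexities, scores))
    var = sum((p - p_bar) ** 2 for p in perplexities)
    if var == 0:
        # No variation in perplexity, so no bias effect can be estimated.
        return 0.0
    return cov / var


def correct_scores(scores, perplexities, beta):
    """Separate the estimated perplexity effect from each relevance score."""
    return [s - beta * p for s, p in zip(scores, perplexities)]


# Hypothetical diagnostic sample: documents of equal semantic quality whose
# retriever scores fall as perplexity rises.
diag_ppl = [10, 20, 30, 40]
diag_scores = [0.95, 0.90, 0.85, 0.80]

beta = diagnose_bias(diag_scores, diag_ppl)          # negative slope
debiased = correct_scores(diag_scores, diag_ppl, beta)
print(beta, debiased)
```

In this toy sample the slope is negative, meaning lower-perplexity documents score higher, and subtracting the estimated effect leaves all four documents with the same corrected score. A plain regression like this cannot distinguish a spurious perplexity effect from genuine relevance that happens to correlate with perplexity, which is the kind of confounding the paper's causal framing is meant to address.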