Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
April 18, 2025
Authors: Andrea Santilli, Adam Golinski, Michael Kirchhof, Federico Danieli, Arno Blaas, Miao Xiong, Luca Zappella, Sinead Williamson
cs.AI
Abstract
Uncertainty Quantification (UQ) in Language Models (LMs) is crucial for
improving their safety and reliability. Evaluations often use performance
metrics like AUROC to assess how well UQ methods (e.g., negative sequence
probabilities) correlate with task correctness functions (e.g., ROUGE-L). In
this paper, we show that commonly used correctness functions bias UQ
evaluations by inflating the performance of certain UQ methods. We evaluate 7
correctness functions -- from lexical-based and embedding-based metrics to
LLM-as-a-judge approaches -- across 4 datasets x 4 models x 6 UQ methods. Our
analysis reveals that length biases in the errors of these correctness
functions distort UQ assessments by interacting with length biases in UQ
methods. We identify LLM-as-a-judge approaches as among the least length-biased
choices and hence a potential solution to mitigate these biases.
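To make the evaluation protocol described in the abstract concrete, below is a minimal Python sketch of the standard setup: score each response with a UQ method (here, negative sequence log-probability), binarize a correctness function (here, ROUGE-L against a reference), and compute AUROC between the two. The 0.3 threshold, toy log-probabilities, and helper names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch (illustrative, not the paper's exact pipeline) of the standard
# UQ evaluation loop: score each response with a UQ method, binarize a
# correctness function, then compute AUROC between the two.
import numpy as np
from sklearn.metrics import roc_auc_score


def negative_sequence_logprob(token_logprobs):
    """UQ score: negative sum of per-token log-probabilities.

    Higher values mean the model was less confident in the whole sequence.
    Note that the sum grows with response length, one source of the length
    bias discussed in the paper.
    """
    return -float(np.sum(token_logprobs))


def binarize_correctness(scores, threshold=0.3):
    """Turn a continuous correctness function (e.g. ROUGE-L vs. a reference
    answer) into 0/1 labels. The 0.3 threshold is a common heuristic here,
    not a value taken from the paper."""
    return (np.asarray(scores) >= threshold).astype(int)


def uq_auroc(uncertainty_scores, correctness_labels):
    """AUROC of the uncertainty score at flagging *incorrect* responses."""
    errors = 1 - np.asarray(correctness_labels)  # 1 = the response was wrong
    return roc_auc_score(errors, uncertainty_scores)


# Toy usage with made-up numbers: one short confident answer, one longer
# less confident answer.
token_logprobs = [
    [-0.1, -0.2, -0.05],
    [-0.8, -1.2, -0.9, -1.1, -0.7],
]
rouge_l = [0.85, 0.10]  # correctness-function values against the references

uq_scores = [negative_sequence_logprob(lp) for lp in token_logprobs]
labels = binarize_correctness(rouge_l)
print(f"AUROC: {uq_auroc(uq_scores, labels):.2f}")
```

Because the negative sum of token log-probabilities grows with response length, this UQ score is itself length-biased, which is the kind of spurious interaction with length-biased correctness functions that the paper analyzes.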