Language Models Prefer What They Know: Relative Confidence Estimation via Confidence Preferences
February 3, 2025
Authors: Vaishnavi Shrivastava, Ananya Kumar, Percy Liang
cs.AI
Abstract
Language models (LMs) should provide reliable confidence estimates to help
users detect mistakes in their outputs and defer to human experts when
necessary. Asking a language model to assess its confidence ("Score your
confidence from 0-1.") is a natural way of evaluating its uncertainty. However,
models struggle to provide absolute assessments of confidence (i.e. judging
confidence in answering a question independent of other questions) and the
coarse-grained scores they produce are not useful for evaluating the
correctness of their answers. We propose relative confidence estimation, where
we match up questions against each other and ask the model to make relative
judgments of confidence ("Which question are you more confident in answering
correctly?"). Treating each question as a "player" in a series of matchups
against other questions and the model's preferences as match outcomes, we can
use rank aggregation methods like Elo rating and Bradley-Terry to translate the
model's confidence preferences into confidence scores. We evaluate relative
confidence estimation against absolute confidence estimation and
self-consistency confidence methods on five state-of-the-art LMs -- GPT-4,
GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet, and Llama 3.1 405B -- across 14
challenging STEM, social science, and commonsense reasoning question answering
tasks. Our results demonstrate that relative confidence estimation consistently
provides more reliable confidence scores than absolute confidence estimation,
with average gains of 3.5% in selective classification AUC over direct absolute
confidence estimation methods and 1.7% over self-consistency approaches across
all models and datasets.
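To make the rank-aggregation step concrete, the sketch below (an illustration, not the paper's implementation) converts pairwise confidence preferences into per-question scores using standard Elo updates. The `prefers(qa, qb)` callback is a hypothetical stand-in for prompting the model with the comparison question from the abstract; the K-factor, number of shuffled passes, and toy questions are illustrative choices.

```python
import itertools
import random

def elo_confidence_scores(questions, prefers, k=32.0, rounds=10, seed=0):
    """Aggregate pairwise confidence preferences into per-question scores.

    `prefers(qa, qb)` should return True when the model reports being more
    confident answering qa than qb (a hypothetical callback wrapping an LM
    prompt such as "Which question are you more confident in answering
    correctly?").
    """
    rng = random.Random(seed)
    ratings = {q: 1000.0 for q in questions}  # every "player" starts equal
    pairs = list(itertools.combinations(questions, 2))
    for _ in range(rounds):
        rng.shuffle(pairs)  # vary matchup order across passes
        for qa, qb in pairs:
            # Expected score of qa under the standard Elo logistic model.
            expected_a = 1.0 / (1.0 + 10.0 ** ((ratings[qb] - ratings[qa]) / 400.0))
            outcome_a = 1.0 if prefers(qa, qb) else 0.0
            ratings[qa] += k * (outcome_a - expected_a)
            ratings[qb] += k * ((1.0 - outcome_a) - (1.0 - expected_a))
    # Map ratings into (0, 1) with the same logistic link, anchored at the mean.
    mean = sum(ratings.values()) / len(ratings)
    return {q: 1.0 / (1.0 + 10.0 ** ((mean - r) / 400.0)) for q, r in ratings.items()}

if __name__ == "__main__":
    # Toy oracle: pretend the model is more confident on shorter questions.
    qs = ["2 + 2 = ?", "Capital of France?", "Prove the Riemann hypothesis."]
    scores = elo_confidence_scores(qs, lambda qa, qb: len(qa) < len(qb))
    for q, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{s:.3f}  {q}")
```

Bradley-Terry aggregation, the other method the abstract names, would fit the same match outcomes in one batch (e.g., by logistic regression over pairwise comparisons) rather than through online rating updates; both yield a ranking of questions by the model's relative confidence.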