When an LLM is apprehensive about its answers -- and when its uncertainty is justified
March 3, 2025
Authors: Petr Sychev, Andrey Goncharov, Daniil Vyazhev, Edvard Khalafyan, Alexey Zaytsev
cs.AI
Abstract
Uncertainty estimation is crucial for evaluating Large Language Models
(LLMs), particularly in high-stakes domains where incorrect answers have
significant consequences. Numerous approaches address this problem, but each
tends to focus on one specific type of uncertainty while ignoring others. We
investigate which estimates, specifically token-wise entropy and
model-as-judge (MASJ), work for multiple-choice question-answering tasks
across different question topics. Our experiments cover three LLMs (Phi-4,
Mistral, and Qwen) at sizes from 1.5B to 72B parameters and 14 topics. While
MASJ performs similarly to a random error predictor, response entropy
predicts model error in knowledge-dependent domains and serves as an
effective indicator of question difficulty: for biology, the ROC AUC is 0.73.
This correlation vanishes in reasoning-dependent domains: for math questions,
the ROC AUC is 0.55. More fundamentally, we found that the entropy measure
requires a certain amount of reasoning to be present. Thus, entropy related
to data uncertainty should be integrated into uncertainty-estimation
frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro
samples are biased: the amount of reasoning required should be balanced
across subdomains to provide a fairer assessment of LLM performance.