When an LLM is apprehensive about its answers -- and when its uncertainty is justified
March 3, 2025
Authors: Petr Sychev, Andrey Goncharov, Daniil Vyazhev, Edvard Khalafyan, Alexey Zaytsev
cs.AI
Abstract
Uncertainty estimation is crucial for evaluating Large Language Models
(LLMs), particularly in high-stakes domains where incorrect answers have
significant consequences. Numerous approaches address this problem, but each
tends to focus on one specific type of uncertainty while ignoring others. We
investigate which estimates, specifically token-wise entropy and
model-as-judge (MASJ), work for multiple-choice question-answering tasks
across different question topics. Our experiments cover three LLMs (Phi-4,
Mistral, and Qwen) at sizes from 1.5B to 72B parameters and 14 topics. While
MASJ performs similarly to a random error predictor, response entropy
predicts model error in knowledge-dependent domains and serves as an
effective indicator of question difficulty: for biology, the ROC AUC is 0.73.
This correlation vanishes in reasoning-dependent domains: for math questions,
the ROC AUC is 0.55. More fundamentally, we found that the entropy measure
requires a certain amount of reasoning to be present. Thus, entropy related
to data uncertainty should be integrated into uncertainty-estimation
frameworks, while MASJ requires refinement. Moreover, existing MMLU-Pro
samples are biased: the amount of reasoning required should be balanced
across subdomains to provide a fairer assessment of LLM performance.