

Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

January 16, 2025
作者: Tairan Fu, Javier Conde, Gonzalo Martínez, María Grandury, Pedro Reviriego
cs.AI

Abstract

One of the most widely used methods to evaluate LLMs is the Multiple Choice Question (MCQ) test. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, as the results can be processed automatically. To help the LLM answer, a few examples, known as few-shots, can be included in the prompt. Moreover, the LLM can be asked to answer the question directly with the selected option, or to first provide its reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of its response as an indication of the model's confidence in that response. In this paper, we study how the LLM's confidence in its answer depends on whether the model has been asked to answer directly or to provide its reasoning before answering. The evaluation of questions on a wide range of topics across seven different models shows that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning modifying the probability of the selected answer, as the LLM predicts the answer based on both the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood in order to use them in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
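
As a rough illustration of the measurement the abstract describes, the sketch below (not the authors' code) compares the probability a model assigns to each MCQ option letter when asked to answer directly versus after generating its own reasoning. The model name, prompt wording, and the renormalization over the four option letters are assumptions made for this example.

```python
# Minimal sketch: compare the probability assigned to the chosen MCQ option
# with and without chain-of-thought reasoning. Model and prompts are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # assumed placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def option_probabilities(prompt: str, options=("A", "B", "C", "D")):
    """Return the next-token probability of each option letter, renormalized over A-D."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    option_ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
    probs = torch.softmax(logits, dim=-1)[option_ids]
    probs = probs / probs.sum()  # restrict the distribution to the option letters
    return dict(zip(options, probs.tolist()))

question = ("Which planet is closest to the Sun?\n"
            "A. Venus\nB. Mercury\nC. Earth\nD. Mars\n")

# Direct answering: the option letter is scored immediately after the question.
direct_prompt = question + "Answer with a single letter: "
print("direct:", option_probabilities(direct_prompt))

# Chain of thought: the model first generates its reasoning; the decoded text
# (question + reasoning) is then re-scored with an answer cue appended.
cot_prompt = question + "Think step by step, then give the letter.\nReasoning:"
reasoning_ids = model.generate(**tokenizer(cot_prompt, return_tensors="pt"),
                               max_new_tokens=128, do_sample=False)
reasoning_text = tokenizer.decode(reasoning_ids[0], skip_special_tokens=True)
print("after reasoning:", option_probabilities(reasoning_text + "\nAnswer: "))
```

Under this kind of setup, the paper's observation corresponds to the probability of the selected letter being systematically higher in the chain-of-thought case than in the direct case, whether or not that letter is the correct one.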
