Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong

Abstract

One of the most widely used methods to evaluate LLMs is the Multiple Choice Question (MCQ) test. MCQ benchmarks enable the testing of LLM knowledge on almost any topic at scale, because the results can be processed automatically. To help the LLM answer, a few examples, called few shots, can be included in the prompt. Moreover, the LLM can be asked either to answer the question directly with the selected option or to first provide its reasoning and then the selected answer, which is known as chain of thought. In addition to checking whether the selected answer is correct, the evaluation can look at the LLM-estimated probability of the response as an indication of the LLM's confidence in that response. In this paper, we study how the LLM's confidence in its answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. The results of evaluating questions on a wide range of topics in seven different models show that LLMs are more confident in their answers when they provide reasoning before the answer. This occurs regardless of whether the selected answer is correct. Our hypothesis is that this behavior is due to the reasoning, which modifies the probability of the selected answer: the LLM predicts the answer based on both the input question and the reasoning that supports the selection made. Therefore, LLM-estimated probabilities seem to have intrinsic limitations that should be understood before they are used in evaluation procedures. Interestingly, the same behavior has been observed in humans, for whom explaining an answer increases confidence in its correctness.
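
As a rough illustration of what "LLM-estimated probability of the response" can mean in practice, the sketch below turns per-option log-probabilities into a confidence value for the selected option. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `option_confidence`, the renormalization over the listed options, and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def option_confidence(option_logprobs: dict[str, float]) -> tuple[str, float]:
    """Return the most likely option and its probability, renormalized
    over the listed options (an assumption, not the paper's definition)."""
    # Subtract the maximum log-probability before exponentiating
    # to keep the computation numerically stable.
    peak = max(option_logprobs.values())
    weights = {opt: math.exp(lp - peak) for opt, lp in option_logprobs.items()}
    total = sum(weights.values())
    probs = {opt: w / total for opt, w in weights.items()}
    selected = max(probs, key=probs.get)
    return selected, probs[selected]


# Placeholder log-probabilities, invented for illustration only.
# They are NOT measurements from the paper or from any real model.
example_logprobs = {
    "A": math.log(0.40),
    "B": math.log(0.30),
    "C": math.log(0.20),
    "D": math.log(0.10),
}

selected, confidence = option_confidence(example_logprobs)
print(f"selected option: {selected}, confidence: {confidence:.2f}")
```

To compare the two prompting modes discussed in the abstract, the same function could be applied once to the option log-probabilities obtained from a direct-answer prompt and once to those obtained after the model has produced its reasoning, again assuming those log-probabilities are available from the model being evaluated.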
