Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Abstract

Multiple choice question answering (MCQA) is popular for LLM evaluation because it is simple and resembles human testing, but we argue that it needs reform. We first reveal flaws in MCQA's format, which struggles to: 1) test generation and subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, in which LLMs construct and explain answers; these better capture user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from leakage, unanswerability, shortcuts, and saturation. For each issue, we give fixes from education, such as rubrics to guide MCQ writing, scoring methods to discourage guessing, and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations) and show how our prior solutions better measure or address these issues. While we do not need to abandon MCQA, we encourage more effort in refining the task based on educational testing, thereby advancing evaluations.
