Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
February 19, 2025
Authors: Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber
cs.AI
Abstract
Multiple choice question answering (MCQA) is popular for LLM evaluation due
to its simplicity and human-like testing, but we argue for its reform. We first
reveal flaws in MCQA's format, as it struggles to: 1) test
generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge.
We instead advocate for generative formats based on human testing, where LLMs
construct and explain answers, better capturing user needs and knowledge while
remaining easy to score. We then show that even when MCQA is a useful format,
its datasets suffer from: leakage; unanswerability; shortcuts; and saturation.
For each issue, we give fixes from education, like rubrics to guide MCQ writing;
scoring methods to bridle guessing; and Item Response Theory to build harder
MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful
explanations), showing how our prior solutions better measure or address these
issues. While we do not need to desert MCQA, we encourage more efforts in
refining the task based on educational testing, advancing evaluations.
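
To make the educational-testing fixes named above concrete, here is a minimal Python sketch (purely illustrative; the function names and parameter values are ours, not the authors'). It shows formula scoring, which subtracts a fraction of a point per wrong answer so that random guessing has an expected score of zero, and the standard three-parameter-logistic (3PL) Item Response Theory model often used to characterize item difficulty.

```python
import math

def formula_score(num_correct, num_wrong, num_options=4):
    """Formula scoring: right answers minus a fraction of wrong answers,
    so random guessing on k-option MCQs has an expected score of zero."""
    return num_correct - num_wrong / (num_options - 1)

def irt_3pl(theta, a, b, c):
    """3PL IRT model: probability that an examinee with ability theta answers
    correctly an item with discrimination a, difficulty b, and guessing floor c."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Example: a 4-option exam answered at chance scores ~0 after the penalty,
# and a hard item (b=2) is unlikely to be solved by an average-ability model (theta=0).
print(formula_score(num_correct=25, num_wrong=75, num_options=4))  # 0.0
print(irt_3pl(theta=0.0, a=1.2, b=2.0, c=0.25))                    # ~0.31
```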