WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

February 25, 2025
Authors: Ahmed Elhady, Eneko Agirre, Mikel Artetxe
cs.AI

Abstract

We introduce WiCkeD, a simple method to increase the complexity of existing multiple-choice benchmarks by randomly replacing a choice with "None of the above", a method often used in educational tests. We show that WiCkeD can be automatically applied to any existing benchmark, making it more challenging. We apply WiCkeD to 6 popular benchmarks and use it to evaluate 18 open-weight LLMs. The performance of the models drops 12.1 points on average with respect to the original versions of the datasets. When using chain-of-thought on 3 MMLU datasets, the performance drop for the WiCkeD variant is similar to the one observed when using the LLMs directly, showing that WiCkeD is also challenging for models with enhanced reasoning abilities. WiCkeD also uncovers that some models are more sensitive to the extra reasoning required, providing additional information with respect to the original benchmarks. We release our code and data at https://github.com/ahmedselhady/wicked-benchmarks.
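
The core transformation described in the abstract can be illustrated with a short sketch. This is not the authors' implementation (see the linked repository for that); the function name `wicked_variant`, its signature, and the decision to append the new option at the end of the list are assumptions made for illustration. The sketch also assumes that when the randomly chosen option happens to be the correct answer, "None of the above" becomes the correct answer.

```python
import random

NONE_OF_THE_ABOVE = "None of the above"


def wicked_variant(choices, answer_index, rng):
    """Return (new_choices, new_answer_index) with one option replaced.

    Hypothetical helper: one randomly chosen option is dropped and
    "None of the above" is appended as the last option.
    """
    # Pick uniformly which of the original options to drop.
    dropped = rng.randrange(len(choices))
    # Keep the remaining options in their original order.
    kept = [c for i, c in enumerate(choices) if i != dropped]
    new_choices = kept + [NONE_OF_THE_ABOVE]
    if dropped == answer_index:
        # The correct option was removed, so the new option is now correct.
        new_answer = len(new_choices) - 1
    else:
        # The correct option survives; shift its index if an earlier option was dropped.
        new_answer = answer_index - (1 if dropped < answer_index else 0)
    return new_choices, new_answer


# Usage: a seeded generator keeps the transformation reproducible.
rng = random.Random(0)
choices = ["Paris", "Rome", "Madrid", "Berlin"]
new_choices, new_answer = wicked_variant(choices, 0, rng)
print(new_choices, new_answer)
```

The number of options stays the same as in the original question, so scores on the original and transformed versions remain comparable at the same chance level.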
