BIG-Bench Extra Hard
February 26, 2025
Authors: Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
cs.AI
Abstract
Large language models (LLMs) are increasingly deployed in everyday
applications, demanding robust general reasoning capabilities and a diverse
reasoning skill set. However, current LLM reasoning benchmarks predominantly
focus on mathematical and coding abilities, leaving a gap in evaluating broader
reasoning proficiencies. One particular exception is the BIG-Bench dataset,
which has served as a crucial benchmark for evaluating the general reasoning
capabilities of LLMs, thanks to its diverse set of challenging tasks that
allow for a comprehensive assessment of general reasoning across various
skills within a unified framework. However, recent advances in LLMs have led to
saturation on BIG-Bench and its harder version, BIG-Bench Hard (BBH).
State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus
diminishing its utility. To address this limitation, we introduce BIG-Bench
Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM
reasoning evaluation. BBEH replaces each task in BBH with a novel task that
probes a similar reasoning capability but exhibits significantly increased
difficulty. We evaluate various models on BBEH and observe a (harmonic) average
accuracy of 9.8% for the best general-purpose model and 44.8% for the best
reasoning-specialized model, indicating substantial room for improvement and
highlighting the ongoing challenge of achieving robust general reasoning in
LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
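
The abstract reports a harmonic rather than arithmetic average over tasks. As a minimal sketch of why that choice matters, the Python snippet below computes both means for a few made-up per-task accuracies. The function name `harmonic_mean_accuracy`, the `offset` parameter, and the example scores are illustrative assumptions and are not taken from the BBEH repository; `offset` is a hypothetical smoothing term for handling tasks with 0% accuracy.

```python
from typing import Sequence


def harmonic_mean_accuracy(accuracies: Sequence[float], offset: float = 0.0) -> float:
    """Harmonic mean of per-task accuracies, given in percent.

    `offset` is a hypothetical smoothing term added to every score so that
    a task with 0% accuracy does not make the mean undefined.
    """
    if not accuracies:
        raise ValueError("need at least one task accuracy")
    shifted = [a + offset for a in accuracies]
    if any(s <= 0 for s in shifted):
        raise ValueError("harmonic mean requires strictly positive values")
    return len(shifted) / sum(1.0 / s for s in shifted)


# Made-up per-task accuracies for illustration only (not results from the paper).
scores = [80.0, 40.0, 10.0]
arithmetic = sum(scores) / len(scores)
harmonic = harmonic_mean_accuracy(scores)
print(f"arithmetic mean: {arithmetic:.2f}")
print(f"harmonic mean:   {harmonic:.2f}")
```

The harmonic mean is pulled toward the weakest tasks, so a model cannot offset very low accuracy on some tasks with very high accuracy on others.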