ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
December 9, 2024
Authors: Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge
cs.AI
Abstract
Traditional fixed test sets fall short in evaluating open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests.
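As an informal illustration of the sample-pool idea described above (not the paper's actual data schema), the sketch below stores samples from different test sets together with capability tags and treats a custom benchmark as a simple query over those tags. All field names and dataset identifiers here are assumptions made for the example.

```python
# Hypothetical sketch of a unified, ever-expanding sample pool with capability tags.
from dataclasses import dataclass, field

@dataclass
class Sample:
    sample_id: str
    source_dataset: str                                    # original test set the sample came from
    capabilities: set[str] = field(default_factory=set)    # e.g. {"ocr", "counting"}
    prompt: str = ""
    reference: str = ""

# The pool keeps growing as new evaluation datasets are contributed (entries are illustrative).
pool: list[Sample] = [
    Sample("mmmu-0012", "MMMU", {"diagram-reasoning"}),
    Sample("docvqa-0451", "DocVQA", {"ocr", "document-qa"}),
    Sample("cvbench-0093", "CV-Bench", {"counting", "spatial"}),
]

def custom_benchmark(pool: list[Sample], capability: str) -> list[Sample]:
    """Select the subset of the pool that probes one capability of interest."""
    return [s for s in pool if capability in s.capabilities]

ocr_bench = custom_benchmark(pool, "ocr")  # a user-defined, open-ended benchmark
```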
The shift from task-specific benchmarks to ONEBench introduces two challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to the aggregation over diverse metrics, while incompleteness describes comparing models evaluated on different data subsets. To address these challenges, we explore algorithms to aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogeneous datasets, we show our aggregation algorithm provides rankings that highly correlate with those produced by average scores. We also demonstrate robustness to ~95% of measurements missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation, which can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside the rapidly developing foundation models.
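The aggregation step described in the abstract, turning sparse per-sample outcomes into a single model ranking, can be sketched with a standard rank-aggregation method. The snippet below fits a Bradley-Terry-style model over the pairwise wins implied on each shared sample; this is one common way to cope with missing measurements and heterogeneous metrics (only relative outcomes per sample matter), and it is not necessarily the exact algorithm used in the paper. The function name `bradley_terry_scores` and the toy `results` matrix are illustrative.

```python
import numpy as np

def bradley_terry_scores(results: np.ndarray, iters: int = 200, eps: float = 1e-9) -> np.ndarray:
    """results: (n_models, n_samples) with entries 1 (correct), 0 (incorrect), or NaN (not evaluated)."""
    n_models, n_samples = results.shape
    wins = np.zeros((n_models, n_models))
    for s in range(n_samples):
        col = results[:, s]
        winners = np.where(col == 1)[0]   # comparisons with NaN are False, so unevaluated models drop out
        losers = np.where(col == 0)[0]
        for i in winners:
            wins[i, losers] += 1          # model i "beats" every model that failed this sample
    # Minorization-maximization updates for Bradley-Terry strengths p_i.
    p = np.ones(n_models)
    for _ in range(iters):
        total = wins + wins.T                                        # decisive comparisons per model pair
        denom = (total / (p[:, None] + p[None, :] + eps)).sum(axis=1)
        p = wins.sum(axis=1) / (denom + eps)
        p = p / (p.sum() + eps)                                      # scores are identifiable only up to scale
    return p

# Toy usage: three models evaluated on partially overlapping sample pools.
results = np.array([
    [1.0,    1.0, np.nan, 0.0],
    [0.0,    1.0, 1.0,    np.nan],
    [np.nan, 0.0, 0.0,    0.0],
])
print(np.argsort(-bradley_terry_scores(results)))  # model indices ranked strongest to weakest
```

In this formulation, unevaluated entries simply contribute no comparisons, which is why aggregators of this kind can tolerate a large fraction of missing measurements.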