ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
December 9, 2024
作者: Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge
cs.AI
Abstract
Traditional fixed test sets fall short in evaluating open-ended capabilities
of foundation models. To address this, we propose ONEBench (OpeN-Ended
Benchmarking), a new testing paradigm that consolidates individual evaluation
datasets into a unified, ever-expanding sample pool. ONEBench allows users to
generate custom, open-ended evaluation benchmarks from this pool, corresponding
to specific capabilities of interest. By aggregating samples across test sets,
ONEBench enables the assessment of diverse capabilities beyond those covered by
the original test sets, while mitigating overfitting and dataset bias. Most
importantly, it frames model evaluation as a collective process of selecting
and aggregating sample-level tests.
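To make this concrete, below is a minimal, hypothetical sketch of such a sample pool: every sample-level test carries capability tags, and a custom benchmark is simply the subset of the pool matching the capabilities a user cares about. The class names, fields, and tags are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical illustration of the ONEBench idea: a unified, ever-expanding
# pool of sample-level tests drawn from many datasets, from which a user can
# assemble a custom benchmark by capability. Names and fields are assumptions.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Sample:
    sample_id: str
    source_dataset: str           # original benchmark the sample came from
    capabilities: set[str]        # e.g. {"ocr", "counting", "reasoning"}
    payload: dict = field(default_factory=dict)  # prompt, reference, metric, ...

class SamplePool:
    """Ever-expanding pool of sample-level tests."""
    def __init__(self) -> None:
        self.samples: list[Sample] = []

    def add(self, sample: Sample) -> None:
        self.samples.append(sample)

    def build_benchmark(self, wanted: set[str]) -> list[Sample]:
        # A custom, open-ended benchmark: all samples touching the requested
        # capabilities, regardless of which test set they originally came from.
        return [s for s in self.samples if s.capabilities & wanted]

pool = SamplePool()
pool.add(Sample("vqa-17", "VQAv2", {"counting"}, {"question": "How many cats?"}))
pool.add(Sample("mmlu-3", "MMLU", {"reasoning"}, {"question": "..."}))
custom_bench = pool.build_benchmark({"counting"})
```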
The shift from task-specific benchmarks to ONEBench introduces two
challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to the
aggregation over diverse metrics, while incompleteness describes comparing
models evaluated on different data subsets. To address these challenges, we
explore algorithms to aggregate sparse measurements into reliable model scores.
Our aggregation algorithm ensures identifiability (asymptotically recovering
ground-truth scores) and rapid convergence, enabling accurate model ranking
with less data. On homogeneous datasets, we show our aggregation algorithm
provides rankings that highly correlate with those produced by average scores.
We also demonstrate robustness to ~95% of measurements missing, reducing
evaluation cost by up to 20x with little-to-no change in model rankings. We
introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language
models, unifying evaluations across these domains. Overall, we present a
technique for open-ended evaluation, which can aggregate over incomplete,
heterogeneous sample-level measurements to continually grow a benchmark
alongside the rapidly developing foundation models.
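The abstract does not spell out the aggregation algorithm itself. As one assumed instantiation of aggregating sparse, sample-level measurements into model scores, the sketch below converts per-sample outcomes into pairwise comparisons between models evaluated on the same sample and fits Bradley-Terry-style latent scores with a few minorization-maximization updates; the data, names, and choice of estimator are illustrative only, not the paper's exact method.

```python
# A minimal sketch of rank aggregation over sparse, sample-level measurements.
# Approach (assumed, for illustration): derive pairwise "wins" between models
# on shared samples, then fit Bradley-Terry scores via MM iterations.
from collections import defaultdict

# Sparse measurements: results[model][sample_id] = score in [0, 1];
# most (model, sample) pairs are missing, mimicking heavy sparsity.
results = {
    "model_a": {"s1": 1.0, "s2": 1.0, "s4": 1.0},
    "model_b": {"s1": 0.0, "s3": 1.0, "s4": 1.0, "s5": 0.0},
    "model_c": {"s2": 0.0, "s3": 0.0, "s5": 1.0},
}

# Count pairwise wins on samples both models were evaluated on (ties ignored).
wins = defaultdict(float)
models = list(results)
for i, a in enumerate(models):
    for b in models[i + 1:]:
        for s in results[a].keys() & results[b].keys():
            if results[a][s] > results[b][s]:
                wins[(a, b)] += 1.0
            elif results[b][s] > results[a][s]:
                wins[(b, a)] += 1.0

# Bradley-Terry scores via minorization-maximization (Hunter-style) updates.
score = {m: 1.0 for m in models}
for _ in range(100):
    new = {}
    for m in models:
        total_wins = sum(wins[(m, o)] for o in models if o != m)
        denom = sum(
            (wins[(m, o)] + wins[(o, m)]) / (score[m] + score[o])
            for o in models if o != m
        )
        new[m] = total_wins / denom if denom > 0 else score[m]
    norm = sum(new.values())
    score = {m: v * len(models) / norm for m, v in new.items()}  # rescale

ranking = sorted(models, key=score.get, reverse=True)
print(ranking)  # models ordered by aggregated score
```

Because wins are only counted where two models share samples, the same procedure applies when each model has been evaluated on a different, partially overlapping subset of the pool.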