ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

Abstract

Traditional fixed test sets fall short in evaluating the open-ended capabilities of foundation models. To address this, we propose ONEBench (OpeN-Ended Benchmarking), a new testing paradigm that consolidates individual evaluation datasets into a unified, ever-expanding sample pool. ONEBench allows users to generate custom, open-ended evaluation benchmarks from this pool, corresponding to specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias. Most importantly, it frames model evaluation as a collective process of selecting and aggregating sample-level tests. The shift from task-specific benchmarks to ONEBench introduces two challenges: (1) heterogeneity and (2) incompleteness. Heterogeneity refers to aggregating over diverse metrics, while incompleteness refers to comparing models that were evaluated on different data subsets. To address these challenges, we explore algorithms that aggregate sparse measurements into reliable model scores. Our aggregation algorithm ensures identifiability (asymptotically recovering ground-truth scores) and rapid convergence, enabling accurate model ranking with less data. On homogeneous datasets, we show that our aggregation algorithm produces rankings that correlate highly with those produced by average scores. We also demonstrate robustness when ~95% of measurements are missing, reducing evaluation cost by up to 20x with little-to-no change in model rankings. We introduce ONEBench-LLM for language models and ONEBench-LMM for vision-language models, unifying evaluations across these domains. Overall, we present a technique for open-ended evaluation that can aggregate over incomplete, heterogeneous sample-level measurements to continually grow a benchmark alongside rapidly developing foundation models.
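To make the idea of aggregating sparse sample-level measurements into a ranking concrete, the following is a minimal sketch under stated assumptions. The abstract does not name the aggregation algorithm, so a Bradley-Terry model fitted with standard minorization-maximization updates is swapped in here purely for illustration. The function names (`pairwise_wins`, `bradley_terry`), the synthetic accuracies, and the 95% drop rate are all hypothetical choices, not the authors' setup. Per-sample scores are reduced to ordinal "model i beat model j" counts, which is one way to sidestep heterogeneous metrics, and samples where either model is unmeasured are skipped, which handles incompleteness.

```python
import numpy as np

def pairwise_wins(scores):
    """scores: (n_models, n_samples) array, NaN where a measurement is missing.
    Returns wins[i, j] = number of samples measured for both models on which
    model i scored strictly higher than model j."""
    n = scores.shape[0]
    wins = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                both = ~np.isnan(scores[i]) & ~np.isnan(scores[j])
                wins[i, j] = np.sum(scores[i, both] > scores[j, both])
    return wins

def bradley_terry(wins, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths with MM updates.
    A small symmetric pseudo-count (an assumption of this sketch) keeps
    strengths finite when some model never wins or never loses."""
    n = wins.shape[0]
    w = wins + prior * (1 - np.eye(n))   # smoothed win counts, zero diagonal
    games = w + w.T                      # comparisons per pair
    p = np.ones(n)
    for _ in range(iters):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = w.sum(axis=1) / denom
        p /= p.sum()                     # fix the arbitrary scale
    return p

# Synthetic demo: five hypothetical models with well-separated accuracies.
rng = np.random.default_rng(0)
ability = np.array([0.9, 0.75, 0.6, 0.45, 0.3])
full = (rng.random((5, 100_000)) < ability[:, None]).astype(float)

# Drop roughly 95% of the measurements independently at random.
sparse = full.copy()
sparse[rng.random(full.shape) < 0.95] = np.nan

strengths = bradley_terry(pairwise_wins(sparse))
ranking_bt = np.argsort(-strengths)             # ranking from sparse data
ranking_mean = np.argsort(-full.mean(axis=1))   # ranking from full averages
print(ranking_bt, ranking_mean)
```

In this toy setting the two rankings can be compared directly; with real benchmarks the missingness pattern is rarely uniform, so the sketch says nothing about the guarantees claimed for the authors' algorithm.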
