
YourBench: Easy Custom Evaluation Sets for Everyone

April 2, 2025
Authors: Sumuk Shashidhar, Clémentine Fourrier, Alina Lozovskaya, Thomas Wolf, Gokhan Tur, Dilek Hakkani-Tür
cs.AI

Abstract

Evaluating large language models (LLMs) effectively remains a critical bottleneck, as traditional static benchmarks suffer from saturation and contamination, while human evaluations are costly and slow. This hinders timely or domain-specific assessment, which is crucial for real-world applications. We introduce YourBench, a novel, open-source framework that addresses these limitations by enabling dynamic, automated generation of reliable, up-to-date, and domain-tailored benchmarks cheaply and without manual annotation, directly from user-provided documents. We demonstrate its efficacy by replicating 7 diverse MMLU subsets using minimal source text, achieving this for under 15 USD in total inference costs while perfectly preserving the relative model performance rankings (Spearman Rho = 1) observed on the original benchmark. To ensure that YourBench generates data grounded in the provided input rather than relying on posterior parametric knowledge in models, we also introduce Tempora-0325, a novel dataset of over 7K diverse documents published exclusively after March 2025. Our comprehensive analysis spans 26 SoTA models from 7 major families across varying scales (3-671B parameters) to validate the quality of generated evaluations through rigorous algorithmic checks (e.g., citation grounding) and human assessments. We release the YourBench library, the Tempora-0325 dataset, 150k+ question-answer pairs based on Tempora, and all evaluation and inference traces to facilitate reproducible research and empower the community to generate bespoke benchmarks on demand, fostering more relevant and trustworthy LLM evaluation.
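The rank-preservation claim (Spearman Rho = 1) can be verified with a standard rank-correlation check between model scores on the original benchmark and on its YourBench replica. The snippet below is a minimal sketch, not the authors' code: the model names and accuracy values are illustrative placeholders, and it assumes scipy is available.

```python
# Minimal sketch: does a generated benchmark preserve model rankings?
# A Spearman correlation of 1.0 means the relative ordering of models
# is identical on both benchmarks, even if absolute scores differ.
from scipy.stats import spearmanr

# Hypothetical accuracies of the same models on an original MMLU subset
# and on a YourBench-generated replica of that subset (placeholder values).
original_scores = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61, "model_d": 0.55}
replica_scores = {"model_a": 0.79, "model_b": 0.70, "model_c": 0.58, "model_d": 0.49}

models = sorted(original_scores)
rho, p_value = spearmanr(
    [original_scores[m] for m in models],
    [replica_scores[m] for m in models],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# rho == 1.00 here: the replica orders the models exactly as the original does.
```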
