BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models
February 11, 2025
Authors: Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan
cs.AI
Abstract
Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long-context understanding, code generation, and so on. However, measuring these advanced capabilities across languages remains underexplored. To address this gap, we introduce BenchMAX, a multi-way multilingual evaluation benchmark that allows fair comparisons of these important abilities across languages. To maintain high quality, three distinct native-speaking annotators independently annotated each sample in every task after the data was machine-translated from English into 16 other languages. Additionally, we present a novel translation challenge stemming from the dataset construction process. Extensive experiments on BenchMAX reveal that the effectiveness of core capabilities varies across languages, highlighting performance gaps that cannot be bridged by simply scaling up model size. BenchMAX serves as a comprehensive multilingual evaluation platform, providing a promising test bed for promoting the development of multilingual language models. The dataset and code are publicly accessible.
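
The construction pipeline described above (machine-translating each English item into a target language, then having three native speakers independently post-edit every sample) can be pictured with a minimal sketch. Everything here is illustrative: the `Sample` layout, the `reconcile` function, and the majority-agreement rule are assumptions for exposition, not the authors' released tooling.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Sample:
    source_en: str          # original English item
    mt_draft: str           # machine-translated draft in the target language
    post_edits: list[str]   # independent post-edits from 3 native speakers

def reconcile(sample: Sample) -> str:
    """Hypothetical reconciliation rule: if at least two of the three
    annotators produce the same post-edited text, keep that version;
    otherwise fall back to the first annotator's edit for further review."""
    text, votes = Counter(sample.post_edits).most_common(1)[0]
    return text if votes >= 2 else sample.post_edits[0]

# Illustrative usage: one sample translated into one of the 16 target languages.
sample = Sample(
    source_en="Translate and follow the instruction.",
    mt_draft="Übersetze und folge der Anweisung.",
    post_edits=[
        "Übersetze die Anweisung und befolge sie.",
        "Übersetze die Anweisung und befolge sie.",
        "Übersetzen und der Anweisung folgen.",
    ],
)
print(reconcile(sample))  # prints the version the two agreeing annotators chose
```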