The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
April 22, 2025
Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
As large language models (LLMs) continue to advance in linguistic
capabilities, robust multilingual evaluation has become essential for promoting
equitable technological progress. This position paper examines over 2,000
multilingual (non-English) benchmarks from 148 countries, published between
2021 and 2024, to evaluate past, present, and future practices in multilingual
benchmarking. Our findings reveal that, despite significant investments
amounting to tens of millions of dollars, English remains significantly
overrepresented in these benchmarks. Additionally, most benchmarks rely on
original language content rather than translations, with the majority sourced
from high-resource countries such as China, India, Germany, the UK, and the
USA. Furthermore, a comparison of benchmark performance with human judgments
highlights notable disparities. STEM-related tasks exhibit strong correlations
with human evaluations (0.70 to 0.85), while traditional NLP tasks like
question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30).
Moreover, translating English benchmarks into other languages proves
insufficient, as localized benchmarks demonstrate significantly higher
alignment with local human judgments (0.68) than their translated counterparts
(0.47). This underscores the importance of creating culturally and
linguistically tailored benchmarks rather than relying solely on translations.
Through this comprehensive analysis, we highlight six key limitations in
current multilingual evaluation practices, accordingly propose guiding
principles for effective multilingual benchmarking, and outline five critical
research directions to drive progress in the field. Finally, we call for a
global collaborative effort to develop human-aligned benchmarks that prioritize
real-world applications.
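For context, the correlation figures quoted above measure how well a benchmark's scoring of models agrees with human judgments of those same models. The snippet below is a minimal, hypothetical sketch of how such an alignment score could be computed with SciPy; the numbers are illustrative placeholders, not data from the paper, and the paper's exact correlation measure and aggregation procedure may differ.

```python
# Hypothetical sketch: quantifying how well a benchmark tracks human judgment.
# Assumes we have, for a set of models, (a) their scores on a multilingual
# benchmark and (b) aggregated human judgment scores for the same language.
from scipy.stats import pearsonr, spearmanr

# Illustrative values only (not from the paper): five models scored two ways.
benchmark_scores = [62.1, 55.4, 71.8, 48.9, 66.3]  # benchmark accuracy per model
human_judgments = [7.2, 6.1, 7.9, 5.0, 7.0]        # mean human rating per model

pearson_r, _ = pearsonr(benchmark_scores, human_judgments)    # linear agreement
spearman_r, _ = spearmanr(benchmark_scores, human_judgments)  # rank agreement

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

Under this framing, a STEM benchmark with correlation near 0.8 orders models much as local annotators would, while a benchmark near 0.2 (as reported for some translated QA sets such as XQuAD) provides little signal about perceived quality in that language.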