The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
April 22, 2025
Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
As large language models (LLMs) continue to advance in linguistic
capabilities, robust multilingual evaluation has become essential for promoting
equitable technological progress. This position paper examines over 2,000
multilingual (non-English) benchmarks from 148 countries, published between
2021 and 2024, to evaluate past, present, and future practices in multilingual
benchmarking. Our findings reveal that, despite significant investments
amounting to tens of millions of dollars, English remains significantly
overrepresented in these benchmarks. Additionally, most benchmarks rely on
original language content rather than translations, with the majority sourced
from high-resource countries such as China, India, Germany, the UK, and the
USA. Furthermore, a comparison of benchmark performance with human judgments
highlights notable disparities. STEM-related tasks exhibit strong correlations
with human evaluations (0.70 to 0.85), while traditional NLP tasks like
question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30).
Moreover, translating English benchmarks into other languages proves
insufficient, as localized benchmarks demonstrate significantly higher
alignment with local human judgments (0.68) than their translated counterparts
(0.47). This underscores the importance of creating culturally and
linguistically tailored benchmarks rather than relying solely on translations.
Through this comprehensive analysis, we highlight six key limitations in
current multilingual evaluation practices, accordingly propose guiding
principles for effective multilingual benchmarking, and outline five critical
research directions to drive progress in the field. Finally, we call for a
global collaborative effort to develop human-aligned benchmarks that prioritize
real-world applications.
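For context, the correlation figures quoted above measure how well a benchmark's scoring of models agrees with human judgments of those same models. The snippet below is a minimal, hypothetical sketch of how such an alignment score could be computed with SciPy; the numbers are illustrative placeholders, not data from the paper, and the paper's exact correlation measure and aggregation procedure may differ.

```python
# Hypothetical sketch: quantifying how well a benchmark tracks human judgment.
# Assumes we have, for a set of models, (a) their scores on a multilingual
# benchmark and (b) aggregated human judgment scores for the same language.
from scipy.stats import pearsonr, spearmanr

# Illustrative values only (not from the paper): five models scored two ways.
benchmark_scores = [62.1, 55.4, 71.8, 48.9, 66.3]  # benchmark accuracy per model
human_judgments = [7.2, 6.1, 7.9, 5.0, 7.0]        # mean human rating per model

pearson_r, _ = pearsonr(benchmark_scores, human_judgments)    # linear agreement
spearman_r, _ = spearmanr(benchmark_scores, human_judgments)  # rank agreement

print(f"Pearson r = {pearson_r:.2f}, Spearman rho = {spearman_r:.2f}")
```

Under this framing, a STEM benchmark with correlation near 0.8 orders models much as local annotators would, while a benchmark near 0.2 (as reported for some translated QA sets such as XQuAD) provides little signal about perceived quality in that language.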