The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
April 22, 2025
Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
As large language models (LLMs) continue to advance in linguistic
capabilities, robust multilingual evaluation has become essential for promoting
equitable technological progress. This position paper examines over 2,000
multilingual (non-English) benchmarks from 148 countries, published between
2021 and 2024, to evaluate past, present, and future practices in multilingual
benchmarking. Our findings reveal that, despite significant investments
amounting to tens of millions of dollars, English remains significantly
overrepresented in these benchmarks. Additionally, most benchmarks rely on
original language content rather than translations, with the majority sourced
from high-resource countries such as China, India, Germany, the UK, and the
USA. Furthermore, a comparison of benchmark performance with human judgments
highlights notable disparities. STEM-related tasks exhibit strong correlations
with human evaluations (0.70 to 0.85), while traditional NLP tasks like
question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30).
Moreover, translating English benchmarks into other languages proves
insufficient, as localized benchmarks demonstrate significantly higher
alignment with local human judgments (0.68) than their translated counterparts
(0.47). This underscores the importance of creating culturally and
linguistically tailored benchmarks rather than relying solely on translations.
Through this comprehensive analysis, we highlight six key limitations in
current multilingual evaluation practices, propose guiding principles for
effective multilingual benchmarking, and outline five critical
research directions to drive progress in the field. Finally, we call for a
global collaborative effort to develop human-aligned benchmarks that prioritize
real-world applications.