The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
April 22, 2025
Authors: Minghao Wu, Weixuan Wang, Sinuo Liu, Huifeng Yin, Xintong Wang, Yu Zhao, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
As large language models (LLMs) continue to advance in linguistic
capabilities, robust multilingual evaluation has become essential for promoting
equitable technological progress. This position paper examines over 2,000
multilingual (non-English) benchmarks from 148 countries, published between
2021 and 2024, to evaluate past, present, and future practices in multilingual
benchmarking. Our findings reveal that, despite significant investments
amounting to tens of millions of dollars, English remains significantly
overrepresented in these benchmarks. Additionally, most benchmarks rely on
original language content rather than translations, with the majority sourced
from high-resource countries such as China, India, Germany, the UK, and the
USA. Furthermore, a comparison of benchmark performance with human judgments
highlights notable disparities. STEM-related tasks exhibit strong correlations
with human evaluations (0.70 to 0.85), while traditional NLP tasks like
question answering (e.g., XQuAD) show much weaker correlations (0.11 to 0.30).
Moreover, translating English benchmarks into other languages proves
insufficient, as localized benchmarks demonstrate significantly higher
alignment with local human judgments (0.68) than their translated counterparts
(0.47). This underscores the importance of creating culturally and
linguistically tailored benchmarks rather than relying solely on translations.
Through this comprehensive analysis, we highlight six key limitations in
current multilingual evaluation practices, propose guiding principles for
effective multilingual benchmarking, and outline five critical
research directions to drive progress in the field. Finally, we call for a
global collaborative effort to develop human-aligned benchmarks that prioritize
real-world applications.