GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models
April 5, 2025
Authors: Hengyu Luo, Zihao Li, Joseph Attieh, Sawal Devkota, Ona de Gibert, Shaoxiong Ji, Peiqin Lin, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Raúl Vázquez, Mengjie Wang, Samea Yusofi, Jörg Tiedemann
cs.AI
Abstract
Large language models (LLMs) are advancing at an unprecedented pace globally,
with regions increasingly adopting these models for applications in their
primary language. Evaluation of these models in diverse linguistic
environments, especially in low-resource languages, has become a major
challenge for academia and industry. Existing evaluation frameworks are
disproportionately focused on English and a handful of high-resource languages,
thereby overlooking the realistic performance of LLMs in multilingual and
lower-resource scenarios. To address this gap, we introduce GlotEval, a
lightweight framework designed for massively multilingual evaluation.
Supporting seven key tasks (machine translation, text classification,
summarization, open-ended generation, reading comprehension, sequence labeling,
and intrinsic evaluation), spanning dozens to hundreds of languages,
GlotEval highlights consistent multilingual benchmarking, language-specific
prompt templates, and non-English-centric machine translation. This enables a
precise diagnosis of model strengths and weaknesses in diverse linguistic
contexts. A multilingual translation case study demonstrates GlotEval's
applicability for multilingual and language-specific evaluations.
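The language-specific prompt templates highlighted above can be illustrated with a minimal sketch: each prompt language maps to its own instruction template, with an English fallback for uncovered languages. The template strings and the `build_prompt` helper below are hypothetical illustrations, not GlotEval's actual API.

```python
# Hypothetical sketch of language-specific prompt templates with an
# English fallback; names and templates are illustrative, not GlotEval's API.

TEMPLATES = {
    "en": "Translate the following sentence into {tgt_lang}:\n{text}",
    "zh": "请将下面的句子翻译成{tgt_lang}:\n{text}",
    "fi": "Käännä seuraava lause kielelle {tgt_lang}:\n{text}",
}

def build_prompt(prompt_lang: str, tgt_lang: str, text: str) -> str:
    """Select the template for prompt_lang, falling back to English."""
    template = TEMPLATES.get(prompt_lang, TEMPLATES["en"])
    return template.format(tgt_lang=tgt_lang, text=text)

# Prompting in the evaluated language rather than always in English:
print(build_prompt("zh", "English", "你好,世界"))
```

A registry like this also supports the non-English-centric translation setup the abstract mentions, since source, target, and prompt language can all vary independently.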