

GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

April 5, 2025
作者: Hengyu Luo, Zihao Li, Joseph Attieh, Sawal Devkota, Ona de Gibert, Shaoxiong Ji, Peiqin Lin, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Raúl Vázquez, Mengjie Wang, Samea Yusofi, Jörg Tiedemann
cs.AI

Abstract

Large language models (LLMs) are advancing at an unprecedented pace globally, and regions are increasingly adopting these models for applications in their primary languages. Evaluating these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. Supporting seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation) spanning dozens to hundreds of languages, GlotEval emphasizes consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval's applicability for multilingual and language-specific evaluations.
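To make the idea of "language-specific prompt templates" and "non-English-centric machine translation" concrete, the sketch below formats a translation prompt in the source language itself rather than always in English. This is an illustrative assumption, not GlotEval's actual API; the `TEMPLATES` table, language codes, and `build_prompt` function are hypothetical.

```python
# Hypothetical sketch: per-language prompt templates for translation
# evaluation. Prompts are written in the source language, so a
# Chinese-to-Finnish pair never routes through an English prompt.
TEMPLATES = {
    "zho": "请将下面的句子翻译成{tgt}:\n{src}",
    "fin": "Käännä seuraava lause kielelle {tgt}:\n{src}",
    "eng": "Translate the following sentence into {tgt}:\n{src}",
}

def build_prompt(src_lang: str, tgt_lang: str, text: str) -> str:
    """Return a translation prompt phrased in the source language,
    falling back to the English template for unlisted languages."""
    template = TEMPLATES.get(src_lang, TEMPLATES["eng"])
    return template.format(tgt=tgt_lang, src=text)

# Non-English-centric direction: Chinese source, Finnish target.
print(build_prompt("zho", "suomi", "语言模型正在迅速发展。"))
```

The fallback to an English template is one plausible design choice for languages without a curated template; a consistent benchmark would apply the same policy across all evaluated models.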

