

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

November 29, 2024
Authors: Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
cs.AI

Abstract

The performance differential of large language models (LLMs) between languages hinders their effective deployment in many regions, inhibiting the potential economic and societal value of generative AI tools in many communities. However, the development of functional LLMs in many languages (i.e., multilingual LLMs) is bottlenecked by the lack of high-quality evaluation resources in languages other than English. Moreover, current practices in multilingual benchmark construction often translate English resources, ignoring the regional and cultural knowledge of the environments in which multilingual systems would be used. In this work, we construct an evaluation suite of 197,243 QA pairs from local exam sources to measure the capabilities of multilingual LLMs in a variety of regional contexts. Our novel resource, INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across 44 written languages that evaluates multilingual LLMs on performance in the actual language environments where they would be deployed.
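
To make the evaluation setup concrete, here is a minimal sketch of scoring a model on a multiple-choice QA benchmark such as INCLUDE, using the Hugging Face datasets library. The dataset identifier ("CohereForAI/include-base-44"), split name, and field names ("question", "choices", "answer") are illustrative assumptions, not details confirmed by the abstract; consult the official release for the exact schema.

    from datasets import load_dataset

    def evaluate(answer_fn, dataset_id="CohereForAI/include-base-44", split="test"):
        """Score answer_fn on a multiple-choice QA split; returns overall accuracy.

        answer_fn takes a formatted prompt string and returns an option letter.
        NOTE: dataset_id, split, and field names below are illustrative assumptions.
        """
        data = load_dataset(dataset_id, split=split)
        correct, total = 0, 0
        for example in data:
            # Format the question and its options as a single prompt.
            options = "\n".join(
                f"{letter}. {choice}"
                for letter, choice in zip("ABCD", example["choices"])
            )
            prompt = f"{example['question']}\n{options}\nAnswer:"
            prediction = answer_fn(prompt)
            correct += int(prediction.strip().upper() == example["answer"])
            total += 1
        return correct / total

    if __name__ == "__main__":
        # Trivial always-"A" baseline, useful only as a sanity check of the harness.
        print(f"Accuracy: {evaluate(lambda prompt: 'A'):.3f}")

Any answering strategy can be plugged in through answer_fn, from a constant baseline to a full LLM prompting loop; per-language accuracy follows by filtering the dataset on its language field before scoring.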

