INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
November 29, 2024
Authors: Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas, Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen, Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Florez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari, Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzemiński, Gabriel Adriano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell, Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, Antoine Bosselut
cs.AI
Abstract
The performance differential of large language models (LLMs) between languages
hinders their effective deployment in many regions, inhibiting the potential
economic and societal value of generative AI tools in many communities.
However, the development of functional LLMs in many languages (i.e.,
multilingual LLMs) is bottlenecked by the lack of high-quality evaluation
resources in languages other than English. Moreover, current practices in
multilingual benchmark construction often translate English resources, ignoring
the regional and cultural knowledge of the environments in which multilingual
systems would be used. In this work, we construct an evaluation suite of
197,243 QA pairs from local exam sources to measure the capabilities of
multilingual LLMs in a variety of regional contexts. Our novel resource,
INCLUDE, is a comprehensive knowledge- and reasoning-centric benchmark across
44 written languages that evaluates multilingual LLMs for performance in the
actual language environments where they would be deployed.