Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers

March 2, 2025
Authors: Yiran Zhao, Chaoqun Liu, Yue Deng, Jiahao Ying, Mahani Aljunied, Zhaodonghui Li, Lidong Bing, Hou Pong Chan, Yu Rong, Deli Zhao, Wenxuan Zhang
cs.AI

Abstract

Large language models (LLMs) have revolutionized natural language processing (NLP), yet open-source multilingual LLMs remain scarce, and existing models often have limited language coverage. Such models typically prioritize well-resourced languages, while widely spoken but under-resourced languages are often overlooked. To address this disparity, we introduce Babel, an open multilingual LLM that covers the top 25 languages by number of speakers, supports over 90% of the global population, and includes many languages neglected by other open multilingual LLMs. Unlike traditional continued-pretraining approaches, Babel expands its parameter count through a layer extension technique, which raises its performance ceiling. We introduce two variants: Babel-9B, designed for efficient inference and fine-tuning, and Babel-83B, which sets a new standard for open multilingual LLMs. Extensive evaluations on multilingual tasks demonstrate that Babel outperforms open LLMs of comparable size. In addition, using open-source supervised fine-tuning datasets, Babel achieves remarkable performance, with Babel-9B-Chat leading among 10B-scale LLMs and Babel-83B-Chat setting a new standard for multilingual tasks, reaching a level comparable to commercial models.
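
The abstract names layer extension as the way Babel grows its parameter count but gives no details. As a rough illustration of the general idea only, the sketch below deepens a toy layer stack by inserting copies of selected layers. The helper `extend_layers`, the chosen insertion positions, and the copy-based initialization are all assumptions made for this example, not the authors' procedure.

```python
from copy import deepcopy
from typing import Any, List, Sequence


def extend_layers(layers: Sequence[Any], insert_after: Sequence[int]) -> List[Any]:
    """Return a deeper stack with a copy of layer i inserted after each i in insert_after.

    Hypothetical helper: which layers to duplicate and how to initialize the
    new ones are assumptions for illustration, not details from the abstract.
    """
    positions = set(insert_after)
    for p in positions:
        if not 0 <= p < len(layers):
            raise IndexError(f"insert position {p} is outside the layer stack")

    extended: List[Any] = []
    for i, layer in enumerate(layers):
        extended.append(layer)
        if i in positions:
            # The new layer starts as an independent copy of its neighbor.
            extended.append(deepcopy(layer))
    return extended


def count_params(stack: Sequence[dict]) -> int:
    """Count the toy 'parameters' held by every layer in the stack."""
    return sum(len(layer["weights"]) for layer in stack)


# Toy model: each layer is a dict with a name and a small weight list.
base = [{"name": f"layer{i}", "weights": [float(i)] * 4} for i in range(6)]
deeper = extend_layers(base, insert_after=[3, 5])

print(len(base), count_params(base))
print(len(deeper), count_params(deeper))
```

In this toy setup the deeper stack has more layers and more parameters than the base while every original layer is kept unchanged, which is the sense in which depth extension adds capacity on top of an existing model.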
