EuroLLM: Multilingual Language Models for Europe

Abstract

The quality of open-weight LLMs has improved significantly, yet these models remain predominantly focused on English. In this paper, we introduce the EuroLLM project, which aims to develop a suite of open-weight multilingual LLMs capable of understanding and generating text in all official European Union languages, as well as several additional relevant languages. We outline the progress made to date, detailing our data collection and filtering process, the development of scaling laws, the creation of our multilingual tokenizer, and the data mix and modeling configurations. Additionally, we release our initial models, EuroLLM-1.7B and EuroLLM-1.7B-Instruct, and report their performance on multilingual general benchmarks and machine translation.
