Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs
February 18, 2025
Authors: Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
cs.AI
Abstract
Sailor2 is a family of cutting-edge multilingual language models for
South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
diverse applications. Building on Qwen2.5, Sailor2 undergoes continual
pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
support 13 SEA languages while retaining proficiency in Chinese and English.
The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
languages. We also deliver a comprehensive cookbook on developing multilingual
models efficiently, covering five key aspects: data curation, pre-training,
post-training, model customization, and evaluation. We hope that the Sailor2
models (Apache 2.0 license) will drive language development in the SEA region,
and that the Sailor2 cookbook will inspire researchers to build more inclusive
LLMs for other under-served languages.
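
The 500B-token budget splits into 400B SEA-specific tokens and 100B replay tokens, i.e. a 20% replay ratio that protects the Qwen2.5 base's Chinese and English proficiency during continual pre-training. As a minimal illustration of that mixture arithmetic (not the paper's actual data pipeline; `mix_streams`, `sea_docs`, and `replay_docs` are hypothetical names), one can interleave two document streams at the stated ratio:

```python
import random
from itertools import cycle, islice

def mix_streams(sea_docs, replay_docs, replay_ratio=0.2, seed=0):
    """Interleave two corpora so roughly `replay_ratio` of the output
    comes from the replay corpus. Sailor2's budget (100B replay out of
    500B total) corresponds to replay_ratio = 100 / 500 = 0.2."""
    rng = random.Random(seed)
    sea, replay = cycle(sea_docs), cycle(replay_docs)
    while True:
        yield next(replay) if rng.random() < replay_ratio else next(sea)

# Toy check: about 20% of sampled documents come from the replay stream.
sample = list(islice(mix_streams(["sea_doc"], ["replay_doc"]), 10_000))
print(sample.count("replay_doc") / len(sample))  # ~0.2
```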
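Because the models are released under Apache 2.0, they can be used with standard open-source tooling. Below is a minimal sketch using Hugging Face `transformers`, assuming the checkpoints are published on the Hub under the `sail` organization (e.g. `sail/Sailor2-1B`; the 8B and 20B variants would follow the same naming pattern):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B"  # assumed Hub ID; swap in the 8B/20B variant as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Indonesian prompt: "What is the capital of Indonesia?"
prompt = "Apa ibu kota Indonesia?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```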