

Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

December 5, 2024
Authors: Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang
cs.AI

Abstract

Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their strong performance remains largely confined to major world languages, primarily English. Many LLMs continue to struggle with multilingual tasks, especially when it comes to low-resource languages. To address this issue, we introduce Marco-LLM, an LLM built through massive multilingual training for cross-lingual enhancement. We collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models, resulting in a multilingual LLM named Marco-LLM. In comprehensive evaluations on a range of multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA, and many others, Marco-LLM demonstrates substantial improvements over state-of-the-art LLMs. Furthermore, Marco-LLM achieves notable gains in any-to-any machine translation tasks, showing the effectiveness of our multilingual training. Marco-LLM is a pioneering multilingual LLM designed not only to perform exceptionally well on multilingual tasks, including those involving low-resource languages, but also to maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring that LLMs work accurately across a wide range of languages.
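The core recipe described in the abstract is continual pre-training: taking an existing base model and continuing next-token training on a multilingual corpus. The sketch below illustrates that step, assuming the Hugging Face `transformers` and `datasets` libraries and the public `Qwen/Qwen2-7B` checkpoint; the corpus file and all hyperparameters are illustrative placeholders, since the abstract does not specify the paper's actual data mixture or training configuration.

```python
# A minimal sketch of continual pre-training on multilingual text, assuming
# the Hugging Face `transformers`/`datasets` libraries and the public
# Qwen/Qwen2-7B checkpoint. The corpus path and hyperparameters below are
# illustrative, not the paper's actual setup.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "Qwen/Qwen2-7B"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical multilingual corpus: JSON lines, each with a "text" field.
corpus = load_dataset("json", data_files="multilingual_corpus.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=corpus.column_names)

# mlm=False yields standard causal-LM (next-token prediction) targets.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="marco-llm-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=1e-5,  # a small LR helps preserve existing capabilities
        num_train_epochs=1,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```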
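The any-to-any machine translation evaluation mentioned in the abstract can likewise be pictured as prompting the trained model to translate between an arbitrary source and target language, then scoring the output against references. The sketch below uses sacreBLEU; the checkpoint path, prompt template, and example sentences are assumptions for illustration, not the paper's evaluation protocol (which benchmarks on Flores-200, among others).

```python
# A minimal sketch of any-to-any translation evaluation: prompt a causal LM
# for a source-to-target translation and score with sacreBLEU. The checkpoint
# path, prompt template, and example data are hypothetical.
import torch
import sacrebleu
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "marco-llm-cpt"  # hypothetical output of the training sketch above
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
model.eval()

def translate(text: str, src: str, tgt: str) -> str:
    # Real benchmark runs typically use a fixed few-shot template; this
    # zero-shot instruction is only for illustration.
    prompt = f"Translate the following {src} sentence into {tgt}:\n{text}\n{tgt}:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    new_tokens = out[0][inputs["input_ids"].shape[1]:]  # keep only the completion
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

sources = ["Les modèles de langue progressent rapidement."]
references = [["Language models are progressing rapidly."]]  # one reference stream
hypotheses = [translate(s, "French", "English") for s in sources]
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```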

