Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

Abstract

Large Language Models (LLMs) have achieved remarkable progress in recent years, but their strong performance remains largely confined to major world languages, primarily English. Many LLMs still struggle with multilingual tasks, especially in low-resource languages. To address this issue, we introduce Marco-LLM (Massive multilingual training for cross-lingual enhancement LLM). We collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models, which resulted in a multilingual LLM named Marco-LLM. In comprehensive evaluations on multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA, and many others, Marco-LLM demonstrates substantial improvements over state-of-the-art LLMs. Marco-LLM also achieves substantial gains on any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed not only to perform exceptionally well on multilingual tasks, including those in low-resource languages, but also to maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource languages. By bridging languages, this effort demonstrates our dedication to ensuring that LLMs work accurately across various languages.
