Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

Abstract

Large Language Models (LLMs) have made remarkable progress in recent years, but their strongest performance is still largely confined to major world languages, primarily English. Many LLMs continue to struggle with multilingual tasks, especially in low-resource languages. To address this issue, we introduce Marco-LLM (Massive multilingual training for cross-lingual enhancement LLM). We collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training on the Qwen2 models, resulting in a multilingual LLM named Marco-LLM. In comprehensive evaluations on multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA, and many others, Marco-LLM shows substantial improvements over state-of-the-art LLMs. Marco-LLM also achieves substantial gains on any-to-any machine translation tasks, demonstrating the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed not only to perform exceptionally well on multilingual tasks, including those in low-resource languages, but also to maintain strong performance in English and other major languages, narrowing the performance gap between high-resource and low-resource languages. By bridging languages, this effort reflects our commitment to ensuring that LLMs work accurately across a wide range of languages.
