Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement
December 5, 2024
Authors: Lingfeng Ming, Bo Zeng, Chenyang Lyu, Tianqi Shi, Yu Zhao, Xue Yang, Yefeng Liu, Yiyu Wang, Linlong Xu, Yangyang Liu, Xiaohu Zhao, Hao Wang, Heng Liu, Hao Zhou, Huifeng Yin, Zifu Shang, Haijun Li, Longyue Wang, Weihua Luo, Kaifu Zhang
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable progress in recent
years; however, their excellent performance is still largely limited to major
world languages, primarily English. Many LLMs continue to face challenges with
multilingual tasks, especially when it comes to low-resource languages. To
address this issue, we introduce Marco-LLM: Massive multilingual training for
cross-lingual enhancement of LLMs. We have collected a substantial amount of
multilingual data for several low-resource languages and conducted extensive
continual pre-training using the Qwen2 models. This effort has resulted in a
multilingual LLM named Marco-LLM. Through comprehensive evaluations on various
multilingual benchmarks, including MMMLU, AGIEval, Belebele, Flores-200, XCOPA
and many others, Marco-LLM has demonstrated substantial improvements over
state-of-the-art LLMs. Furthermore, Marco-LLM achieved substantial enhancements
in any-to-any machine translation tasks, showing the effectiveness of our
multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed not only
to perform exceptionally well in multilingual tasks, including low-resource
languages, but also maintain strong performance in English and other major
languages, closing the performance gap between high- and low-resource language
capabilities. By bridging languages, this effort demonstrates our dedication to
ensuring LLMs work accurately across various languages.
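As a usage illustration, the sketch below shows how one might prompt a Qwen2-family chat checkpoint for the any-to-any translation setting described above, via the Hugging Face transformers library. The checkpoint identifier, prompt wording, and language pair are assumptions made for illustration; the paper does not specify them.

```python
# Minimal sketch: any-to-any translation with a Qwen2-family chat model
# through Hugging Face transformers. The checkpoint ID below is a
# hypothetical placeholder, not one confirmed by the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AIDC-AI/Marco-LLM"  # placeholder; substitute the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

# Phrase translation as an instruction, using the chat template that
# Qwen2-style tokenizers ship with.
messages = [
    {
        "role": "user",
        "content": "Translate the following English sentence into Kazakh:\n"
                   "The weather is nice today.",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the echoed prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```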