Marco-LLM: Bridging Languages via Massive Multilingual Training for Cross-Lingual Enhancement

Abstract

Large Language Models (LLMs) have achieved remarkable progress in recent years; however, their strong performance is still largely limited to major world languages, primarily English. Many LLMs continue to struggle with multilingual tasks, especially for low-resource languages. To address this issue, we introduce Marco-LLM (Massive multilingual training for cross-lingual enhancement LLM). We collected a substantial amount of multilingual data for several low-resource languages and conducted extensive continual pre-training using the Qwen2 models, resulting in a multilingual LLM named Marco-LLM. In comprehensive evaluations on multilingual benchmarks including MMMLU, AGIEval, Belebele, Flores-200, XCOPA, and many others, Marco-LLM demonstrates substantial improvements over state-of-the-art LLMs. Marco-LLM also achieves substantial gains on any-to-any machine translation tasks, showing the effectiveness of our multilingual LLM. Marco-LLM is a pioneering multilingual LLM designed not only to perform exceptionally well on multilingual tasks, including those in low-resource languages, but also to maintain strong performance in English and other major languages, closing the performance gap between high- and low-resource language capabilities. By bridging languages, this effort demonstrates our dedication to ensuring that LLMs work accurately across various languages.

