Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
October 21, 2024
Authors: Xiang Yue, Yueqi Song, Akari Asai, Seungone Kim, Jean de Dieu Nyandwi, Simran Khanuja, Anjali Kantharuban, Lintang Sutawika, Sathyanarayanan Ramamoorthy, Graham Neubig
cs.AI
Abstract
Despite recent advances in multimodal large language models (MLLMs), their
development has predominantly focused on English- and western-centric datasets
and tasks, leaving most of the world's languages and diverse cultural contexts
underrepresented. This paper introduces Pangea, a multilingual multimodal LLM
trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages.
PangeaIns features: 1) high-quality English instructions, 2) carefully
machine-translated instructions, and 3) culturally relevant multimodal tasks to
ensure cross-cultural coverage. To rigorously assess models' capabilities, we
introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets
covering 47 languages. Results show that Pangea significantly outperforms
existing open-source models in multilingual settings and diverse cultural
contexts. Ablation studies further reveal the importance of English data
proportions, language popularity, and the number of multimodal training samples
on overall performance. We fully open-source our data, code, and trained
checkpoints, to facilitate the development of inclusive and robust multilingual
MLLMs, promoting equity and accessibility across a broader linguistic and
cultural spectrum.