Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Abstract

Despite recent advances in multimodal large language models (MLLMs), their development has predominantly focused on English- and Western-centric datasets and tasks, leaving most of the world's languages and diverse cultural contexts underrepresented. This paper introduces Pangea, a multilingual multimodal LLM trained on PangeaIns, a diverse 6M instruction dataset spanning 39 languages. PangeaIns features 1) high-quality English instructions, 2) carefully machine-translated instructions, and 3) culturally relevant multimodal tasks to ensure cross-cultural coverage. To rigorously assess models' capabilities, we introduce PangeaBench, a holistic evaluation suite encompassing 14 datasets covering 47 languages. Results show that Pangea significantly outperforms existing open-source models in multilingual settings and diverse cultural contexts. Ablation studies further reveal how English data proportions, language popularity, and the number of multimodal training samples affect overall performance. We fully open-source our data, code, and trained checkpoints to facilitate the development of inclusive and robust multilingual MLLMs, promoting equity and accessibility across a broader linguistic and cultural spectrum.
