Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

Abstract

Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, so they struggle to understand non-English input and often fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, without insight into how different training mixes affect different groups of languages. In this work, we present a comprehensive investigation into training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining (1) the number of training languages that can be included without degrading English performance, and the optimal language distributions of (2) pre-training and (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously, (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
