
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

January 9, 2025
Authors: Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, Goran Glavaš
cs.AI

Abstract

Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
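
The abstract's central finding is that a training mix with only 25-50% non-English data, spread over as many as 100 languages, is enough to lift multilingual performance without hurting English. A minimal sketch of how such a mix might be assembled is shown below; the helper name `build_language_mix`, its arguments, and the toy data are illustrative assumptions, not code or settings from the paper.

```python
import random


def build_language_mix(samples_by_lang, english_fraction=0.5, seed=0):
    """Assemble a training mix with `english_fraction` English examples and the
    remainder split uniformly across all other languages.

    samples_by_lang: dict mapping language code -> list of training examples.
    Illustrative sketch only; not the Centurio training pipeline.
    """
    rng = random.Random(seed)
    english = samples_by_lang.get("en", [])
    other_langs = [lang for lang in samples_by_lang if lang != "en"]

    # Overall mix size is bounded by the English data available at the target ratio.
    total = int(len(english) / english_fraction) if english_fraction > 0 else 0
    n_english = int(total * english_fraction)
    per_lang = (total - n_english) // max(len(other_langs), 1)

    mix = rng.sample(english, min(n_english, len(english)))
    for lang in other_langs:
        pool = samples_by_lang[lang]
        mix.extend(rng.sample(pool, min(per_lang, len(pool))))
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    # Toy example: 50% English, remainder split evenly over three other languages.
    data = {
        "en": [f"en_{i}" for i in range(100)],
        "de": [f"de_{i}" for i in range(40)],
        "zh": [f"zh_{i}" for i in range(40)],
        "sw": [f"sw_{i}" for i in range(40)],
    }
    mix = build_language_mix(data, english_fraction=0.5)
    print(len(mix), sum(x.startswith("en_") for x in mix))
```

In this sketch the non-English budget is divided uniformly across languages; the paper studies several such distributions, so the uniform split here is just one possible choice.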

