
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model

January 9, 2025
Authors: Gregor Geigle, Florian Schneider, Carolin Holtermann, Chris Biemann, Radu Timofte, Anne Lauscher, Goran Glavaš
cs.AI

Abstract

Most Large Vision-Language Models (LVLMs) to date are trained predominantly on English data, which makes them struggle to understand non-English input and fail to generate output in the desired target language. Existing efforts mitigate these issues by adding multilingual training data, but do so in a largely ad-hoc manner, lacking insight into how different training mixes tip the scale for different groups of languages. In this work, we present a comprehensive investigation into the training strategies for massively multilingual LVLMs. First, we conduct a series of multi-stage experiments spanning 13 downstream vision-language tasks and 43 languages, systematically examining: (1) the number of training languages that can be included without degrading English performance and (2) optimal language distributions of pre-training as well as (3) instruction-tuning data. Further, we (4) investigate how to improve multilingual text-in-image understanding, and introduce a new benchmark for the task. Surprisingly, our analysis reveals that one can (i) include as many as 100 training languages simultaneously (ii) with as little as 25-50% of non-English data, to greatly improve multilingual performance while retaining strong English performance. We further find that (iii) including non-English OCR data in pre-training and instruction-tuning is paramount for improving multilingual text-in-image understanding. Finally, we put all our findings together and train Centurio, a 100-language LVLM, offering state-of-the-art performance in an evaluation covering 14 tasks and 56 languages.
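The headline finding, that a training mix with as little as 25-50% non-English examples spread over up to 100 languages preserves English performance while boosting multilingual ability, is essentially a data-mixing recipe. The sketch below illustrates one way such a mix could be sampled; the function name, the per-language example pools, and the uniform split of the non-English budget across languages are assumptions made for illustration and are not the authors' actual pipeline.

```python
# Illustrative sketch (not the paper's released code): sample a training mix in
# which a chosen fraction of examples is non-English, split uniformly across
# a (hypothetical) set of per-language example pools.
import random


def build_language_mix(english_pool, non_english_pools,
                       non_english_fraction=0.5, total=100_000, seed=0):
    """Return a shuffled list of `total` examples in which roughly
    `non_english_fraction` come from `non_english_pools` (a dict mapping
    language code -> list of examples) and the rest from `english_pool`."""
    rng = random.Random(seed)
    n_non_en = int(total * non_english_fraction)
    n_en = total - n_non_en

    # English share of the mix.
    mix = rng.sample(english_pool, min(n_en, len(english_pool)))

    # Spread the non-English budget uniformly over all languages,
    # e.g. ~100 languages as in the paper's largest setting.
    languages = list(non_english_pools)
    per_lang = max(1, n_non_en // max(1, len(languages)))
    for lang in languages:
        pool = non_english_pools[lang]
        mix.extend(rng.sample(pool, min(per_lang, len(pool))))

    rng.shuffle(mix)
    return mix


# Example usage with hypothetical pools:
# mix = build_language_mix(
#     english_pool=en_examples,
#     non_english_pools={"de": de_examples, "sw": sw_examples},  # up to ~100 languages
#     non_english_fraction=0.5,
# )
```

A uniform per-language split is only one plausible choice; the paper's experiments compare several language distributions, so the allocation strategy itself is a tunable design decision.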
