Should VLMs be Pre-trained with Image Data?
March 10, 2025
Authors: Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave
cs.AI
Abstract
Pre-trained LLMs that are further trained with image data perform well on
vision-language tasks. While adding images during a second training phase
effectively unlocks this capability, it is unclear how much of a gain or loss
this two-step pipeline gives over VLMs which integrate images earlier into the
training process. To investigate this, we train models spanning various
datasets, scales, image-text ratios, and amount of pre-training done before
introducing vision tokens. We then fine-tune these models and evaluate their
downstream performance on a suite of vision-language and text-only tasks. We
find that pre-training with a mixture of image and text data allows models to
perform better on vision-language tasks while maintaining strong performance on
text-only evaluations. Averaged over 6 diverse tasks, we find that for a 1B
model, introducing visual tokens 80% of the way through pre-training results in
a 2% average improvement over introducing visual tokens to a fully pre-trained
model.
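To make the training schedule concrete, below is a minimal Python sketch, not taken from the paper, of how vision tokens might be introduced part-way through pre-training. The function and parameter names (`sample_batch_source`, `image_fraction_start`, `image_text_ratio`) are illustrative assumptions, and the exact mixing strategy in the paper may differ.

```python
# Minimal sketch (assumed, not the authors' code): a data-mixture schedule that
# switches from text-only pre-training to an image-text mixture at a chosen
# fraction of total training steps.
import random


def sample_batch_source(step: int,
                        total_steps: int,
                        image_fraction_start: float = 0.8,
                        image_text_ratio: float = 0.5) -> str:
    """Return which data stream to draw the next batch from.

    Before `image_fraction_start * total_steps`, train on text only.
    Afterwards, draw image-text batches with probability `image_text_ratio`.
    """
    if step < image_fraction_start * total_steps:
        return "text"
    return "image_text" if random.random() < image_text_ratio else "text"


# Example: a run with 100k steps introduces vision tokens at step 80k.
total_steps = 100_000
counts = {"text": 0, "image_text": 0}
for step in range(total_steps):
    counts[sample_batch_source(step, total_steps)] += 1
print(counts)  # roughly 90k text batches and 10k image-text batches
```

Under this kind of schedule, setting `image_fraction_start` to 1.0 recovers the two-step pipeline (images only after full pre-training), while 0.8 corresponds to the setting the abstract reports as giving a 2% average improvement for the 1B model.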