Multimodal Autoregressive Pre-training of Large Vision Encoders
November 21, 2024
Authors: Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby
cs.AI
Abstract
We introduce a novel method for pre-training of large-scale vision encoders.
Building on recent advancements in autoregressive pre-training of vision
models, we extend this framework to a multimodal setting, i.e., images and
text. In this paper, we present AIMV2, a family of generalist vision encoders
characterized by a straightforward pre-training process, scalability, and
remarkable performance across a range of downstream tasks. This is achieved by
pairing the vision encoder with a multimodal decoder that autoregressively
generates raw image patches and text tokens. Our encoders excel not only in
multimodal evaluations but also in vision benchmarks such as localization,
grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5%
accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently
outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in
multimodal image understanding across diverse settings.
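
To make the encoder-decoder pairing concrete, here is a minimal sketch of the pre-training idea described above: a vision encoder feeds a causal multimodal decoder that autoregressively predicts the next raw image patch (via regression) and the next text token (via classification). This is not the authors' implementation; the class name, all layer sizes, the single shared decoder, and the plain MSE patch loss are illustrative assumptions.

```python
# Sketch only: one encoder over patches, one causal decoder over the
# concatenated image-then-text sequence, with a regression head for
# patches and a classification head for tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoregressiveMultimodalSketch(nn.Module):
    def __init__(self, patch_dim=768, d_model=512, vocab_size=32000,
                 n_layers=4, n_heads=8, max_len=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # stand-in vision trunk
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, max_len, d_model))
        self.patch_head = nn.Linear(d_model, patch_dim)   # raw-pixel regression head
        self.text_head = nn.Linear(d_model, vocab_size)   # next-token prediction head

    def forward(self, patches, text_ids):
        # patches: (B, P, patch_dim) flattened raw patches; text_ids: (B, T).
        # The decoder sees all image patches first, then the text tokens.
        B, P, _ = patches.shape
        T = text_ids.shape[1]
        vis = self.encoder(self.patch_embed(patches))     # encoder features
        seq = torch.cat([vis, self.tok_embed(text_ids)], dim=1)
        seq = seq + self.pos_embed[:, :P + T]
        # Causal mask: position i may attend only to positions <= i.
        causal = torch.triu(torch.full((P + T, P + T), float("-inf")), diagonal=1)
        h = self.decoder(seq, mask=causal)
        # Position i predicts element i+1: patches over the image prefix,
        # tokens over the text suffix.
        loss_img = F.mse_loss(self.patch_head(h[:, :P - 1]), patches[:, 1:])
        logits = self.text_head(h[:, P - 1:P + T - 1])
        loss_txt = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   text_ids.reshape(-1))
        return loss_img + loss_txt

# Toy forward/backward pass with random data.
model = AutoregressiveMultimodalSketch()
loss = model(torch.randn(2, 16, 768), torch.randint(0, 32000, (2, 8)))
loss.backward()
```

Consistent with the frozen-trunk evaluations reported in the abstract, the encoder is the artifact of interest after pre-training; in a setup like this sketch, the decoder would serve only to supply the autoregressive training signal.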