Multimodal Autoregressive Pre-training of Large Vision Encoders
November 21, 2024
作者: Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby
cs.AI
Abstract
We introduce a novel method for pre-training large-scale vision encoders.
Building on recent advancements in autoregressive pre-training of vision
models, we extend this framework to a multimodal setting, i.e., images and
text. In this paper, we present AIMV2, a family of generalist vision encoders
characterized by a straightforward pre-training process, scalability, and
remarkable performance across a range of downstream tasks. This is achieved by
pairing the vision encoder with a multimodal decoder that autoregressively
generates raw image patches and text tokens. Our encoders excel not only in
multimodal evaluations but also in vision benchmarks such as localization,
grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5%
accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently
outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in
multimodal image understanding across diverse settings.
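The core objective is simple enough to sketch: the vision encoder produces features for the image patches, and a causal multimodal decoder consumes those features followed by the text embeddings, regressing the next raw patch and predicting the next text token. Below is a minimal PyTorch sketch of that setup. All class names, dimensions, the masking scheme, and the equal loss weighting are illustrative assumptions, not the paper's released implementation (AIMV2's actual trunk is a large ViT with its own attention and normalization details).

```python
# Minimal sketch of an AIMV2-style multimodal autoregressive objective.
# Everything below (names, sizes, equal loss weighting) is an assumption
# for illustration; it is not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAutoregressiveSketch(nn.Module):
    def __init__(self, patch_dim=768, d_model=512, vocab_size=32000,
                 n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in for the vision encoder trunk (a large ViT in the paper).
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Multimodal decoder: causal, so each position predicts the next
        # image patch or text token from everything before it.
        dec_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=n_layers)
        self.pixel_head = nn.Linear(d_model, patch_dim)   # regress raw patches
        self.token_head = nn.Linear(d_model, vocab_size)  # predict text tokens

    def forward(self, patches, text_ids):
        # patches: (B, P, patch_dim) raw image patches; text_ids: (B, T) ids
        vision_feats = self.encoder(self.patch_embed(patches))
        seq = torch.cat([vision_feats, self.text_embed(text_ids)], dim=1)
        L = seq.size(1)
        causal = torch.triu(
            torch.full((L, L), float("-inf"), device=seq.device), diagonal=1)
        hidden = self.decoder(seq, mask=causal)
        P = patches.size(1)
        # Next-patch regression: position i predicts raw patch i+1.
        pixel_loss = F.mse_loss(self.pixel_head(hidden[:, :P - 1]),
                                patches[:, 1:])
        # Next-token prediction: the last patch position predicts token 0,
        # and each text position predicts the following token.
        text_logits = self.token_head(hidden[:, P - 1:-1])
        text_loss = F.cross_entropy(
            text_logits.reshape(-1, text_logits.size(-1)),
            text_ids.reshape(-1))
        return pixel_loss + text_loss  # equal weighting is an assumption
```

In this sketch a plain causal mask covers the whole sequence; the paper's setup can differ (e.g., how the image prefix is attended within the encoder), so treat the masking and loss balance purely as placeholders for the "generate raw image patches and text tokens autoregressively" idea described in the abstract.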