
Multimodal Autoregressive Pre-training of Large Vision Encoders

November 21, 2024
Authors: Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby
cs.AI

Abstract

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
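The abstract describes pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. The sketch below illustrates what such a combined objective could look like in toy form; the dimensions, the L2 regression loss on patches, the sequence layout, and the equal weighting of the two terms are all assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration, far smaller than any real model)
n_patches, n_text, d, vocab = 4, 3, 8, 10

# Stand-in for the multimodal decoder's output: one hidden state per
# position, with an assumed layout of [image patches ..., text tokens ...]
hidden = rng.normal(size=(n_patches + n_text, d))

# Two prediction heads (randomly initialized here):
W_pix = rng.normal(size=(d, d))      # regresses the next raw patch (d-dim toy patch)
W_txt = rng.normal(size=(d, vocab))  # predicts the next text token

# Autoregressive targets: the next patch at image positions,
# the next token id at text positions.
patch_targets = rng.normal(size=(n_patches, d))
text_targets = rng.integers(0, vocab, size=n_text)

# Image term: since raw patches are generated autoregressively, a natural
# choice (an assumption here) is a per-patch L2 regression loss.
pred_patches = hidden[:n_patches] @ W_pix
loss_img = np.mean((pred_patches - patch_targets) ** 2)

# Text term: standard next-token cross-entropy.
logits = hidden[n_patches:] @ W_txt
logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss_txt = -np.mean(log_probs[np.arange(n_text), text_targets])

# Combined pre-training objective (equal weighting is an assumption)
loss = loss_img + loss_txt
```

In a real training loop the hidden states would come from a causally masked transformer decoder conditioned on the vision encoder's features, and both heads would be trained jointly; the point of the sketch is only the shape of the two-term objective.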

