Multimodal Autoregressive Pre-training of Large Vision Encoders

Abstract

We introduce a novel method for pre-training large-scale vision encoders. Building on recent advances in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and strong performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
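To make the idea of a decoder that autoregressively generates raw image patches and text tokens more concrete, the following is a minimal sketch of what such a training objective could look like. It is not the authors' implementation. The abstract does not state the loss functions, so the squared-error loss on patches, the cross-entropy loss on tokens, the function names, and the `text_weight` parameter are all assumptions made for illustration.

```python
import math

def next_patch_loss(pred_patches, patches):
    """Mean squared error for next-patch prediction (assumed loss).

    pred_patches[i] is the prediction made at position i; its target is
    patches[i + 1], so the last prediction has no target and is ignored.
    """
    total, count = 0.0, 0
    for pred, target in zip(pred_patches[:-1], patches[1:]):
        for p, t in zip(pred, target):
            total += (p - t) ** 2
            count += 1
    return total / count

def next_token_loss(logits, tokens):
    """Average cross-entropy for next-token prediction (assumed loss).

    logits[i] scores the vocabulary for the token at position i + 1.
    """
    pairs = list(zip(logits[:-1], tokens[1:]))
    total = 0.0
    for row, target in pairs:
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(pairs)

def multimodal_loss(pred_patches, patches, logits, tokens, text_weight=1.0):
    """Weighted sum of both terms; text_weight is a hypothetical knob."""
    return (next_patch_loss(pred_patches, patches)
            + text_weight * next_token_loss(logits, tokens))

# Toy inputs: three 2-pixel "patches" and three tokens over a 3-word vocabulary.
patches = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
pred_patches = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.0]]  # exact next-patch guesses
tokens = [0, 1, 2]
logits = [[0.0, 0.0, 0.0]] * 3  # uniform scores over the vocabulary

loss = multimodal_loss(pred_patches, patches, logits, tokens)
```

In this toy setup the patch term is zero because each prediction matches the following patch exactly, and the text term equals the entropy of a uniform guess over three tokens. A real model would produce `pred_patches` and `logits` from a decoder conditioned on the vision encoder's output.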
