Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Abstract

Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. Although scaling Transformer-based generators has been central to recent advances, the tokenizer component itself is rarely scaled, leaving open questions about how auto-encoder design choices influence both its own objective of reconstruction and downstream generative performance. Our work explores scaling in auto-encoders to fill this gap. To facilitate this exploration, we replace the typical convolutional backbone with an enhanced Vision Transformer architecture for Tokenization (ViTok). We train ViTok on large-scale image and video datasets far exceeding ImageNet-1K, removing data constraints on tokenizer scaling. We first study how scaling the auto-encoder bottleneck affects both reconstruction and generation, and find that while bottleneck size is highly correlated with reconstruction, its relationship with generation is more complex. We next explore the effect of separately scaling the auto-encoder's encoder and decoder on reconstruction and generation performance. Crucially, we find that scaling the encoder yields minimal gains for either reconstruction or generation, while scaling the decoder boosts reconstruction but gives mixed benefits for generation. Building on our exploration, we design ViTok as a lightweight auto-encoder that achieves competitive performance with state-of-the-art auto-encoders on ImageNet-1K and COCO reconstruction tasks (256p and 512p) while outperforming existing auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates competitive performance on image generation for ImageNet-1K and sets new state-of-the-art benchmarks for class-conditional video generation on UCF-101.
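As a rough illustration of what "scaling the bottleneck" can mean for a patch-based tokenizer, the sketch below counts how many latent values an input is compressed into. It assumes that the input is split into non-overlapping patches, that each patch becomes one token, and that each token stores a fixed number of latent channels. The class name, its fields, and the example sizes are hypothetical and chosen only for the arithmetic; they are not taken from the abstract and should not be read as ViTok's actual configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BottleneckSpec:
    """Hypothetical description of a patch-based auto-encoder bottleneck."""

    frames: int           # number of input frames (1 for a still image)
    height: int           # input height in pixels
    width: int            # input width in pixels
    temporal_patch: int   # frames grouped into one token along time
    spatial_patch: int    # pixels per token along each spatial side
    latent_channels: int  # latent values stored per token

    def num_tokens(self) -> int:
        # One token per non-overlapping (time, height, width) patch.
        return (
            (self.frames // self.temporal_patch)
            * (self.height // self.spatial_patch)
            * (self.width // self.spatial_patch)
        )

    def latent_floats(self) -> int:
        # Total size of the latent code: tokens times channels per token.
        return self.num_tokens() * self.latent_channels

    def compression_ratio(self, pixel_channels: int = 3) -> float:
        # Ratio of raw pixel values to latent values.
        pixels = self.frames * self.height * self.width * pixel_channels
        return pixels / self.latent_floats()


# Illustrative sizes only (assumed, not from the paper).
image = BottleneckSpec(frames=1, height=256, width=256,
                       temporal_patch=1, spatial_patch=16, latent_channels=16)
video = BottleneckSpec(frames=16, height=128, width=128,
                       temporal_patch=4, spatial_patch=8, latent_channels=16)

for name, spec in (("image", image), ("video", video)):
    print(name, spec.num_tokens(), spec.latent_floats(), spec.compression_ratio())
```

Under these assumptions, the bottleneck can be widened either by shrinking the patch size (more tokens) or by raising the channel count per token; both increase the total number of latent values that the decoder reconstructs from and that a downstream generator has to model.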
