Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
January 16, 2025
Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
cs.AI
Abstract
Visual tokenization via auto-encoding empowers state-of-the-art image and
video generative models by compressing pixels into a latent space. Although
scaling Transformer-based generators has been central to recent advances, the
tokenizer component itself is rarely scaled, leaving open questions about how
auto-encoder design choices influence both its objective of reconstruction and
downstream generative performance. Our work explores scaling in
auto-encoders to fill this gap. To facilitate this exploration,
we replace the typical convolutional backbone with an enhanced Vision
Transformer architecture for Tokenization (ViTok). We train ViTok on
large-scale image and video datasets far exceeding ImageNet-1K, removing data
constraints on tokenizer scaling. We first study how scaling the auto-encoder
bottleneck affects both reconstruction and generation -- and find that while it
is highly correlated with reconstruction, its relationship with generation is
more complex. We next explore the effect of separately scaling the
auto-encoder's encoder and decoder on reconstruction and generation
performance. Crucially, we find that scaling the encoder yields minimal gains
for either reconstruction or generation, while scaling the decoder boosts
reconstruction but the benefits for generation are mixed. Building on our
exploration, we design ViTok as a lightweight auto-encoder that achieves
competitive performance with state-of-the-art auto-encoders on ImageNet-1K and
COCO reconstruction tasks (256p and 512p) while outperforming existing
auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x
fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates
competitive performance on image generation for ImageNet-1K and sets new
state-of-the-art benchmarks for class-conditional video generation on UCF-101.
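The tokenizer pipeline the abstract describes — splitting pixels into patches, encoding them into a low-dimensional latent bottleneck, and decoding back to pixels — can be sketched minimally as follows. This is an illustrative assumption, not the paper's implementation: the linear projections below stand in for ViTok's Transformer encoder and decoder, and all dimensions (16px patches, a 16-dim bottleneck) are made up for the example.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into (N, p*p*C) flattened patches."""
    h, w, c = img.shape
    grid = img.reshape(h // p, p, w // p, p, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * c)

def unpatchify(patches, h, w, p, c):
    """Inverse of patchify: (N, p*p*C) patches back to an (H, W, C) image."""
    grid = patches.reshape(h // p, w // p, p, p, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

rng = np.random.default_rng(0)
p, h, w, c, d = 16, 256, 256, 3, 16     # patch size, image dims, bottleneck width (illustrative)
img = rng.standard_normal((h, w, c))

# Stand-ins for the ViT encoder/decoder: one linear map each.
W_enc = rng.standard_normal((p * p * c, d)) * 0.01   # 768 -> 16: the bottleneck
W_dec = rng.standard_normal((d, p * p * c)) * 0.01   # 16 -> 768: back to pixel patches

tokens = patchify(img, p) @ W_enc                    # compressed latents for the generator
recon = unpatchify(tokens @ W_dec, h, w, p, c)       # decoder output, same shape as the input

print(tokens.shape)   # (256, 16): 256 tokens, each 48x smaller than its 768-dim raw patch
print(recon.shape)    # (256, 256, 3)
```

The bottleneck width `d` here is the knob the paper studies: it controls how much the latents compress the pixels, which the abstract reports as strongly correlated with reconstruction quality but more loosely tied to downstream generation.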