Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
January 16, 2025
Authors: Philippe Hansen-Estruch, David Yan, Ching-Yao Chung, Orr Zohar, Jialiang Wang, Tingbo Hou, Tao Xu, Sriram Vishwanath, Peter Vajda, Xinlei Chen
cs.AI
Abstract
Visual tokenization via auto-encoding empowers state-of-the-art image and
video generative models by compressing pixels into a latent space. Although
scaling Transformer-based generators has been central to recent advances, the
tokenizer component itself is rarely scaled, leaving open questions about how
auto-encoder design choices influence both its objective of reconstruction and
downstream generative performance. Our work explores auto-encoder scaling to
fill this gap. To facilitate this exploration,
we replace the typical convolutional backbone with an enhanced Vision
Transformer architecture for Tokenization (ViTok). We train ViTok on
large-scale image and video datasets far exceeding ImageNet-1K, removing data
constraints on tokenizer scaling. We first study how scaling the auto-encoder
bottleneck affects both reconstruction and generation -- and find that while it
is highly correlated with reconstruction, its relationship with generation is
more complex. We next explore the effect of separately scaling the
auto-encoder's encoder and decoder on reconstruction and generation
performance. Crucially, we find that scaling the encoder yields minimal gains
for either reconstruction or generation, while scaling the decoder boosts
reconstruction but the benefits for generation are mixed. Building on our
exploration, we design ViTok as a lightweight auto-encoder that achieves
competitive performance with state-of-the-art auto-encoders on ImageNet-1K and
COCO reconstruction tasks (256p and 512p) while outperforming existing
auto-encoders on 16-frame 128p video reconstruction for UCF-101, all with 2-5x
fewer FLOPs. When integrated with Diffusion Transformers, ViTok demonstrates
competitive performance on image generation for ImageNet-1K and sets new
state-of-the-art benchmarks for class-conditional video generation on UCF-101.
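The compression that the abstract's bottleneck study measures can be made concrete with a toy sketch: pixels are split into non-overlapping patch tokens and each token is projected into a small latent. This is only an illustration of the patchify-then-compress idea, assuming a 256p RGB input, a 16x16 patch size, and a 16-channel latent; the helper names, the single linear projection, and all sizes are illustrative assumptions, not ViTok's actual architecture (which uses full Transformer encoder and decoder stacks).

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into non-overlapping flattened patch tokens."""
    H, W, C = img.shape
    gh, gw = H // patch, W // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * C)

def unpatchify(tokens, H, W, C, patch=16):
    """Inverse of patchify: reassemble patch tokens into an (H, W, C) image."""
    gh, gw = H // patch, W // patch
    x = tokens.reshape(gh, gw, patch, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * patch, gw * patch, C)

rng = np.random.default_rng(0)
H = W = 256          # 256p input, as in the ImageNet-1K/COCO settings above
C, patch = 3, 16
d_latent = 16        # bottleneck channels per token (an assumed size)

img = rng.standard_normal((H, W, C)).astype(np.float32)
tokens = patchify(img, patch)                      # (256, 768): 16x16 grid of patch tokens
W_enc = rng.standard_normal((tokens.shape[1], d_latent)).astype(np.float32) * 0.01
latents = tokens @ W_enc                           # (256, 16): compressed latent sequence

# Floats in the pixels vs. floats in the latent: the ratio the
# bottleneck-scaling experiments vary.
ratio = img.size / latents.size                    # 196608 / 4096 = 48.0
```

Scaling the bottleneck in this picture means growing `d_latent` (or the token count), which trades compression for reconstruction fidelity; the paper's finding is that this knob tracks reconstruction quality closely but relates to downstream generation less directly.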