Factorized Visual Tokenization and Generation

November 25, 2024
Authors: Zechen Bai, Jianxiong Gao, Ziteng Gao, Pichao Wang, Zheng Zhang, Tong He, Mike Zheng Shou
cs.AI

Abstract

Visual tokenizers are fundamental to image generation. They convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers like VQGAN face significant limitations due to constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy, promoting diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models like CLIP and DINO to infuse semantic richness into the learned representations. This design ensures our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted into auto-regressive image generation. https://showlab.github.io/FQGAN
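The core idea of factorized quantization lends itself to a short sketch. The following is a minimal, hypothetical PyTorch illustration (not the authors' released code): the encoder feature is split across k independent sub-codebooks, so each nearest-neighbor lookup searches only a small codebook while the effective vocabulary grows multiplicatively. All class names, dimensions, and the straight-through gradient trick are illustrative assumptions; the paper's disentanglement regularization and CLIP/DINO-based supervision are omitted.

```python
# Minimal sketch of factorized quantization (illustrative assumptions, not the paper's code).
import torch
import torch.nn as nn

class FactorizedQuantizer(nn.Module):
    def __init__(self, num_sub_codebooks=2, codebook_size=8192, dim=256):
        super().__init__()
        assert dim % num_sub_codebooks == 0
        self.k = num_sub_codebooks
        self.sub_dim = dim // num_sub_codebooks
        # k independent sub-codebooks; the effective vocabulary is codebook_size ** k,
        # but each lookup only scans codebook_size entries.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, self.sub_dim) for _ in range(self.k)
        )

    def forward(self, z):
        # z: (batch, dim) encoder features, split into k equal slices.
        chunks = z.chunk(self.k, dim=-1)
        quantized, indices = [], []
        for chunk, codebook in zip(chunks, self.codebooks):
            # Nearest-neighbor lookup within each small sub-codebook.
            dist = torch.cdist(chunk, codebook.weight)   # (batch, codebook_size)
            idx = dist.argmin(dim=-1)                    # (batch,)
            q = codebook(idx)
            # Straight-through estimator so gradients flow back to the encoder.
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx)
        # Returns the quantized feature and one token index per sub-codebook.
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```

Under these assumptions, lookup cost scales with k small codebooks rather than one codebook of size codebook_size ** k, which is the scalability argument the abstract makes.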
