Factorized Visual Tokenization and Generation

Abstract

Visual tokenizers are fundamental to image generation: they convert visual data into discrete tokens, enabling transformer-based models to excel at image generation. Despite their success, VQ-based tokenizers such as VQGAN are limited by constrained vocabulary sizes. Simply expanding the codebook often leads to training instability and diminishing performance gains, making scalability a critical challenge. In this work, we introduce Factorized Quantization (FQ), a novel approach that revitalizes VQ-based tokenizers by decomposing a large codebook into multiple independent sub-codebooks. This factorization reduces the lookup complexity of large codebooks, enabling more efficient and scalable visual tokenization. To ensure that each sub-codebook captures distinct and complementary information, we propose a disentanglement regularization that explicitly reduces redundancy and promotes diversity across the sub-codebooks. Furthermore, we integrate representation learning into the training process, leveraging pretrained vision models such as CLIP and DINO to infuse semantic richness into the learned representations. This design ensures that our tokenizer captures diverse semantic levels, leading to more expressive and disentangled representations. Experiments show that the proposed FQGAN model substantially improves the reconstruction quality of visual tokenizers, achieving state-of-the-art performance. We further demonstrate that this tokenizer can be effectively adapted to auto-regressive image generation. https://showlab.github.io/FQGAN
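To make the idea of decomposing a codebook into independent sub-codebooks concrete, the following is a minimal NumPy sketch and not the FQGAN implementation. Several choices are assumptions made only for illustration: each sub-codebook quantizes its own slice of the feature vector (the abstract does not say how features are routed to sub-codebooks), nearest-neighbor lookup uses squared Euclidean distance, and `redundancy_penalty` is a hypothetical stand-in for the disentanglement regularization, taken here as the mean squared cosine similarity between the outputs of different sub-codebooks. The names `nearest_code`, `factorized_quantize`, and `redundancy_penalty` are invented for this sketch.

```python
import numpy as np


def nearest_code(z, codebook):
    """Index of the closest codebook row (squared L2) for each row of z."""
    # z: (n, d), codebook: (K, d) -> pairwise distances of shape (n, K)
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def factorized_quantize(z, codebooks):
    """Quantize each feature slice with its own sub-codebook.

    Assumption: features are split evenly along the channel axis,
    one slice per sub-codebook.
    """
    chunks = np.split(z, len(codebooks), axis=1)
    indices = [nearest_code(c, cb) for c, cb in zip(chunks, codebooks)]
    parts = [cb[i] for cb, i in zip(codebooks, indices)]
    # indices: (n, num_books); parts: list of (n, sub_dim) quantized slices
    return np.stack(indices, axis=1), parts


def redundancy_penalty(parts, eps=1e-8):
    """Hypothetical disentanglement term: mean squared cosine similarity
    between the quantized outputs of every pair of sub-codebooks."""
    total, pairs = 0.0, 0
    for i in range(len(parts)):
        for j in range(i + 1, len(parts)):
            a = parts[i] / (np.linalg.norm(parts[i], axis=1, keepdims=True) + eps)
            b = parts[j] / (np.linalg.norm(parts[j], axis=1, keepdims=True) + eps)
            total += float(((a * b).sum(axis=1) ** 2).mean())
            pairs += 1
    return total / max(pairs, 1)


rng = np.random.default_rng(0)
num_books, book_size, sub_dim = 2, 4, 3  # toy sizes, chosen for illustration
codebooks = [rng.normal(size=(book_size, sub_dim)) for _ in range(num_books)]
z = rng.normal(size=(5, num_books * sub_dim))  # five toy feature vectors

indices, parts = factorized_quantize(z, codebooks)
quantized = np.concatenate(parts, axis=1)  # same shape as z
penalty = redundancy_penalty(parts)

joint_vocab = book_size ** num_books  # 16 distinct token combinations
comparisons = book_size * num_books   # 8 distance computations per feature
```

With these toy sizes, each feature vector is represented by one index per sub-codebook, so two sub-codebooks of 4 entries span 16 joint combinations while only 8 entries are compared during lookup. A single codebook with the same joint vocabulary would need 16 comparisons, and the gap widens quickly as the sub-codebooks grow.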
