GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation

Abstract

In autoregressive (AR) image generation, visual tokenizers compress images into compact discrete latent tokens, enabling efficient training of downstream autoregressive models for visual generation via next-token prediction. While scaling visual tokenizers improves image reconstruction quality, it often degrades downstream generation quality, a challenge not adequately addressed in existing literature. To address this, we introduce GigaTok, the first approach to simultaneously improve image reconstruction, generation, and representation learning when scaling visual tokenizers. We identify the growing complexity of the latent space as the key factor behind the reconstruction vs. generation dilemma. To mitigate this, we propose semantic regularization, which aligns tokenizer features with semantically consistent features from a pre-trained visual encoder. This constraint prevents excessive latent space complexity during scaling, yielding consistent improvements in both reconstruction and downstream autoregressive generation. Building on semantic regularization, we explore three key practices for scaling tokenizers: (1) using 1D tokenizers for better scalability, (2) prioritizing decoder scaling when expanding both encoder and decoder, and (3) employing entropy loss to stabilize training for billion-scale tokenizers. By scaling to 3 billion parameters, GigaTok achieves state-of-the-art performance in reconstruction, downstream AR generation, and downstream AR representation quality.
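To make the idea of semantic regularization concrete, the following is a minimal sketch of a feature-alignment penalty added to a reconstruction loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the form of the loss, so the cosine-similarity objective, the function names (`semantic_alignment_loss`, `total_loss`), and the `weight` parameter are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def semantic_alignment_loss(tokenizer_feats, encoder_feats):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    ASSUMPTION: the alignment is measured with cosine similarity; the
    abstract only says tokenizer features are aligned with features
    from a pre-trained visual encoder.
    """
    if len(tokenizer_feats) != len(encoder_feats):
        raise ValueError("feature lists must have the same length")
    sims = [
        cosine_similarity(t, e)
        for t, e in zip(tokenizer_feats, encoder_feats)
    ]
    return 1.0 - sum(sims) / len(sims)


def total_loss(recon_loss, tokenizer_feats, encoder_feats, weight=0.5):
    """Reconstruction loss plus a weighted alignment penalty.

    ASSUMPTION: `weight` is a hypothetical hyperparameter chosen for
    illustration; the abstract gives no value.
    """
    penalty = semantic_alignment_loss(tokenizer_feats, encoder_feats)
    return recon_loss + weight * penalty
```

In this sketch the penalty is near zero when the tokenizer features point in the same direction as the frozen encoder's features and grows as they diverge, which is one simple way to discourage the latent space from drifting away from semantically meaningful structure as the tokenizer is scaled.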
