Scaling Image Tokenizers with Grouped Spherical Quantization
December 3, 2024
Authors: Jiangtao Wang, Zhen Qin, Yifan Zhang, Vincent Tao Hu, Björn Ommer, Rania Briq, Stefan Kesselheim
cs.AI
Abstract
Vision tokenizers have attracted considerable attention due to their scalability
and compactness; previous works depend on outdated GAN-based hyperparameters,
biased comparisons, and lack a comprehensive analysis of scaling behaviours. To
tackle these issues, we introduce Grouped Spherical Quantization (GSQ),
featuring spherical codebook initialization and lookup regularization to
constrain codebook latents to a spherical surface. Our empirical analysis of
image tokenizer training strategies demonstrates that GSQ-GAN achieves superior
reconstruction quality over state-of-the-art methods in fewer training
iterations, providing a solid foundation for scaling studies. Building on this,
we systematically examine the scaling behaviours of GSQ, specifically latent
dimensionality, codebook size, and compression ratio, and their impact on
model performance. Our findings reveal distinct behaviours at high and low
spatial compression levels, underscoring the challenge of representing
high-dimensional latent spaces. We show that GSQ can restructure
high-dimensional latents into compact, low-dimensional spaces, thus enabling
efficient scaling with improved quality. As a result, GSQ-GAN achieves 16x
down-sampling with a reconstruction FID (rFID) of 0.50.
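The core ideas in the abstract (spherical codebook initialization, latents constrained to the unit sphere, and grouping a high-dimensional latent into compact low-dimensional sub-vectors that are quantized independently) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, shapes, and the use of cosine-similarity nearest-neighbor lookup are assumptions made for clarity.

```python
import numpy as np

def grouped_spherical_quantize(z, codebook, groups):
    """Illustrative sketch: split the channel dim of latent z into `groups`
    sub-vectors and match each to its nearest codeword on the unit sphere.
    Shapes: z is (n, d); codebook is (K, d // groups). Because both latents
    and codewords are L2-normalized, nearest-neighbor lookup reduces to
    maximum cosine similarity (a dot product on the sphere)."""
    n, d = z.shape
    gd = d // groups
    # reshape to (n * groups, gd): each group is quantized independently
    zg = z.reshape(n * groups, gd)
    # project latents and codewords onto the unit sphere
    zg = zg / (np.linalg.norm(zg, axis=1, keepdims=True) + 1e-8)
    cb = codebook / (np.linalg.norm(codebook, axis=1, keepdims=True) + 1e-8)
    # nearest codeword per group by cosine similarity
    idx = np.argmax(zg @ cb.T, axis=1)
    # gather codewords and stitch the groups back into full latents
    zq = cb[idx].reshape(n, d)
    return zq, idx.reshape(n, groups)

# spherical codebook initialization: random Gaussian, normalized at lookup
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 4))  # K=256 codewords of group dim 4
z = rng.normal(size=(8, 16))          # 8 latents, d=16 -> 4 groups of 4
zq, idx = grouped_spherical_quantize(z, codebook, groups=4)
```

Each latent of dimension 16 is thus represented by 4 codebook indices rather than one, which is how grouping lets a modest per-group codebook cover a high-dimensional latent space.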