Scaling Image Tokenizers with Grouped Spherical Quantization
December 3, 2024
Authors: Jiangtao Wang, Zhen Qin, Yifan Zhang, Vincent Tao Hu, Björn Ommer, Rania Briq, Stefan Kesselheim
cs.AI
Abstract
Vision tokenizers have gained considerable traction due to their scalability
and compactness; previous works depend on outdated GAN-based hyperparameters,
biased comparisons, and a lack of comprehensive analysis of scaling
behaviour. To tackle these issues, we introduce Grouped Spherical Quantization
(GSQ), featuring spherical codebook initialization and lookup regularization to
constrain the codebook latents to a spherical surface. Our empirical analysis of
image tokenizer training strategies demonstrates that GSQ-GAN achieves superior
reconstruction quality over state-of-the-art methods with fewer training
iterations, providing a solid foundation for scaling studies. Building on this,
we systematically examine the scaling behaviours of GSQ, specifically in latent
dimensionality, codebook size, and compression ratios, and their impact on
model performance. Our findings reveal distinct behaviours at high and low
spatial compression levels, underscoring challenges in representing
high-dimensional latent spaces. We show that GSQ can restructure
high-dimensional latent into compact, low-dimensional spaces, thus enabling
efficient scaling with improved quality. As a result, GSQ-GAN achieves a 16x
down-sampling with a reconstruction FID (rFID) of 0.50.
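The core operation the abstract describes (split each latent vector into groups, constrain sub-vectors and codebook entries to the unit sphere, then match each sub-vector to its nearest codebook entry) can be sketched as below. This is a minimal NumPy illustration, not the authors' implementation: the function name, tensor sizes, and epsilon are assumptions for the sake of the example.

```python
import numpy as np

def grouped_spherical_quantize(z, codebook, groups):
    """Grouped spherical quantization sketch (illustrative, not the paper's code).

    z        : (N, D) latent vectors; D must be divisible by `groups`
    codebook : (K, D // groups) entries, L2-normalized onto the unit sphere
    returns  : (N, D) quantized latents and (N, groups) codebook indices
    """
    n, d = z.shape
    sub = z.reshape(n * groups, d // groups)
    # Project sub-vectors onto the unit sphere, mirroring the spherical
    # constraint the abstract places on the codebook latents.
    sub = sub / (np.linalg.norm(sub, axis=1, keepdims=True) + 1e-8)
    # On the unit sphere, nearest-neighbour search by Euclidean distance
    # reduces to maximizing the inner product with each codebook entry.
    idx = np.argmax(sub @ codebook.T, axis=1)
    zq = codebook[idx].reshape(n, d)
    return zq, idx.reshape(n, groups)

# Hypothetical sizes: 8 latents of dim 16, 4 groups, a 256-entry codebook.
rng = np.random.default_rng(0)
cb = rng.normal(size=(256, 4))
cb = cb / np.linalg.norm(cb, axis=1, keepdims=True)  # spherical init
zq, idx = grouped_spherical_quantize(rng.normal(size=(8, 16)), cb, groups=4)
```

Grouping is what lets a high-dimensional latent be restructured into several compact low-dimensional lookups, which is the mechanism the abstract credits for efficient scaling.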