
Scaling Image Tokenizers with Grouped Spherical Quantization

December 3, 2024
Authors: Jiangtao Wang, Zhen Qin, Yifan Zhang, Vincent Tao Hu, Björn Ommer, Rania Briq, Stefan Kesselheim
cs.AI

Abstract

Vision tokenizers have gained significant traction due to their scalability and compactness; previous works depend on old-school GAN-based hyperparameters, biased comparisons, and a lack of comprehensive analysis of scaling behaviour. To tackle these issues, we introduce Grouped Spherical Quantization (GSQ), featuring spherical codebook initialization and lookup regularization to constrain the codebook latents to a spherical surface. Our empirical analysis of image tokenizer training strategies demonstrates that GSQ-GAN achieves superior reconstruction quality over state-of-the-art methods with fewer training iterations, providing a solid foundation for scaling studies. Building on this, we systematically examine the scaling behaviour of GSQ, specifically latent dimensionality, codebook size, and compression ratio, and their impact on model performance. Our findings reveal distinct behaviours at high and low spatial compression levels, underscoring the challenges of representing high-dimensional latent spaces. We show that GSQ can restructure high-dimensional latents into compact, low-dimensional spaces, thus enabling efficient scaling with improved quality. As a result, GSQ-GAN achieves 16x down-sampling with a reconstruction FID (rFID) of 0.50.
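To make the core idea concrete, the following is a minimal sketch of grouped spherical quantization as described in the abstract: the latent is split into groups, each group and each codebook entry is projected onto the unit sphere, and the nearest spherical code is looked up per group. This is an illustrative reconstruction, not the authors' implementation; the function name, tensor shapes, and the use of cosine similarity for the lookup are assumptions.

```python
import torch
import torch.nn.functional as F

def grouped_spherical_quantize(z, codebook, num_groups):
    """Hypothetical GSQ-style lookup (illustrative, not the paper's code).

    z:        (batch, dim) latent vectors
    codebook: (codebook_size, dim // num_groups) code vectors
    """
    b, d = z.shape
    gd = d // num_groups
    # Split each latent into `num_groups` lower-dimensional sub-vectors.
    z_g = z.view(b, num_groups, gd)
    # Constrain both latents and codebook entries to the unit sphere.
    z_g = F.normalize(z_g, dim=-1)
    cb = F.normalize(codebook, dim=-1)
    # On the unit sphere, maximizing cosine similarity is equivalent to
    # minimizing Euclidean distance, so argmax gives the nearest code.
    sim = torch.einsum('bgd,kd->bgk', z_g, cb)
    idx = sim.argmax(dim=-1)           # (batch, num_groups) code indices
    q = cb[idx]                        # quantized groups, (b, num_groups, gd)
    return q.reshape(b, d), idx
```

A training setup would additionally need a straight-through gradient estimator and the paper's lookup regularization, which are omitted here.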

