GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters for Autoregressive Image Generation
April 11, 2025
Authors: Tianwei Xiong, Jun Hao Liew, Zilong Huang, Jiashi Feng, Xihui Liu
cs.AI
Abstract
In autoregressive (AR) image generation, visual tokenizers compress images
into compact discrete latent tokens, enabling efficient training of downstream
autoregressive models for visual generation via next-token prediction. While
scaling visual tokenizers improves image reconstruction quality, it often
degrades downstream generation quality -- a challenge not adequately addressed
in existing literature. To address this, we introduce GigaTok, the first
approach to simultaneously improve image reconstruction, generation, and
representation learning when scaling visual tokenizers. We identify the growing
complexity of latent space as the key factor behind the reconstruction vs.
generation dilemma. To mitigate this, we propose semantic regularization, which
aligns tokenizer features with semantically consistent features from a
pre-trained visual encoder. This constraint prevents excessive latent space
complexity during scaling, yielding consistent improvements in both
reconstruction and downstream autoregressive generation. Building on semantic
regularization, we explore three key practices for scaling tokenizers: (1) using
1D tokenizers for better scalability, (2) prioritizing decoder scaling when
expanding both encoder and decoder, and (3) employing entropy loss to stabilize
training for billion-scale tokenizers. By scaling to 3 billion
parameters, GigaTok achieves state-of-the-art performance in reconstruction,
downstream AR generation, and downstream AR representation quality.
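
The abstract describes semantic regularization only at a high level: tokenizer features are aligned with features from a pre-trained visual encoder. As a rough illustration of what such an alignment term could look like, the sketch below computes a mean cosine-distance loss between paired feature vectors. The cosine form, the function names, and the plain-list feature layout are all assumptions made for illustration; the abstract does not state the exact loss GigaTok uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Treat a zero vector as carrying no directional information.
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_alignment_loss(tokenizer_feats, encoder_feats):
    """Hypothetical alignment term: 1 minus the mean cosine similarity.

    tokenizer_feats: one feature vector per token from the tokenizer
        (assumed already projected to the encoder's feature dimension).
    encoder_feats: matching feature vectors from a frozen, pre-trained
        visual encoder.
    Returns 0.0 when every pair points the same way, 2.0 when every pair
    points in opposite directions.
    """
    if not tokenizer_feats or len(tokenizer_feats) != len(encoder_feats):
        raise ValueError("feature lists must be non-empty and equally long")
    sims = [cosine_similarity(t, e) for t, e in zip(tokenizer_feats, encoder_feats)]
    return 1.0 - sum(sims) / len(sims)
```

In a training setup of this kind, such a term would typically be added to the reconstruction objective with a weighting coefficient, so that the latent space is pulled toward the encoder's semantics while the tokenizer still learns to reconstruct the image.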