Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective
October 16, 2024
Authors: Yongxin Zhu, Bocheng Li, Hang Zhang, Xin Li, Linli Xu, Lidong Bing
cs.AI
Abstract
Latent-based image generative models, such as Latent Diffusion Models (LDMs)
and Mask Image Models (MIMs), have achieved notable success in image generation
tasks. These models typically leverage reconstructive autoencoders like VQGAN
or VAE to encode pixels into a more compact latent space and learn the data
distribution in the latent space instead of directly from pixels. However, this
practice raises a pertinent question: Is it truly the optimal choice? In
response, we begin with an intriguing observation: despite sharing the same
latent space, autoregressive models significantly lag behind LDMs and MIMs in
image generation. This finding contrasts sharply with the field of NLP, where
the autoregressive model GPT has established a commanding presence. To address
this discrepancy, we introduce a unified perspective on the relationship
between latent space and generative models, emphasizing the stability of latent
space in image generative modeling. Furthermore, we propose a simple but
effective discrete image tokenizer to stabilize the latent space for image
generative modeling. Experimental results show that image autoregressive
modeling with our tokenizer (DiGIT) benefits both image understanding and image
generation with the next token prediction principle, which is inherently
straightforward for GPT models but challenging for other generative models.
Remarkably, for the first time, a GPT-style autoregressive model for images
outperforms LDMs, and it exhibits substantial GPT-like improvements as the
model size scales up. Our findings underscore the potential of an optimized
latent space and the integration of discrete tokenization in advancing the
capabilities of image generative models. The code is available at
https://github.com/DAMO-NLP-SG/DiGIT.
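To make the next-token-prediction principle mentioned above concrete, below is a minimal, self-contained PyTorch sketch of GPT-style autoregressive training over discrete image tokens. The vocabulary size, sequence length, model dimensions, and the randomly generated stand-in tokens are illustrative assumptions only; they are not DiGIT's actual tokenizer, architecture, or hyperparameters (see the repository above for the real implementation).

```python
# Minimal sketch of GPT-style next-token prediction over discrete image tokens.
# All sizes and the random stand-in tokens are illustrative placeholders.
import torch
import torch.nn as nn

VOCAB_SIZE = 8192   # assumed codebook size of a discrete image tokenizer
SEQ_LEN = 256       # e.g. a 16x16 grid of latent tokens, flattened row-major


class TinyImageGPT(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids produced by an image tokenizer
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # causal mask: each position attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (batch, seq_len, vocab_size) logits


model = TinyImageGPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# Stand-in for token ids that a discrete image tokenizer would produce.
tokens = torch.randint(0, VOCAB_SIZE, (8, SEQ_LEN))

# Next-token prediction: predict token i+1 from tokens 0..i.
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
```

At inference time, such a model would sample one token at a time conditioned on the previously generated tokens, and the tokenizer's decoder would map the completed token grid back to pixels.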