Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
January 2, 2025
Authors: Jingfeng Yao, Xinggang Wang
cs.AI
Abstract
Latent diffusion models with Transformer architectures excel at generating
high-fidelity images. However, recent studies reveal an optimization dilemma in
this two-stage design: while increasing the per-token feature dimension in
visual tokenizers improves reconstruction quality, it requires substantially
larger diffusion models and more training iterations to achieve comparable
generation performance. Consequently, existing systems often settle for
sub-optimal solutions, either producing visual artifacts due to information
loss within tokenizers or failing to converge fully due to expensive
computation costs. We argue that this dilemma stems from the inherent
difficulty in learning unconstrained high-dimensional latent spaces. To address
this, we propose aligning the latent space with pre-trained vision foundation
models when training the visual tokenizers. Our proposed VA-VAE (Vision
foundation model Aligned Variational AutoEncoder) significantly expands the
reconstruction-generation frontier of latent diffusion models, enabling faster
convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces.
To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with
improved training strategies and architecture designs, termed LightningDiT. The
integrated system achieves state-of-the-art (SOTA) performance on ImageNet
256x256 generation with an FID score of 1.35 while demonstrating remarkable
training efficiency by reaching an FID score of 2.11 in just 64 epochs, a
more than 21x convergence speedup over the original DiT. Models and code are
available at: https://github.com/hustvl/LightningDiT
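To make the core idea concrete, below is a minimal sketch of what aligning a
tokenizer's latent space with a frozen vision foundation model could look like
during VAE training. It assumes DINOv2 as the foundation model, a learnable
linear projection from latent channels to the foundation model's feature
dimension, and a plain cosine-similarity alignment term; the names
(AlignedLatentLoss, lambda_vf, align_loss) are illustrative, and the paper's
actual alignment loss may be formulated differently.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlignedLatentLoss(nn.Module):
        """Illustrative alignment term between VAE latents and frozen
        vision-foundation-model (VFM) features. Not the paper's exact loss."""

        def __init__(self, latent_dim: int, vfm_dim: int = 768):
            super().__init__()
            # Learnable projection from per-token latent channels to the
            # foundation model's feature dimension.
            self.proj = nn.Linear(latent_dim, vfm_dim)

        def forward(self, z: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
            # z:         VAE latents flattened to tokens, shape (B, N, latent_dim)
            # vfm_feats: frozen VFM tokens,               shape (B, N, vfm_dim)
            # Assumes the latent token grid has been resized/interpolated to
            # match the foundation model's patch grid.
            z_proj = self.proj(z)
            cos = F.cosine_similarity(z_proj, vfm_feats, dim=-1)  # (B, N)
            return (1.0 - cos).mean()

Usage during tokenizer training might look like the following (vae, dinov2,
recon_loss, kl_loss, and lambda_vf are hypothetical names):

    align_loss = AlignedLatentLoss(latent_dim=32)
    z = vae.encode(images)                      # (B, N, 32)
    with torch.no_grad():
        feats = dinov2(images)                  # frozen, (B, N, 768)
    loss = recon_loss + kl_loss + lambda_vf * align_loss(z, feats)

Because the gradient of the alignment term flows only into the tokenizer (the
foundation model stays frozen), it regularizes the latent space without
directly restricting its dimensionality, which is the abstract's proposed way
out of the reconstruction-generation trade-off.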