Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
January 2, 2025
Authors: Jingfeng Yao, Xinggang Wang
cs.AI
Abstract
Latent diffusion models with Transformer architectures excel at generating
high-fidelity images. However, recent studies reveal an optimization dilemma in
this two-stage design: while increasing the per-token feature dimension in
visual tokenizers improves reconstruction quality, it requires substantially
larger diffusion models and more training iterations to achieve comparable
generation performance. Consequently, existing systems often settle for
sub-optimal solutions, either producing visual artifacts due to information
loss within tokenizers or failing to converge fully due to expensive
computation costs. We argue that this dilemma stems from the inherent
difficulty in learning unconstrained high-dimensional latent spaces. To address
this, we propose aligning the latent space with pre-trained vision foundation
models when training the visual tokenizers. Our proposed VA-VAE (Vision
foundation model Aligned Variational AutoEncoder) significantly expands the
reconstruction-generation frontier of latent diffusion models, enabling faster
convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces.
To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with
improved training strategies and architecture designs, termed LightningDiT. The
integrated system achieves state-of-the-art (SOTA) performance on ImageNet
256x256 generation with an FID score of 1.35 while demonstrating remarkable
training efficiency by reaching an FID score of 2.11 in just 64 epochs, a
more than 21x convergence speedup over the original DiT. Models and code are
available at: https://github.com/hustvl/LightningDiT
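To make the core idea concrete, below is a minimal sketch of what aligning a
tokenizer's latent space with a frozen vision foundation model could look like
during VAE training. It assumes DINOv2 as the foundation model, a learnable
linear projection from latent channels to the foundation model's feature
dimension, and a plain cosine-similarity alignment term; the names
(AlignedLatentLoss, lambda_vf, align_loss) are illustrative, and the paper's
actual alignment loss may be formulated differently.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AlignedLatentLoss(nn.Module):
        """Illustrative alignment term between VAE latents and frozen
        vision-foundation-model (VFM) features. Not the paper's exact loss."""

        def __init__(self, latent_dim: int, vfm_dim: int = 768):
            super().__init__()
            # Learnable projection from per-token latent channels to the
            # foundation model's feature dimension.
            self.proj = nn.Linear(latent_dim, vfm_dim)

        def forward(self, z: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
            # z:         VAE latents flattened to tokens, shape (B, N, latent_dim)
            # vfm_feats: frozen VFM tokens,               shape (B, N, vfm_dim)
            # Assumes the latent token grid has been resized/interpolated to
            # match the foundation model's patch grid.
            z_proj = self.proj(z)
            cos = F.cosine_similarity(z_proj, vfm_feats, dim=-1)  # (B, N)
            return (1.0 - cos).mean()

Usage during tokenizer training might look like the following (vae, dinov2,
recon_loss, kl_loss, and lambda_vf are hypothetical names):

    align_loss = AlignedLatentLoss(latent_dim=32)
    z = vae.encode(images)                      # (B, N, 32)
    with torch.no_grad():
        feats = dinov2(images)                  # frozen, (B, N, 768)
    loss = recon_loss + kl_loss + lambda_vf * align_loss(z, feats)

Because the gradient of the alignment term flows only into the tokenizer (the
foundation model stays frozen), it regularizes the latent space without
directly restricting its dimensionality, which is the abstract's proposed way
out of the reconstruction-generation trade-off.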