Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Abstract

Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: increasing the per-token feature dimension in the visual tokenizer improves reconstruction quality, but achieving comparable generation performance then requires a substantially larger diffusion model and more training iterations. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss in the tokenizer or failing to converge fully because of high computation costs. We argue that this dilemma stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizer. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35, and it reaches an FID score of 2.11 in just 64 epochs, a convergence speedup of more than 21 times over the original DiT. Models and code are available at: https://github.com/hustvl/LightningDiT
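
The abstract does not state the form of the alignment objective, so the following is only a minimal sketch of one way such an alignment term could look, assuming a per-token cosine-similarity penalty between linearly projected tokenizer latents and features from a frozen foundation model. The function names, the projection matrix `proj`, and the weight `align_weight` are all hypothetical and are not taken from the VA-VAE code.

```python
import numpy as np


def cosine_alignment_loss(latents, foundation_feats, proj, eps=1e-8):
    """Mean (1 - cosine similarity) per token (assumed form, not the paper's).

    latents:          (N, d_z) tokenizer latents, one row per token
    foundation_feats: (N, d_f) features from a frozen foundation model
    proj:             (d_z, d_f) hypothetical linear map into feature space
    """
    projected = latents @ proj
    dot = np.sum(projected * foundation_feats, axis=-1)
    norms = (np.linalg.norm(projected, axis=-1)
             * np.linalg.norm(foundation_feats, axis=-1))
    return float(np.mean(1.0 - dot / (norms + eps)))


def tokenizer_loss(recon_loss, latents, foundation_feats, proj, align_weight=0.1):
    """Reconstruction loss plus a weighted alignment term (weight is an assumption)."""
    return recon_loss + align_weight * cosine_alignment_loss(
        latents, foundation_feats, proj
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(16, 32))   # 16 tokens, 32-dim latent
    feats = rng.normal(size=(16, 64))     # 16 tokens, 64-dim foundation features
    proj = rng.normal(size=(32, 64))
    print(tokenizer_loss(0.5, latents, feats, proj))
```

Under this assumed form, the alignment term is near 0 when projected latents point in the same direction as the foundation features and near 2 when they point in the opposite direction, so minimizing it constrains the otherwise free high-dimensional latent space.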
