Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Abstract

Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: increasing the per-token feature dimension in the visual tokenizer improves reconstruction quality, but it then takes substantially larger diffusion models and more training iterations to reach comparable generation performance. Consequently, existing systems often settle for sub-optimal solutions, either producing visual artifacts due to information loss within the tokenizer or failing to converge fully due to expensive computation costs. We argue that this dilemma stems from the inherent difficulty of learning unconstrained high-dimensional latent spaces. To address this, we propose aligning the latent space with pre-trained vision foundation models when training the visual tokenizer. Our proposed VA-VAE (Vision foundation model Aligned Variational AutoEncoder) significantly expands the reconstruction-generation frontier of latent diffusion models, enabling faster convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces. To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with improved training strategies and architecture designs, termed LightningDiT. The integrated system achieves state-of-the-art (SOTA) performance on ImageNet 256x256 generation with an FID score of 1.35, and it reaches an FID score of 2.11 in just 64 epochs, a convergence speedup of more than 21 times over the original DiT. Models and code are available at: https://github.com/hustvl/LightningDiT.
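The abstract does not say which objective is used to align the tokenizer's latent space with a vision foundation model. As a minimal sketch only, the Python below assumes a cosine-similarity penalty between linearly projected latent tokens and frozen foundation-model features. The function names, the projection matrix `weight`, and the toy dimensions are all hypothetical and are not taken from the LightningDiT code.

```python
import math
import random


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def project(token, weight):
    """Map a latent token to the feature dimension.

    `weight` is a hypothetical linear projection of shape
    (feat_dim, latent_dim), stored as a list of rows.
    """
    return [sum(w * x for w, x in zip(row, token)) for row in weight]


def alignment_loss(latent_tokens, foundation_tokens, weight):
    """Mean (1 - cosine similarity) over corresponding tokens.

    Assumed form of an alignment term: it is 0 when every projected
    latent points in the same direction as its foundation feature,
    and 2 when every pair points in opposite directions.
    """
    losses = [
        1.0 - cosine_similarity(project(z, weight), f)
        for z, f in zip(latent_tokens, foundation_tokens)
    ]
    return sum(losses) / len(losses)


# Toy data: random latents and a random projection (illustrative sizes only).
random.seed(0)
latent_dim, feat_dim, num_tokens = 4, 8, 16
weight = [[random.gauss(0, 1) for _ in range(latent_dim)] for _ in range(feat_dim)]
latents = [[random.gauss(0, 1) for _ in range(latent_dim)] for _ in range(num_tokens)]

# Features that match the projected latents exactly, and their negation.
aligned_feats = [project(z, weight) for z in latents]
opposed_feats = [[-x for x in f] for f in aligned_feats]

loss_aligned = alignment_loss(latents, aligned_feats, weight)
loss_opposed = alignment_loss(latents, opposed_feats, weight)
```

In a tokenizer training loop, a term of this kind would presumably be added to the usual reconstruction objective with some weighting, and computed with a tensor library rather than Python lists. The weighting and the exact loss form are left open here because the abstract does not state them.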
