Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
January 2, 2025
Authors: Jingfeng Yao, Xinggang Wang
cs.AI
Abstract
Latent diffusion models with Transformer architectures excel at generating
high-fidelity images. However, recent studies reveal an optimization dilemma in
this two-stage design: while increasing the per-token feature dimension in
visual tokenizers improves reconstruction quality, it requires substantially
larger diffusion models and more training iterations to achieve comparable
generation performance. Consequently, existing systems often settle for
sub-optimal solutions, either producing visual artifacts due to information
loss within tokenizers or failing to converge fully due to expensive
computation costs. We argue that this dilemma stems from the inherent
difficulty in learning unconstrained high-dimensional latent spaces. To address
this, we propose aligning the latent space with pre-trained vision foundation
models when training the visual tokenizers. Our proposed VA-VAE (Vision
foundation model Aligned Variational AutoEncoder) significantly expands the
reconstruction-generation frontier of latent diffusion models, enabling faster
convergence of Diffusion Transformers (DiT) in high-dimensional latent spaces.
To exploit the full potential of VA-VAE, we build an enhanced DiT baseline with
improved training strategies and architecture designs, termed LightningDiT. The
integrated system achieves state-of-the-art (SOTA) performance on ImageNet
256x256 generation with an FID score of 1.35 while demonstrating remarkable
training efficiency by reaching an FID score of 2.11 in just 64 epochs, a more
than 21-fold convergence speedup compared to the original DiT. Models and code
are available at:
https://github.com/hustvl/LightningDiT.
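
The abstract says the tokenizer's latent space is aligned with a pre-trained vision foundation model during training, but it does not state the objective. The sketch below illustrates one plausible form of such an alignment term, assuming a per-token cosine-similarity penalty added to the reconstruction loss. The linear projection, the function names, and the `align_weight` value are all hypothetical and are not taken from the paper or its repository; plain Python is used only to keep the example dependency-free.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero


def project(latent, weight):
    """Hypothetical linear map from the latent dim to the foundation-feature dim."""
    return [sum(w * z for w, z in zip(row, latent)) for row in weight]


def alignment_loss(latents, foundation_feats, weight):
    """Mean (1 - cosine similarity) over tokens; near 0 when directions match."""
    losses = [
        1.0 - cosine_similarity(project(z, weight), f)
        for z, f in zip(latents, foundation_feats)
    ]
    return sum(losses) / len(losses)


def tokenizer_loss(recon_loss, align_loss, align_weight=0.1):
    """Assumed total objective: reconstruction plus a weighted alignment term."""
    return recon_loss + align_weight * align_loss


# Two tokens with 2-dim latents, projected by an identity matrix (illustrative).
latents = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
frozen_feats = [[2.0, 0.0], [0.0, 3.0]]  # same directions, different scales
print(alignment_loss(latents, frozen_feats, identity))
```

Because the penalty depends only on direction, the features from the frozen foundation model constrain the structure of the latent space without fixing its scale. This is one simple way to picture a high-dimensional latent space that is no longer unconstrained.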