D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation
April 13, 2025
Authors: Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao
cs.AI
Abstract
Diffusion models are widely recognized for their ability to generate
high-fidelity images. Despite the excellent performance and scalability of the
Diffusion Transformer (DiT) architecture, it applies fixed compression across
different image regions during the diffusion process, disregarding the
naturally varying information densities present in these regions. However,
excessive compression limits local realism, while insufficient compression
increases computational complexity and compromises global consistency,
ultimately degrading the quality of generated images. To address these
limitations, we propose dynamically compressing image regions according to
their importance, and introduce a novel
two-stage framework designed to enhance the effectiveness and efficiency of
image generation: (1) In the first stage, Dynamic VAE (DVAE) employs a
hierarchical encoder to encode different image regions at different
downsampling rates, tailored to their specific information densities, thereby
providing more accurate and natural latent codes for the diffusion process.
(2) In the second stage, the Dynamic Diffusion Transformer (D^2iT) generates
images by predicting multi-grained noise, consisting of coarse-grained noise
(fewer latent codes in smooth regions) and fine-grained noise (more latent
codes in detailed regions), through a novel combination of the Dynamic Grain
Transformer and the Dynamic Content Transformer. The strategy of combining a
rough prediction of noise with correction in detailed regions achieves a
unification of global consistency and local
realism. Comprehensive experiments on various generation tasks validate the
effectiveness of our approach. Code will be released at
https://github.com/jiawn-creator/Dynamic-DiT.
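To make the idea of density-dependent compression concrete, here is a minimal NumPy sketch of assigning a coarse or fine grain to each image patch. This is an illustrative assumption, not the paper's actual method: the paper's DVAE learns the grain assignment end-to-end, whereas `assign_grains`, the patch size, and the variance threshold below are all hypothetical stand-ins, using local variance as a crude proxy for information density.

```python
import numpy as np

def assign_grains(image, patch=8, thresh=0.01):
    """Toy grain assignment: patches with high local variance are treated as
    detailed regions (fine grain, more latent codes); smooth patches get a
    coarse grain (fewer latent codes). Variance is a crude proxy for the
    information density the paper's learned DVAE would estimate."""
    h, w = image.shape
    grains = np.zeros((h // patch, w // patch), dtype=int)
    for i in range(h // patch):
        for j in range(w // patch):
            block = image[i * patch:(i + 1) * patch,
                          j * patch:(j + 1) * patch]
            grains[i, j] = int(block.var() > thresh)  # 0 = coarse, 1 = fine
    return grains

# Smooth left half (constant), detailed right half (checkerboard pattern).
img = np.zeros((32, 32))
img[:, 16:] = np.indices((32, 16)).sum(axis=0) % 2
grain_map = assign_grains(img)
# Left patches come out coarse, right patches fine.
```

A learned encoder would then spend a higher-resolution latent grid on the fine-grained patches and a lower-resolution one on the coarse patches, which is the allocation behavior the abstract describes.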