D^2iT: Dynamic Diffusion Transformer for Accurate Image Generation
April 13, 2025
Authors: Weinan Jia, Mengqi Huang, Nan Chen, Lei Zhang, Zhendong Mao
cs.AI
Abstract
Diffusion models are widely recognized for their ability to generate
high-fidelity images. Despite the excellent performance and scalability of the
Diffusion Transformer (DiT) architecture, it applies fixed compression across
different image regions during the diffusion process, disregarding the
naturally varying information densities present in these regions. Excessive
compression limits local realism, while insufficient compression increases
computational complexity and compromises global consistency, ultimately
degrading the quality of generated images. To address these
limitations, we propose dynamically compressing different image regions by
recognizing the importance of different regions, and introduce a novel
two-stage framework designed to enhance the effectiveness and efficiency of
image generation: (1) In the first stage, the Dynamic VAE (DVAE) employs a
hierarchical encoder to encode different image regions at different
downsampling rates, tailored to their specific information densities, thereby
providing more accurate and natural latent codes for the diffusion process.
(2) In the second stage, the Dynamic Diffusion Transformer (D^2iT) generates
images by predicting multi-grained noise, consisting of coarse-grained noise
(fewer latent codes in smooth regions) and fine-grained noise (more latent
codes in detailed regions), through a novel combination of the Dynamic Grain
Transformer and the Dynamic Content Transformer. The strategy of combining a
rough prediction of noise with correction in detailed regions achieves a
unification of global consistency and local
realism. Comprehensive experiments on various generation tasks validate the
effectiveness of our approach. Code will be released at
https://github.com/jiawn-creator/Dynamic-DiT.
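The core idea of the first stage — encoding smooth regions with fewer latent codes and detailed regions with more — can be sketched as follows. This is only an illustrative toy: the variance-based density proxy, the threshold, the patch size, and the average-pooling "encoder" are all assumptions standing in for the learned hierarchical DVAE encoder described in the abstract.

```python
import numpy as np

def region_grain_map(image, patch=16, threshold=0.5):
    """Assign each patch a grain level (0 = coarse, 1 = fine) from a simple
    variance-based information-density proxy. Illustrative only: the paper's
    DVAE learns region importance rather than thresholding pixel statistics."""
    H, W = image.shape[:2]
    rows, cols = H // patch, W // patch
    densities = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            block = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            densities[i, j] = block.std()
    # Normalize densities to [0, 1] and threshold into coarse vs. fine grains.
    span = densities.max() - densities.min()
    norm = (densities - densities.min()) / (span + 1e-8)
    return (norm > threshold).astype(int)

def dynamic_downsample(image, grains, patch=16, coarse=8, fine=2):
    """Encode each patch at a grain-dependent downsampling rate, producing
    fewer latent codes in smooth regions and more in detailed regions.
    Average pooling stands in for a learned hierarchical encoder."""
    codes = []
    for i in range(grains.shape[0]):
        for j in range(grains.shape[1]):
            block = image[i * patch:(i + 1) * patch, j * patch:(j + 1) * patch]
            f = coarse if grains[i, j] == 0 else fine
            pooled = block.reshape(patch // f, f, patch // f, f).mean(axis=(1, 3))
            codes.append(pooled.ravel())  # (patch/f)^2 codes for this patch
    return codes
```

With these toy settings, a smooth 16x16 patch yields 4 latent codes while a textured patch yields 64, mirroring the coarse-grained/fine-grained split that D^2iT's noise prediction then operates on.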