DDT: Decoupled Diffusion Transformer
April 8, 2025
Authors: Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang
cs.AI
Abstract
Diffusion transformers have demonstrated remarkable generation quality,
albeit requiring longer training iterations and numerous inference steps. In
each denoising step, diffusion transformers encode the noisy inputs to extract
the lower-frequency semantic component and then decode the higher-frequency
component with the same modules. This scheme creates an inherent optimization dilemma:
encoding low-frequency semantics necessitates reducing high-frequency
components, creating tension between semantic encoding and high-frequency
decoding. To resolve this challenge, we propose a new Decoupled Diffusion
Transformer (DDT), with a decoupled
design of a dedicated condition encoder for semantic extraction alongside a
specialized velocity decoder. Our experiments reveal that a more substantial
encoder yields performance improvements as model size increases. For ImageNet
256×256, our DDT-XL/2 achieves a new state-of-the-art performance of
1.31 FID (nearly 4× faster training convergence compared to previous
diffusion transformers). For ImageNet 512×512, our DDT-XL/2 achieves a
new state-of-the-art FID of 1.28. Additionally, as a beneficial by-product, our
decoupled architecture enhances inference speed by enabling the sharing of the
self-condition between adjacent denoising steps. To minimize performance
degradation, we propose a novel statistical dynamic programming approach to
identify optimal sharing strategies.
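
The encoder-sharing idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (sample_with_shared_condition, toy_encoder, toy_decoder), the plain Euler integration from t = 0 to t = 1, and the toy stand-ins for the two networks. The set of steps on which the encoder is recomputed is supplied by hand here; the statistical dynamic programming search that the abstract mentions is not implemented.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def sample_with_shared_condition(
    x: Vector,
    encoder: Callable[[Vector, float], Vector],
    decoder: Callable[[Vector, float, Vector], Vector],
    num_steps: int,
    recompute_steps: Sequence[int],
) -> Tuple[Vector, int]:
    """Euler sampling loop that reuses the encoder output across steps.

    The encoder runs only on the step indices in `recompute_steps`
    (plus step 0); on every other step the most recent condition is
    reused. The decoder runs on every step.
    """
    recompute = set(recompute_steps) | {0}  # step 0 always needs a fresh condition
    dt = 1.0 / num_steps
    condition: Vector = []
    encoder_calls = 0
    for i in range(num_steps):
        t = i * dt
        if i in recompute:
            condition = encoder(x, t)  # refresh the self-condition
            encoder_calls += 1
        velocity = decoder(x, t, condition)  # predict velocity from the (possibly stale) condition
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x, encoder_calls


# Hypothetical stand-ins for the two networks, for demonstration only.
def toy_encoder(x: Vector, t: float) -> Vector:
    return [sum(x) / len(x)]  # a single "semantic" number: the mean


def toy_decoder(x: Vector, t: float, condition: Vector) -> Vector:
    return [condition[0] - xi for xi in x]  # velocity pointing toward the condition


x_final, calls = sample_with_shared_condition(
    [1.0, -1.0, 0.5], toy_encoder, toy_decoder,
    num_steps=8, recompute_steps=[0, 2, 4, 6],
)
print(calls, x_final)
```

In this sketch the encoder is called on half of the steps while the decoder still runs on all of them, which is where the inference saving would come from. Which steps can share a condition with little quality loss is the question the paper's dynamic programming approach is meant to answer.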