Region-Adaptive Sampling for Diffusion Transformers
February 14, 2025
Authors: Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang
cs.AI
Abstract
Diffusion models (DMs) have become the leading choice for generative tasks
across diverse domains. However, their reliance on multiple sequential forward
passes significantly limits real-time performance. Previous acceleration
methods have primarily focused on reducing the number of sampling steps or
reusing intermediate results, failing to leverage variations across spatial
regions within the image due to the constraints of convolutional U-Net
structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in
handling a variable number of tokens, we introduce RAS, a novel, training-free
sampling strategy that dynamically assigns different sampling ratios to regions
within an image based on the focus of the DiT model. Our key observation is
that during each sampling step, the model concentrates on semantically
meaningful regions, and these areas of focus exhibit strong continuity across
consecutive steps. Leveraging this insight, RAS updates only the regions
currently in focus, while other regions are updated using cached noise from the
previous step. The model's focus is determined based on the output from the
preceding step, capitalizing on the temporal consistency we observed. We
evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups of up
to 2.36x and 2.51x, respectively, with minimal degradation in generation
quality. Additionally, a user study shows that RAS delivers comparable
quality under human evaluation while achieving a 1.6x speedup. Our approach
takes a significant step towards more efficient diffusion transformers,
enhancing their potential for real-time applications.
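
The update rule described in the abstract (recompute noise only for the regions in focus, reuse cached noise elsewhere) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the one-scalar-per-token latent, the magnitude-based focus score, and the simple subtraction update are all assumptions made for illustration. The abstract only states that focus is derived from the previous step's output.

```python
from typing import Callable, List, Sequence, Tuple

NoiseModel = Callable[[Sequence[float]], List[float]]


def select_focus(prev_noise: Sequence[float], ratio: float) -> List[int]:
    """Return indices of the top `ratio` fraction of tokens.

    Assumption: tokens are ranked by |previous predicted noise|; the real
    focus metric may differ.
    """
    k = max(1, int(ratio * len(prev_noise)))
    ranked = sorted(range(len(prev_noise)),
                    key=lambda i: abs(prev_noise[i]), reverse=True)
    return sorted(ranked[:k])


def ras_step(latent: Sequence[float],
             cached_noise: Sequence[float],
             predict_noise: NoiseModel,
             ratio: float,
             step_size: float) -> Tuple[List[float], List[float], List[int]]:
    """One hypothetical region-adaptive step over a flat list of tokens."""
    # Focus comes from the previous step's output (the cached noise).
    active = select_focus(cached_noise, ratio)
    # Only the active tokens are passed through the (toy) model.
    fresh = predict_noise([latent[i] for i in active])
    # Inactive tokens keep the noise cached from the previous step.
    noise = list(cached_noise)
    for i, n in zip(active, fresh):
        noise[i] = n
    # Every token is still updated, using fresh or cached noise.
    new_latent = [x - step_size * n for x, n in zip(latent, noise)]
    return new_latent, noise, active


def toy_model(tokens: Sequence[float]) -> List[float]:
    """Stand-in for a DiT forward pass: predicts half of each token value."""
    return [0.5 * t for t in tokens]


# First step is a full pass to fill the cache; the second step is adaptive.
latent0 = [4.0, 0.1, -3.0, 0.2]
noise0 = toy_model(latent0)
latent1 = [x - n for x, n in zip(latent0, noise0)]
latent2, noise1, active = ras_step(latent1, noise0, toy_model,
                                   ratio=0.5, step_size=1.0)
print(active, latent2)
```

In this toy run only half of the tokens reach the model in the second step, which is where the savings would come from in a transformer that accepts a variable number of tokens. A real DiT would also have to handle attention between active and cached tokens, which this sketch ignores.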