SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers
November 15, 2024
Authors: Joseph Liu, Joshua Geddes, Ziyu Guo, Haomiao Jiang, Mahesh Kumar Nandwana
cs.AI
Abstract
Diffusion Transformers (DiT) have emerged as powerful generative models for
various tasks, including image, video, and speech synthesis. However, their
inference process remains computationally expensive due to the repeated
evaluation of resource-intensive attention and feed-forward modules. To address
this, we introduce SmoothCache, a model-agnostic inference acceleration
technique for DiT architectures. SmoothCache leverages the observed high
similarity between layer outputs across adjacent diffusion timesteps. By
analyzing layer-wise representation errors from a small calibration set,
SmoothCache adaptively caches and reuses key features during inference. Our
experiments demonstrate that SmoothCache achieves 8% to 71% speed up while
maintaining or even improving generation quality across diverse modalities. We
showcase its effectiveness on DiT-XL for image generation, Open-Sora for
text-to-video, and Stable Audio Open for text-to-audio, highlighting its
potential to enable real-time applications and broaden the accessibility of
powerful DiT models.
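The caching idea described in the abstract — compute per-timestep representation errors on a calibration set, then reuse a layer's cached output at timesteps where the error is below a threshold — can be sketched as follows. This is a minimal illustration, not the paper's implementation; `build_cache_schedule`, `CachedLayer`, and the threshold rule are illustrative assumptions.

```python
def build_cache_schedule(errors, threshold):
    """Decide per timestep whether to reuse the cached output.

    `errors[t]` is the calibration-set representation error between the
    layer's outputs at timesteps t and t-1 (an assumed error metric).
    Reuse is allowed when that error is below `threshold`; timestep 0
    always computes fresh, since there is nothing cached yet.
    """
    return [t > 0 and errors[t] < threshold for t in range(len(errors))]


class CachedLayer:
    """Wraps an expensive module (e.g. attention or feed-forward) and
    skips its evaluation at timesteps the schedule marks as reusable."""

    def __init__(self, layer, schedule):
        self.layer = layer          # the expensive module to accelerate
        self.schedule = schedule    # list of bools, one per timestep
        self.cached = None          # last computed output

    def __call__(self, x, t):
        if self.schedule[t] and self.cached is not None:
            return self.cached      # reuse: skip the expensive module
        self.cached = self.layer(x)  # compute fresh and cache
        return self.cached
```

A quick usage example: with calibration errors `[0.0, 0.01, 0.5, 0.02]` and threshold `0.1`, the schedule reuses the cache at timesteps 1 and 3 and recomputes at 0 and 2, trading a small approximation error for skipped module evaluations.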