Adaptive Caching for Faster Video Generation with Diffusion Transformers
November 4, 2024
Authors: Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Michael S. Ryoo, Tian Xie
cs.AI
Abstract
Generating temporally-consistent high-fidelity videos can be computationally
expensive, especially over longer temporal spans. More recent Diffusion
Transformers (DiTs) -- despite making significant headway in this context --
have only heightened such challenges as they rely on larger models and heavier
attention mechanisms, resulting in slower inference speeds. In this paper, we
introduce a training-free method to accelerate video DiTs, termed Adaptive
Caching (AdaCache), which is motivated by the fact that "not all videos are
created equal": meaning, some videos require fewer denoising steps to attain a
reasonable quality than others. Building on this, we not only cache
computations through the diffusion process, but also devise a caching schedule
tailored to each video generation, maximizing the quality-latency trade-off. We
further introduce a Motion Regularization (MoReg) scheme to utilize video
information within AdaCache, essentially controlling the compute allocation
based on motion content. Altogether, our plug-and-play contributions grant
significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video
generation) without sacrificing the generation quality, across multiple video
DiT baselines.
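The abstract describes two mechanisms that a short sketch can make concrete: reusing cached computations across denoising steps when features change little, and tightening that reuse under high motion (the MoReg intuition). Below is a minimal, hypothetical PyTorch sketch, not the authors' implementation; the names (CachedBlock, motion_score, BASE_THRESHOLD), the relative-change metric, and the per-step threshold rule are all assumptions for illustration, since the paper's actual caching schedule and regularization are not given in this abstract.

```python
# Illustrative sketch of the AdaCache idea, NOT the paper's implementation.
# Assumption: a block's residual (output minus input) can be cached and reused
# on steps where the input features barely changed, with the reuse threshold
# shrunk for high-motion content.

import torch
import torch.nn as nn

class CachedBlock(nn.Module):
    """A toy transformer-style block whose residual can be cached and reused."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.cached_residual = None  # residual reused on "cheap" steps
        self.prev_input = None       # input features from the previous step

    def forward(self, x: torch.Tensor, threshold: float) -> torch.Tensor:
        if self.prev_input is not None and self.cached_residual is not None:
            # Relative change between this step's and the previous step's input.
            change = (x - self.prev_input).norm() / (self.prev_input.norm() + 1e-8)
            if change < threshold:
                # Features barely moved: skip the block, reuse the cached residual.
                return x + self.cached_residual
        residual = self.mlp(x)  # full computation on "expensive" steps
        self.cached_residual = residual.detach()
        self.prev_input = x.detach()
        return x + residual

def motion_score(latents: torch.Tensor) -> float:
    """Crude motion proxy (an assumption): mean magnitude of frame-to-frame
    latent differences. latents has shape (frames, dim)."""
    return (latents[1:] - latents[:-1]).abs().mean().item()

# Toy denoising loop over a (frames, dim) latent "video".
torch.manual_seed(0)
frames, dim, steps = 8, 64, 30
block = CachedBlock(dim)
x = torch.randn(frames, dim)

BASE_THRESHOLD = 0.05  # hypothetical; AdaCache derives its schedule per video
for step in range(steps):
    # Higher motion -> smaller threshold -> fewer cache hits (MoReg intuition).
    threshold = BASE_THRESHOLD / (1.0 + motion_score(x))
    x = block(x, threshold)
```

The point of the sketch is the trade-off the abstract names: static videos trigger many cache hits and thus fewer full block evaluations, while fast-moving content forces recomputation, so compute is allocated where quality would otherwise suffer.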