Adaptive Caching for Faster Video Generation with Diffusion Transformers

November 4, 2024
Authors: Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Michael S. Ryoo, Tian Xie
cs.AI

Abstract

Generating temporally-consistent, high-fidelity videos can be computationally expensive, especially over longer temporal spans. Recent Diffusion Transformers (DiTs), despite making significant headway in this context, have only heightened such challenges, as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": some videos require fewer denoising steps than others to attain a reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g., up to 4.7x on Open-Sora 720p, 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
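To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of the core idea: a transformer block's residual output is cached and reused across denoising steps until its input has drifted past a threshold, at which point it is recomputed, and a motion-dependent factor (standing in for MoReg) forces high-motion videos to refresh the cache more often. The class name `CachedBlock`, the relative-L1 `change_rate` metric, the `threshold` value, and the `motion_reg` scaling are all illustrative assumptions rather than the paper's exact formulation.

```python
import torch

class CachedBlock(torch.nn.Module):
    """Wraps a DiT block and reuses its cached residual across denoising steps.

    Illustrative sketch of the AdaCache idea; names and metrics are assumptions.
    """

    def __init__(self, block, threshold=0.05):
        super().__init__()
        self.block = block
        self.threshold = threshold
        self.cached_residual = None  # residual saved at the last full compute
        self.prev_input = None       # block input seen at the last full compute

    def change_rate(self, x):
        # Relative L1 distance between the current input and the input at the
        # last recompute -- a stand-in for the paper's distance metric that
        # decides how long a cached result remains valid.
        denom = self.prev_input.abs().mean() + 1e-8
        return ((x - self.prev_input).abs().mean() / denom).item()

    def forward(self, x, motion_reg=1.0):
        # motion_reg > 1 for high-motion content: it inflates the measured
        # change so the cache is refreshed more often (a MoReg-style effect).
        if self.cached_residual is None or self.change_rate(x) * motion_reg > self.threshold:
            self.cached_residual = self.block(x) - x  # full compute: refresh cache
            self.prev_input = x.detach()
        return x + self.cached_residual               # cheap step: reuse residual
```

In this spirit, a motion score feeding `motion_reg` could be estimated from frame-to-frame differences of the noisy video latents during sampling, so that static clips tolerate longer cache lifetimes while dynamic clips recompute more frequently.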
