Adaptive Caching for Faster Video Generation with Diffusion Transformers

Abstract

Generating temporally consistent, high-fidelity videos can be computationally expensive, especially over longer temporal spans. More recent Diffusion Transformers (DiTs), despite making significant headway in this context, have only heightened such challenges, as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": some videos require fewer denoising steps than others to attain a reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, optimizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g., up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
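
The abstract does not spell out the caching rule, so the following is only an illustrative sketch of the general idea, not the AdaCache implementation. It assumes that a block's residual output can be reused for a few denoising steps when it changed little since the last computation, and that a motion score shortens that reuse. The names `l1_distance`, `steps_to_reuse`, `run_with_adaptive_cache`, the thresholds, and the `(1 + motion)` scaling are all hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def steps_to_reuse(distance: float, motion: float = 0.0) -> int:
    """Map a feature-change measure to a number of steps to reuse the cache.

    Hypothetical rule: a higher motion score inflates the measured change,
    so high-motion videos get fewer cached steps (more compute).
    """
    scaled = distance * (1.0 + motion)
    if scaled < 0.05:  # assumed threshold: features are nearly static
        return 3
    if scaled < 0.2:   # assumed threshold: moderate change
        return 1
    return 0           # large change: recompute at the next step


def run_with_adaptive_cache(
    block: Callable[[List[float], int], List[float]],
    latents: Sequence[float],
    num_steps: int,
    motion: float = 0.0,
) -> Tuple[List[float], int]:
    """Apply `block` residually for `num_steps`, reusing cached residuals.

    Returns the final latents and how many times `block` was evaluated.
    """
    x = list(latents)
    cached: Optional[List[float]] = None
    skip = 0       # remaining steps for which the cache is reused
    computed = 0   # number of real block evaluations
    for step in range(num_steps):
        if cached is not None and skip > 0:
            residual = cached  # reuse: no block evaluation at this step
            skip -= 1
        else:
            residual = block(x, step)
            computed += 1
            if cached is not None:
                # The schedule adapts to how much this generation is changing.
                skip = steps_to_reuse(l1_distance(residual, cached), motion)
            cached = residual
        x = [xi + ri for xi, ri in zip(x, residual)]
    return x, computed


if __name__ == "__main__":
    # Toy block whose residual never changes, so most steps hit the cache.
    static_block = lambda x, step: [0.1] * len(x)
    _, evaluations = run_with_adaptive_cache(static_block, [0.0, 0.0], num_steps=10)
    print(f"block evaluations: {evaluations} of 10 steps")
```

In this toy setting the schedule is not fixed in advance: a slowly changing generation skips most block evaluations, while a rapidly changing one (or one with a high motion score) recomputes at nearly every step.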
