Mobile Video Diffusion

Abstract

Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we lower memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, named MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi 14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/
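
To give a sense of the tensor the model produces for the 14x512x256 px clip mentioned above, the following is a minimal sketch of the latent shape calculation. It assumes an SVD-style latent autoencoder with 8x spatial downsampling and 4 latent channels, and it reads 512x256 as width x height; none of these details are stated in the abstract, and the function name `latent_shape` is hypothetical.

```python
def latent_shape(frames, width, height, downsample=8, latent_channels=4):
    """Return the (frames, channels, height, width) shape of a video latent.

    Assumptions (not stated in the abstract): the autoencoder downsamples
    each spatial dimension by `downsample` and emits `latent_channels`
    channels per frame, as in SVD-style latent diffusion models.
    """
    if width % downsample or height % downsample:
        raise ValueError("width and height must be divisible by the downsample factor")
    return (frames, latent_channels, height // downsample, width // downsample)


# Clip size quoted in the abstract: 14 frames at 512x256 px.
shape = latent_shape(frames=14, width=512, height=256)
print(shape)
```

Under these assumptions each frame is represented by a 4-channel 32x64 latent, which is the quantity the denoising UNet operates on rather than the full-resolution pixels.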
