OSV: One Step is Enough for High-Quality Image to Video Generation
September 17, 2024
Authors: Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang
cs.AI
Abstract
Video diffusion models have shown great potential in generating high-quality
videos, making them an increasingly popular focus of research. However, their
inherent iterative nature leads to substantial computational and time costs.
While efforts have been made to accelerate video diffusion, for example by
reducing inference steps through techniques such as consistency distillation
and by GAN training, these approaches often fall short in either performance
or training stability. In this work, we introduce a two-stage training
framework that effectively combines consistency distillation with GAN training
to address these challenges. Additionally, we propose a novel video
discriminator design that eliminates the need to decode the video latents and
improves the final performance. Our model is capable of producing high-quality
videos in merely one step, with the flexibility to perform multi-step
refinement for further performance enhancement. Our quantitative evaluation on
the OpenWebVid-1M benchmark shows that our model significantly outperforms
existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the
8-step performance of the consistency-distillation-based method AnimateLCM
(FVD 184.79), and approaches the 25-step performance of the advanced Stable
Video Diffusion (FVD 156.94).
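
The abstract's efficiency claim hinges on a discriminator that scores video latents directly, skipping the VAE decode during adversarial training. Below is a minimal, hypothetical PyTorch sketch of that idea paired with a standard hinge-loss GAN step; all module names, shapes, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): a discriminator that judges
# *video latents* directly, so no VAE decode is needed for GAN training.
# All shapes and hyperparameters below are assumptions for demonstration.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentVideoDiscriminator(nn.Module):
    """Scores video latents of shape (B, C, T, H, W) without decoding to RGB."""

    def __init__(self, latent_channels: int = 4, width: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            # Spatio-temporal 3D convolutions over the latent grid.
            nn.Conv3d(latent_channels, width, 3, stride=(1, 2, 2), padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(width, width * 2, 3, stride=(2, 2, 2), padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(width * 2, width * 4, 3, stride=(2, 2, 2), padding=1),
            nn.LeakyReLU(0.2),
        )
        # Patch-wise real/fake logits rather than a single scalar.
        self.head = nn.Conv3d(width * 4, 1, kernel_size=1)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.head(self.net(latents))


def hinge_d_loss(real_logits, fake_logits):
    # Hinge loss for the discriminator update.
    return F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()


def hinge_g_loss(fake_logits):
    # The one-step generator is trained to push fake logits up.
    return -fake_logits.mean()


if __name__ == "__main__":
    # Toy shapes: batch 2, 4 latent channels, 8 latent frames, 32x32 grid.
    real = torch.randn(2, 4, 8, 32, 32)  # latents of encoded real videos
    fake = torch.randn(2, 4, 8, 32, 32)  # latents from the one-step student

    disc = LatentVideoDiscriminator()
    d_loss = hinge_d_loss(disc(real), disc(fake.detach()))
    g_loss = hinge_g_loss(disc(fake))
    print(f"d_loss={d_loss.item():.3f}  g_loss={g_loss.item():.3f}")
```

Operating on latents rather than decoded frames keeps the discriminator's input roughly an order of magnitude smaller in each spatial dimension and removes the VAE decoder from the training graph, which is what makes the adversarial fine-tuning stage cheap enough to stack on top of consistency distillation.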