Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach
October 4, 2024
Authors: Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, Jean-Michel Morel
cs.AI
Abstract
Diffusion models have revolutionized image generation, and their extension to video generation has shown promise. However, current video diffusion models (VDMs) rely on a scalar timestep variable applied at the clip level, which limits their ability to model the complex temporal dependencies needed for tasks like image-to-video generation. To address this limitation, we propose a frame-aware video diffusion model (FVDM), which introduces a novel vectorized timestep variable (VTV). Unlike conventional VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies. FVDM's flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot methods. Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks. By addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm in video synthesis, offering a robust framework with significant implications for generative modeling and multimedia applications.
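To make the core idea concrete, below is a minimal sketch (not the authors' implementation) contrasting a clip-level scalar timestep with a per-frame vectorized timestep when applying forward diffusion noise to a video clip. All names (`noise_video`, `alphas_cumprod`, the linear beta schedule, and the frame count) are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of scalar vs. vectorized (per-frame) diffusion timesteps.
# Assumes a standard DDPM-style forward process; not the FVDM codebase.
import torch

T = 1000  # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)             # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def noise_video(x0, t):
    """Apply the forward process q(x_t | x_0) frame by frame.

    x0: clean video, shape (F, C, H, W)
    t:  a scalar timestep (conventional VDM) or a length-F vector of
        per-frame timesteps (the vectorized timestep idea).
    """
    t = torch.as_tensor(t).long().reshape(-1)      # () -> (1,) or (F,)
    a = alphas_cumprod[t].view(-1, 1, 1, 1)        # broadcast over frames
    noise = torch.randn_like(x0)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * noise

x0 = torch.randn(16, 3, 32, 32)                    # 16-frame toy clip

# Conventional VDM: every frame in the clip shares one timestep.
xt_scalar = noise_video(x0, t=500)

# Vectorized timestep: each frame gets its own timestep, e.g. for
# image-to-video the first frame can stay nearly clean while later
# frames are noised more heavily.
t_vec = torch.linspace(0, T - 1, 16).long()
xt_vector = noise_video(x0, t_vec)
```

The point of the sketch is the shape of `t`: letting it vary per frame is what allows the independent noise schedules described in the abstract, and different VTV configurations (e.g. fixing some frames at low noise) map onto tasks such as image-to-video generation and interpolation.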