Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

January 7, 2025
Authors: Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
cs.AI

Abstract

Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using fewer than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
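
To make the idea of a 3D tracking video more concrete, the following is a minimal sketch of how such a control signal could be rendered and then manipulated for camera control. It is not the authors' implementation. The function names, the pinhole camera model, and the choice to color each point by its normalized first-frame coordinates are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np

def first_frame_colors(points0):
    """Hypothetical coloring: map each point's first-frame XYZ to an RGB value.

    Because a point keeps its color in every frame, the color acts as an
    identity that links the same 3D point across frames.
    """
    lo, hi = points0.min(axis=0), points0.max(axis=0)
    norm = (points0 - lo) / np.maximum(hi - lo, 1e-8)
    return (norm * 255).astype(np.uint8)

def render_tracking_video(tracks, colors, K, extrinsics, height, width):
    """Project colored 3D tracks into each frame (assumed pinhole camera).

    tracks:     (T, N, 3) world-space point positions per frame
    colors:     (N, 3) uint8 color per point, fixed over time
    K:          (3, 3) camera intrinsics
    extrinsics: (T, 4, 4) world-to-camera transform per frame
    """
    num_frames = tracks.shape[0]
    video = np.zeros((num_frames, height, width, 3), dtype=np.uint8)
    for t in range(num_frames):
        R, trans = extrinsics[t, :3, :3], extrinsics[t, :3, 3]
        cam = tracks[t] @ R.T + trans              # world -> camera space
        in_front = cam[:, 2] > 1e-6                # drop points behind the camera
        cam_f, col_f = cam[in_front], colors[in_front]
        pix = cam_f @ K.T                          # perspective projection
        u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
        v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
        inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
        # Write far points first so that nearer points are written last.
        order = np.argsort(-cam_f[inside, 2])
        video[t, v[inside][order], u[inside][order]] = col_f[inside][order]
    return video
```

Under these assumptions, camera control amounts to changing `extrinsics` while leaving `tracks` untouched, and object manipulation amounts to editing `tracks` for the selected points. The resulting video would then be supplied to the diffusion model as the control input.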
