Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control
January 7, 2025
Authors: Zekai Gu, Rui Yan, Jiahao Lu, Peng Li, Zhiyang Dou, Chenyang Si, Zhen Dong, Qifeng Liu, Cheng Lin, Ziwei Liu, Wenping Wang, Yuan Liu
cs.AI
Abstract
Diffusion models have demonstrated impressive performance in generating
high-quality videos from text prompts or images. However, precise control over
the video generation process, such as camera manipulation or content editing,
remains a significant challenge. Existing methods for controlled video
generation are typically limited to a single control type, lacking the
flexibility to handle diverse control demands. In this paper, we introduce
Diffusion as Shader (DaS), a novel approach that supports multiple video
control tasks within a unified architecture. Our key insight is that achieving
versatile video control necessitates leveraging 3D control signals, as videos
are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods
limited to 2D control signals, DaS leverages 3D tracking videos as control
inputs, making the video diffusion process inherently 3D-aware. This innovation
allows DaS to achieve a wide range of video controls by simply manipulating the
3D tracking videos. A further advantage of using 3D tracking videos is their
ability to effectively link frames, significantly enhancing the temporal
consistency of the generated videos. With just 3 days of fine-tuning on 8 H800
GPUs using fewer than 10k videos, DaS demonstrates strong control capabilities
across diverse tasks, including mesh-to-video generation, camera control,
motion transfer, and object manipulation.
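The abstract does not describe how a 3D tracking video is constructed, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each tracked 3D point receives a fixed color derived from its normalized first-frame coordinates, and that every frame is a pinhole projection of the points' current positions. All names (`colors_from_first_frame`, `project`, `render_tracking_video`, `cam_path`) are hypothetical, and the camera is modeled as a pure translation for brevity.

```python
from typing import Dict, List, Optional, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]
Color = Tuple[int, int, int]


def colors_from_first_frame(points: List[Point3D]) -> List[Color]:
    """Give each point a fixed RGB color from its normalized first-frame XYZ.

    Assumption: the color stays attached to the point for the whole video,
    which is what lets the same point be identified across frames.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    colors = []
    for p in points:
        rgb = []
        for i in range(3):
            span = maxs[i] - mins[i]
            t = (p[i] - mins[i]) / span if span > 0 else 0.0
            rgb.append(round(255 * t))
        colors.append((rgb[0], rgb[1], rgb[2]))
    return colors


def project(point: Point3D, focal: float, width: int, height: int,
            cam_shift: Point3D = (0.0, 0.0, 0.0)) -> Optional[Pixel]:
    """Pinhole projection; returns None if the point is behind the camera
    or falls outside the image."""
    x = point[0] - cam_shift[0]
    y = point[1] - cam_shift[1]
    z = point[2] - cam_shift[2]
    if z <= 0:
        return None
    u = round(focal * x / z + width / 2)
    v = round(focal * y / z + height / 2)
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None


def render_tracking_video(tracks: List[List[Point3D]], focal: float,
                          width: int, height: int,
                          cam_path: Optional[List[Point3D]] = None
                          ) -> List[Dict[Pixel, Color]]:
    """tracks[t][k] is the 3D position of point k at frame t.

    Returns one sparse frame per time step as a pixel -> color mapping.
    No depth test is done: a later point overwrites an earlier one.
    """
    colors = colors_from_first_frame(tracks[0])
    frames = []
    for t, pts in enumerate(tracks):
        shift = cam_path[t] if cam_path else (0.0, 0.0, 0.0)
        frame: Dict[Pixel, Color] = {}
        for p, c in zip(pts, colors):
            px = project(p, focal, width, height, shift)
            if px is not None:
                frame[px] = c
        frames.append(frame)
    return frames


# Two points over two frames; the second point moves away from the camera.
tracks = [[(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)],
          [(0.0, 0.0, 2.0), (1.0, 0.0, 4.0)]]
frames = render_tracking_video(tracks, focal=100.0, width=400, height=400)

# "Camera control" under this sketch: same points, different camera path.
moved = render_tracking_video(tracks, 100.0, 400, 400,
                              cam_path=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

Under these assumptions, the control tasks listed in the abstract differ only in how the tracks or the camera path are edited before rendering: changing `cam_path` corresponds to camera control, while editing the positions of selected points in `tracks` corresponds to object manipulation.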