GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking
January 5, 2025
Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
cs.AI
Abstract
4D video control is essential in video generation as it enables the use of
sophisticated lens techniques, such as multi-camera shooting and dolly zoom,
which are currently unsupported by existing methods. Training a video Diffusion
Transformer (DiT) directly to control 4D content requires expensive multi-view
videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS) that
optimizes a 4D representation and renders videos according to different 4D
elements, such as camera pose and object motion editing, we bring pseudo 4D
Gaussian fields to video generation. Specifically, we propose a novel framework
that constructs a pseudo 4D Gaussian field with dense 3D point tracking and
renders the Gaussian field for all video frames. Then we finetune a pretrained
DiT to generate videos following the guidance of the rendered video, dubbed
GS-DiT. To boost the training of GS-DiT, we also propose an efficient Dense
3D Point Tracking (D3D-PT) method for the pseudo 4D Gaussian field
construction. Our D3D-PT outperforms SpatialTracker, the state-of-the-art
sparse 3D point tracking method, in accuracy and accelerates the inference
speed by two orders of magnitude. During the inference stage, GS-DiT can
generate videos with the same dynamic content while adhering to different
camera parameters, addressing a significant limitation of current video
generation models. GS-DiT demonstrates strong generalization capabilities and
extends the 4D controllability of Gaussian splatting to video generation beyond
just camera poses. It supports advanced cinematic effects through the
manipulation of the Gaussian field and camera intrinsics, making it a powerful
tool for creative video production. Demos are available at
https://wkbian.github.io/Projects/GS-DiT/.
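The abstract highlights camera-intrinsic control (e.g., dolly zoom) as a key capability: because the pseudo 4D Gaussian field fixes the 3D content, the same scene can be re-rendered under different camera parameters. The toy sketch below (not the paper's code; all function and variable names are illustrative) uses a plain pinhole projection to show why dollying the camera back while increasing focal length keeps a subject's projected size constant, which is the geometric basis of the dolly-zoom effect.

```python
# Toy sketch (illustrative, not from GS-DiT): pinhole projection under
# different camera parameters. Demonstrates the dolly-zoom invariant:
# doubling the focal length while doubling the subject's depth to the
# camera leaves the subject's projected coordinates unchanged.

def project(point, focal, cam_z):
    """Project a 3D point (x, y, z) through a pinhole camera placed at
    (0, 0, cam_z) looking along +z, with focal length `focal`."""
    x, y, z = point
    depth = z - cam_z  # distance from camera center to the point
    return (focal * x / depth, focal * y / depth)

# A subject at depth 4 in world coordinates.
subject = (1.0, 0.5, 4.0)

# Shot 1: camera at the origin, focal length 1.
u1, v1 = project(subject, focal=1.0, cam_z=0.0)

# Shot 2 (dolly zoom): dolly the camera back by 4 units (depth doubles
# from 4 to 8) and double the focal length. The subject's projected
# position and size are preserved, while background perspective shifts.
u2, v2 = project(subject, focal=2.0, cam_z=-4.0)

print((u1, v1))  # (0.25, 0.125)
print((u2, v2))  # (0.25, 0.125)
```

In GS-DiT, the analogous step is rendering the Gaussian field under edited camera poses and intrinsics; the rendered video then conditions the fine-tuned DiT, so the generated output inherits this camera control.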