GS-DiT: Advancing Video Generation with Pseudo 4D Gaussian Fields through Efficient Dense 3D Point Tracking

January 5, 2025
Authors: Weikang Bian, Zhaoyang Huang, Xiaoyu Shi, Yijin Li, Fu-Yun Wang, Hongsheng Li
cs.AI

Abstract

4D video control is essential in video generation because it enables sophisticated lens techniques, such as multi-camera shooting and dolly zoom, that existing methods do not support. Training a video Diffusion Transformer (DiT) directly to control 4D content requires expensive multi-view videos. Inspired by Monocular Dynamic novel View Synthesis (MDVS), which optimizes a 4D representation and renders videos according to different 4D elements such as camera pose and object motion editing, we bring pseudo 4D Gaussian fields to video generation. Specifically, we propose a novel framework that constructs a pseudo 4D Gaussian field with dense 3D point tracking and renders the Gaussian field for all video frames. We then finetune a pretrained DiT to generate videos following the guidance of the rendered video; we call the resulting model GS-DiT. To accelerate the training of GS-DiT, we also propose an efficient Dense 3D Point Tracking (D3D-PT) method for constructing the pseudo 4D Gaussian field. D3D-PT outperforms SpatialTracker, the state-of-the-art sparse 3D point tracking method, in accuracy and speeds up inference by two orders of magnitude. At inference time, GS-DiT can generate videos with the same dynamic content under different camera parameters, addressing a significant limitation of current video generation models. GS-DiT demonstrates strong generalization and extends the 4D controllability of Gaussian splatting to video generation beyond camera poses alone. It supports advanced cinematic effects through manipulation of the Gaussian field and the camera intrinsics, making it a powerful tool for creative video production. Demos are available at https://wkbian.github.io/Projects/GS-DiT/.
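
The abstract mentions dolly zoom as an effect obtained by manipulating camera pose and intrinsics together. The sketch below illustrates only the underlying pinhole-camera geometry, not the GS-DiT pipeline: it projects individual 3D points rather than rendering a Gaussian field, and the names `Pinhole` and `dolly_zoom` are hypothetical, introduced here for illustration. It assumes a camera looking down the +z axis and a subject at a known depth.

```python
from dataclasses import dataclass


@dataclass
class Pinhole:
    """Hypothetical pinhole camera on the z axis, looking toward +z."""
    focal: float        # focal length in pixels (an intrinsic)
    cx: float           # principal point, x
    cy: float           # principal point, y
    cam_z: float = 0.0  # camera position along the optical axis (pose)

    def project(self, point):
        """Project a 3D point (x, y, z) to pixel coordinates (u, v)."""
        x, y, z = point
        depth = z - self.cam_z
        if depth <= 0:
            raise ValueError("point is behind the camera")
        return (self.focal * x / depth + self.cx,
                self.focal * y / depth + self.cy)


def dolly_zoom(cam, subject_depth, pull_back):
    """Move the camera back by `pull_back` and scale the focal length so
    that points at `subject_depth` (measured from the original camera)
    keep the same image position."""
    new_depth = subject_depth + pull_back
    return Pinhole(
        focal=cam.focal * new_depth / subject_depth,
        cx=cam.cx,
        cy=cam.cy,
        cam_z=cam.cam_z - pull_back,
    )


if __name__ == "__main__":
    cam = Pinhole(focal=500.0, cx=320.0, cy=240.0)
    zoomed = dolly_zoom(cam, subject_depth=4.0, pull_back=2.0)

    subject = (1.0, 0.5, 4.0)      # a point on the subject
    background = (1.0, 0.5, 20.0)  # a point far behind the subject

    print("subject:   ", cam.project(subject), zoomed.project(subject))
    print("background:", cam.project(background), zoomed.project(background))
```

In this sketch the subject point keeps its image position while the background point moves, which is the characteristic dolly-zoom look. In a renderer built on a Gaussian field, the same pose and focal-length change would be applied to the rendering camera instead of to individual points.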
