AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers

November 27, 2024
Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
cs.AI

Abstract

Numerous works have recently integrated 3D camera control into foundational text-to-video models, but the resulting camera control is often imprecise, and video generation quality suffers. In this work, we analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation without compromising synthesis quality. First, we determine that motion induced by camera movements in videos is low-frequency in nature. This motivates us to adjust the training and test-time pose conditioning schedules, accelerating training convergence while improving visual and motion quality. Then, by probing the representations of an unconditional video diffusion transformer, we observe that it implicitly performs camera pose estimation under the hood, and that only a subset of its layers contains the camera information. This suggests limiting the injection of camera conditioning to a subset of the architecture to prevent interference with other video features, leading to a 4x reduction in training parameters, improved training speed, and 10% higher visual quality. Finally, we complement the typical dataset for camera control learning with a curated dataset of 20K diverse dynamic videos with stationary cameras. This helps the model disambiguate camera motion from scene motion and improves the dynamics of generated pose-conditioned videos. We combine these findings to design the Advanced 3D Camera Control (AC3D) architecture, the new state-of-the-art model for generative video modeling with camera control.
