AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers
November 27, 2024
Authors: Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov
cs.AI
Abstract
Numerous works have recently integrated 3D camera control into foundational
text-to-video models, but the resulting camera control is often imprecise, and
video generation quality suffers. In this work, we analyze camera motion from a
first principles perspective, uncovering insights that enable precise 3D camera
manipulation without compromising synthesis quality. First, we determine that
motion induced by camera movements in videos is low-frequency in nature. This
motivates us to adjust train and test pose conditioning schedules, accelerating
training convergence while improving visual and motion quality. Then, by
probing the representations of an unconditional video diffusion transformer, we
observe that it implicitly performs camera pose estimation under the hood, and
that only a subset of its layers contains the camera information. This
suggests limiting the injection of camera conditioning to a subset of the
architecture to prevent interference with other video features, leading to a 4x
reduction in training parameters, improved training speed, and 10% higher visual
quality. Finally, we complement the typical dataset for camera control learning
with a curated dataset of 20K diverse dynamic videos with stationary cameras.
This helps the model disambiguate camera motion from scene motion and
improves the dynamics of generated pose-conditioned videos. We
combine these findings to design the Advanced 3D Camera Control (AC3D)
architecture, the new state-of-the-art model for generative video modeling with
camera control.
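
The two architectural ideas in the abstract, restricting the pose conditioning schedule and injecting camera conditioning into only a subset of the architecture, can be pictured as a boolean mask over (denoising step, transformer block) pairs. The sketch below is illustrative only and is not the authors' implementation. The abstract does not say which layers or which part of the schedule are conditioned, so the code assumes the earliest blocks and the noisiest steps. The function name and all parameters are hypothetical.

```python
def camera_conditioning_mask(num_layers, num_steps, cond_layers, cond_step_fraction):
    """Return mask[step][layer], True where the camera pose signal is injected.

    Assumptions (not taken from the abstract):
    - steps are ordered from the noisiest (index 0) to the cleanest;
    - only the first `cond_layers` blocks receive camera conditioning;
    - only the first `cond_step_fraction` of the steps receive it, on the
      premise that low-frequency content is settled while noise is high.
    """
    if not 0 <= cond_layers <= num_layers:
        raise ValueError("cond_layers must lie in [0, num_layers]")
    if not 0.0 <= cond_step_fraction <= 1.0:
        raise ValueError("cond_step_fraction must lie in [0, 1]")

    # Number of denoising steps that receive the pose signal.
    cond_steps = int(round(num_steps * cond_step_fraction))

    return [
        [step < cond_steps and layer < cond_layers for layer in range(num_layers)]
        for step in range(num_steps)
    ]


if __name__ == "__main__":
    # Hypothetical configuration: 8 blocks, 10 steps, condition 2 blocks
    # during the noisiest half of the schedule.
    mask = camera_conditioning_mask(8, 10, cond_layers=2, cond_step_fraction=0.5)
    for step, row in enumerate(mask):
        print(step, "".join("C" if flag else "." for flag in row))
```

Each printed row is one denoising step, and a `C` marks a block that would receive the camera signal under these assumed settings. All other blocks and steps run unconditioned, which is the kind of restriction the abstract credits with fewer training parameters and less interference with other video features.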