World-consistent Video Diffusion with Explicit 3D Modeling
December 2, 2024
Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu
cs.AI
Abstract
Recent advancements in diffusion models have set new benchmarks in image and
video generation, enabling realistic visual synthesis across single- and
multi-frame contexts. However, these models still struggle with efficiently and
explicitly generating 3D-consistent content. To address this, we propose
World-consistent Video Diffusion (WVD), a novel framework that incorporates
explicit 3D supervision using XYZ images, which encode global 3D coordinates
for each image pixel. More specifically, we train a diffusion transformer to
learn the joint distribution of RGB and XYZ frames. This approach supports
multi-task adaptability via a flexible inpainting strategy. For example, WVD
can estimate XYZ frames from ground-truth RGB or generate novel RGB frames
using XYZ projections along a specified camera trajectory. In doing so, WVD
unifies tasks like single-image-to-3D generation, multi-view stereo, and
camera-controlled video generation. Our approach demonstrates competitive
performance across multiple benchmarks, providing a scalable solution for
3D-consistent video and image generation with a single pretrained model.
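The abstract's central representation is the XYZ image: an image-shaped array that stores, for every pixel, that point's global (world-frame) 3D coordinates. As a minimal sketch of how such an image could be constructed from a depth map and a known camera, one can unproject each pixel through the intrinsics and transform the result by the camera-to-world pose. The function name, and the assumption that depth and pose are available, are illustrative; the paper itself only specifies that XYZ images encode per-pixel global coordinates.

```python
import numpy as np

def depth_to_xyz_image(depth, K, cam_to_world):
    """Sketch: unproject a depth map into a per-pixel world-coordinate (XYZ) image.

    depth:        (H, W) depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    returns:      (H, W, 3) world coordinates for every pixel
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Homogeneous pixel coordinates (u, v, 1) per pixel.
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project to camera-frame rays, then scale by depth.
    rays_cam = pix @ np.linalg.inv(K).T
    pts_cam = rays_cam * depth[..., None]
    # Rigid transform into the shared world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t
```

Because every frame's XYZ image lives in the same world frame, corresponding surface points carry identical XYZ values across views, which is what makes the representation a direct 3D-consistency signal for the diffusion model.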