World-consistent Video Diffusion with Explicit 3D Modeling
December 2, 2024
Authors: Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu
cs.AI
Abstract
Recent advancements in diffusion models have set new benchmarks in image and
video generation, enabling realistic visual synthesis across single- and
multi-frame contexts. However, these models still struggle with efficiently and
explicitly generating 3D-consistent content. To address this, we propose
World-consistent Video Diffusion (WVD), a novel framework that incorporates
explicit 3D supervision using XYZ images, which encode global 3D coordinates
for each image pixel. More specifically, we train a diffusion transformer to
learn the joint distribution of RGB and XYZ frames. This approach supports
multi-task adaptability via a flexible inpainting strategy. For example, WVD
can estimate XYZ frames from ground-truth RGB or generate novel RGB frames
using XYZ projections along a specified camera trajectory. In doing so, WVD
unifies tasks like single-image-to-3D generation, multi-view stereo, and
camera-controlled video generation. Our approach demonstrates competitive
performance across multiple benchmarks, providing a scalable solution for
3D-consistent video and image generation with a single pretrained model.
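
The core representation and the multi-task recipe described above can be made concrete with two small sketches. First, a minimal sketch of how an "XYZ image" could be constructed, assuming a standard pinhole camera model with known intrinsics and pose; the function name and arguments are illustrative, not taken from the paper:

```python
import numpy as np

def depth_to_xyz_image(depth, K, cam_to_world):
    """Unproject a depth map into an "XYZ image": an (H, W, 3) map whose
    channels store each pixel's global (world-frame) 3D coordinates.

    depth:        (H, W) per-pixel depth along the camera z-axis
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsic matrix
    """
    H, W = depth.shape
    # Pixel grid in homogeneous image coordinates, flattened to (3, H*W).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Back-project to camera space: X_cam = depth * K^-1 [u, v, 1]^T.
    rays = np.linalg.inv(K) @ pix            # (3, H*W)
    cam_pts = rays * depth.reshape(1, -1)    # (3, H*W)

    # Transform to world space with the camera pose.
    cam_pts_h = np.vstack([cam_pts, np.ones((1, H * W))])  # (4, H*W)
    world_pts = (cam_to_world @ cam_pts_h)[:3]             # (3, H*W)

    return world_pts.T.reshape(H, W, 3)
```

Second, the flexible inpainting strategy amounts to choosing which frames of the joint (RGB, XYZ) sequence are observed and which are generated. A hypothetical per-frame masking helper, with task names paraphrasing the abstract:

```python
import numpy as np

def task_condition_masks(num_frames, task):
    """Per-frame conditioning masks over the joint (RGB, XYZ) sequence.
    1 = frame is observed and kept clean; 0 = frame is noised and generated."""
    rgb = np.zeros(num_frames)
    xyz = np.zeros(num_frames)
    if task == "xyz_from_rgb":          # estimate XYZ from ground-truth RGB
        rgb[:] = 1                      # (multi-view stereo fits here too)
    elif task == "rgb_from_xyz":        # camera-controlled video generation:
        xyz[:] = 1                      # XYZ given along a specified trajectory
    elif task == "single_image_to_3d":  # only the first RGB frame is observed
        rgb[0] = 1
    return rgb, xyz
```

Under such a scheme, masked (0) frames would receive diffusion noise during training and sampling while observed (1) frames stay clean, so a single pretrained model can cover XYZ estimation, camera-controlled RGB generation, and single-image-to-3D.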