Wonderland: Navigating 3D Scenes from a Single Image
December 16, 2024
Authors: Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
cs.AI
Abstract
This paper addresses a challenging question: How can we efficiently create
high-quality, wide-scope 3D scenes from a single arbitrary image? Existing
methods face several constraints, such as requiring multi-view data,
time-consuming per-scene optimization, low visual quality in backgrounds, and
distorted reconstructions in unseen areas. We propose a novel pipeline to
overcome these limitations. Specifically, we introduce a large-scale
reconstruction model that uses latents from a video diffusion model to predict
3D Gaussian Splattings for the scenes in a feed-forward manner. The video
diffusion model is designed to create videos precisely following specified
camera trajectories, allowing it to generate compressed video latents that
contain multi-view information while maintaining 3D consistency. We train the
3D reconstruction model to operate on the video latent space with a progressive
training strategy, enabling the efficient generation of high-quality,
wide-scope, and generic 3D scenes. Extensive evaluations across various
datasets demonstrate that our model significantly outperforms existing methods
for single-view 3D scene generation, particularly with out-of-domain images.
For the first time, we demonstrate that a 3D reconstruction model can be
effectively built upon the latent space of a diffusion model to realize
efficient 3D scene generation.
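To make the feed-forward idea in the abstract concrete, below is a minimal, illustrative sketch (not the authors' code) of the second stage: mapping compressed, camera-conditioned video latents directly to per-pixel 3D Gaussian parameters without any per-scene optimization. The module name `LatentToGaussians`, the latent shape, and the 14-parameter Gaussian layout are all assumptions for illustration only.

```python
# Hypothetical sketch of a feed-forward latent-to-Gaussians head, assuming
# video latents of shape (B, C, T, H, W) produced by a camera-guided video
# diffusion model from a single image and a target camera trajectory.

import torch
import torch.nn as nn


class LatentToGaussians(nn.Module):
    """Assumed feed-forward head: video latents -> 3D Gaussian parameters."""

    def __init__(self, latent_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        # Each Gaussian: 3 (mean) + 3 (scale) + 4 (rotation quaternion)
        #                + 3 (color) + 1 (opacity) = 14 parameters.
        self.head = nn.Sequential(
            nn.Conv3d(latent_dim, hidden_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(hidden_dim, 14, kernel_size=1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # latents: (B, C, T, H, W) compressed latents covering multiple views.
        # Returns one Gaussian per latent "pixel": (B, T*H*W, 14).
        params = self.head(latents)              # (B, 14, T, H, W)
        return params.flatten(2).transpose(1, 2) # (B, T*H*W, 14)


if __name__ == "__main__":
    # Stand-in tensor for the video latents; in the paper's pipeline these
    # would come from the trajectory-conditioned video diffusion model.
    fake_latents = torch.randn(1, 16, 8, 32, 32)  # (B, C, T, H, W)
    gaussians = LatentToGaussians(latent_dim=16)(fake_latents)
    print(gaussians.shape)                        # torch.Size([1, 8192, 14])
```

Operating on compressed latents rather than decoded frames is what the abstract credits for efficiency: the reconstruction head sees multi-view information at a fraction of the pixel-space resolution, so a single forward pass can emit the full Gaussian set for a wide-scope scene.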