Wonderland: Navigating 3D Scenes from a Single Image
December 16, 2024
Authors: Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
cs.AI
Abstract
This paper addresses a challenging question: How can we efficiently create
high-quality, wide-scope 3D scenes from a single arbitrary image? Existing
methods face several constraints, such as requiring multi-view data,
time-consuming per-scene optimization, low visual quality in backgrounds, and
distorted reconstructions in unseen areas. We propose a novel pipeline to
overcome these limitations. Specifically, we introduce a large-scale
reconstruction model that uses latents from a video diffusion model to predict
3D Gaussian Splattings for the scenes in a feed-forward manner. The video
diffusion model is designed to create videos precisely following specified
camera trajectories, allowing it to generate compressed video latents that
contain multi-view information while maintaining 3D consistency. We train the
3D reconstruction model to operate on the video latent space with a progressive
training strategy, enabling the efficient generation of high-quality,
wide-scope, and generic 3D scenes. Extensive evaluations across various
datasets demonstrate that our model significantly outperforms existing methods
for single-view 3D scene generation, particularly with out-of-domain images.
For the first time, we demonstrate that a 3D reconstruction model can be
effectively built upon the latent space of a diffusion model to realize
efficient 3D scene generation.
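To make the described pipeline concrete, below is a minimal PyTorch sketch of the feed-forward idea in the abstract: a reconstruction head that maps compressed video-diffusion latents, together with a camera-trajectory embedding, to per-pixel 3D Gaussian Splatting parameters in a single forward pass. The module name `LatentToGaussians`, the tensor shapes, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: feed-forward prediction of 3D Gaussian parameters
# from compressed video latents plus camera-pose embeddings, as described at a
# high level in the abstract. Shapes, layer choices, and names are assumptions.
import torch
import torch.nn as nn


class LatentToGaussians(nn.Module):
    """Hypothetical head: video latents + camera poses -> per-pixel Gaussians."""

    def __init__(self, latent_dim=16, cam_dim=12, hidden=128):
        super().__init__()
        # Fuse latent channels with a broadcast camera-pose embedding.
        self.cam_proj = nn.Linear(cam_dim, hidden)
        self.lat_proj = nn.Conv2d(latent_dim, hidden, kernel_size=3, padding=1)
        # 14 channels per pixel: xyz(3) + scale(3) + rotation quaternion(4)
        # + opacity(1) + RGB color(3).
        self.head = nn.Sequential(
            nn.GELU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(hidden, 14, kernel_size=1),
        )

    def forward(self, latents, cams):
        # latents: (B, T, C, H, W) compressed video latents; cams: (B, T, cam_dim).
        B, T, C, H, W = latents.shape
        x = self.lat_proj(latents.reshape(B * T, C, H, W))
        cam = self.cam_proj(cams.reshape(B * T, -1))[..., None, None]
        params = self.head(x + cam)                      # (B*T, 14, H, W)
        params = params.reshape(B, T, 14, H, W)
        # One Gaussian per latent pixel per frame, predicted without any
        # per-scene optimization loop.
        return {
            "xyz": params[:, :, 0:3],
            "scale": torch.exp(params[:, :, 3:6]),       # positive scales
            "rotation": nn.functional.normalize(params[:, :, 6:10], dim=2),
            "opacity": torch.sigmoid(params[:, :, 10:11]),
            "rgb": torch.sigmoid(params[:, :, 11:14]),
        }


# Example: 8 latent frames at 32x32 with 16 channels, 12-D flattened camera pose.
model = LatentToGaussians()
gaussians = model(torch.randn(2, 8, 16, 32, 32), torch.randn(2, 8, 12))
print(gaussians["xyz"].shape)  # torch.Size([2, 8, 3, 32, 32])
```

The design point the abstract emphasizes is that the reconstruction model operates directly on the compressed video latent space rather than on decoded frames, which is what keeps the 3D Gaussian prediction feed-forward and efficient compared with per-scene optimization.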