Wonderland: Navigating 3D Scenes from a Single Image

Abstract

This paper addresses a challenging question: how can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints: they require multi-view data, rely on time-consuming per-scene optimization, produce low visual quality in backgrounds, or yield distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splatting representations of scenes in a feed-forward manner. The video diffusion model is designed to create videos that precisely follow specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly on out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
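As a rough illustration of the feed-forward step described in the abstract, the sketch below maps a grid of video latents to one 3D Gaussian per latent location. It is a toy under stated assumptions, not the paper's implementation: the abstract does not specify the reconstruction architecture, so a single linear layer stands in for it, and the 14-value parameter layout (position, scale, rotation quaternion, opacity, color) is a commonly used Gaussian Splatting parameterization assumed here. All names (`decode_gaussian`, `latents_to_gaussians`, `make_demo_inputs`) are hypothetical.

```python
import math
import random

# Assumed layout: 3 position + 3 scale + 4 rotation + 1 opacity + 3 color.
PARAMS_PER_GAUSSIAN = 14


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_gaussian(latent_vec, weights):
    """Map one latent vector to one Gaussian with a single linear layer.

    The linear layer is a placeholder for the (unspecified) reconstruction model.
    """
    raw = [sum(w * x for w, x in zip(row, latent_vec)) for row in weights]
    quat_norm = math.sqrt(sum(q * q for q in raw[6:10])) or 1.0
    return {
        "position": raw[0:3],
        "scale": [math.exp(s) for s in raw[3:6]],           # keep scales positive
        "rotation": [q / quat_norm for q in raw[6:10]],     # unit quaternion
        "opacity": sigmoid(raw[10]),                        # in (0, 1)
        "color": [sigmoid(c) for c in raw[11:14]],          # RGB in (0, 1)
    }


def latents_to_gaussians(latents, weights):
    """latents: nested list [frames][height][width][channels] -> flat list of Gaussians."""
    return [
        decode_gaussian(vec, weights)
        for frame in latents
        for row in frame
        for vec in row
    ]


def make_demo_inputs(frames=2, height=3, width=4, channels=8, seed=0):
    """Random stand-ins for video latents and layer weights (illustration only)."""
    rng = random.Random(seed)
    latents = [
        [[[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(width)]
         for _ in range(height)]
        for _ in range(frames)
    ]
    weights = [
        [rng.uniform(-0.1, 0.1) for _ in range(channels)]
        for _ in range(PARAMS_PER_GAUSSIAN)
    ]
    return latents, weights


if __name__ == "__main__":
    demo_latents, demo_weights = make_demo_inputs()
    gaussians = latents_to_gaussians(demo_latents, demo_weights)
    print(len(gaussians), "Gaussians predicted in one forward pass")
```

The point of the sketch is the data flow: the scene representation comes from one pass over compressed latents rather than from per-scene optimization, and the number of Gaussians scales with the latent grid size rather than with the pixel count of the decoded video.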
