Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

Abstract

Existing feed-forward image-to-3D methods mainly rely on 2D multi-view diffusion models, which cannot guarantee 3D consistency. These methods easily collapse when the prompt view direction changes, and they mainly handle object-centric prompt images. In this paper, we propose DiffusionGS, a novel single-stage 3D diffusion model for object and scene generation from a single view. DiffusionGS directly outputs 3D Gaussian point clouds at each timestep to enforce view consistency, which allows the model to generate robustly from prompt views of any direction, beyond object-centric inputs. In addition, to improve the capability and generalization ability of DiffusionGS, we scale up the 3D training data by developing a scene-object mixed training strategy. Experiments show that our method achieves better generation quality (2.20 dB higher in PSNR and 23.25 lower in FID) and over 5x faster speed (~6s on an A100 GPU) than SOTA methods. The user study and text-to-3D applications also reveal the practical value of our method. Our project page at https://caiyuanhao1998.github.io/project/DiffusionGS/ shows the video and interactive generation results.
