MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

Abstract

Repurposing pre-trained diffusion models has proven effective for novel view synthesis (NVS). However, these methods are mostly limited to a single object; applying them directly to compositional multi-object scenarios yields inferior results, notably incorrect object placement and inconsistent shape and appearance under novel views. How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS, which enhances the structural awareness of the view-conditioned diffusion model for multi-object NVS in terms of model inputs, auxiliary tasks, and training strategy. First, we inject structure-aware features, including depth and object masks, into the denoising U-Net to improve the model's comprehension of object instances and their spatial relationships. Second, we introduce an auxiliary task that requires the model to simultaneously predict novel view object masks, further improving its ability to differentiate and place objects. Finally, we conduct an in-depth analysis of the diffusion sampling process and devise a structure-guided timestep sampling scheduler for training, which balances the learning of global object placement and fine-grained detail recovery. To systematically evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement alongside existing image-level NVS metrics. Extensive experiments on challenging synthetic and realistic datasets demonstrate that our method generalizes well and produces consistent novel view synthesis results, highlighting its potential to guide future 3D-aware multi-object NVS tasks.
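The abstract does not specify how the structure-guided timestep sampling scheduler is defined. As a rough illustration of the general idea only, the sketch below assumes that high-noise (large) timesteps mostly shape coarse layout while low-noise (small) timesteps refine detail, and draws training timesteps from a Gaussian whose mean moves from large to smaller timesteps as training progresses. The function name `sample_timestep`, the linear annealing, and all default values are hypothetical choices for this example and are not taken from the paper.

```python
import random


def sample_timestep(progress, num_timesteps=1000, start_mean=0.85,
                    end_mean=0.5, std=0.2, rng=random):
    """Draw one training timestep in [0, num_timesteps - 1].

    progress   : fraction of training completed, in [0, 1].
    start_mean : assumed mean (as a fraction of the range) at the start,
                 biased toward large, high-noise timesteps.
    end_mean   : assumed mean at the end of training.
    std        : assumed standard deviation, as a fraction of the range.
    rng        : the random module or a random.Random instance.
    """
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    # Linearly anneal the mean of the sampling distribution (an assumption).
    mean_frac = start_mean + (end_mean - start_mean) * progress
    scale = num_timesteps - 1
    t = rng.gauss(mean_frac * scale, std * scale)
    # Round to an integer step and clip to the valid range.
    return min(scale, max(0, round(t)))


if __name__ == "__main__":
    rng = random.Random(0)
    early = [sample_timestep(0.0, rng=rng) for _ in range(5)]
    late = [sample_timestep(1.0, rng=rng) for _ in range(5)]
    print("early-training samples:", early)
    print("late-training samples:", late)
```

In a training loop, the sampled value would replace the usual uniformly drawn timestep; whether a Gaussian, a linear schedule, or these particular means match the authors' design cannot be determined from the abstract alone.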
