Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion

January 30, 2025
Authors: Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus
cs.AI

Abstract

Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or 3D Gaussians to achieve multi-view consistent scene appearance and geometry. In this paper, we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning both to augment visual features with spatial information from different viewpoints and to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as in multi-view stereo and video depth estimation.
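
To make the two conditioning mechanisms named in the abstract more concrete, the sketch below shows one plausible way to wire them together: a per-pixel raymap (ray origins and directions derived from camera pose and intrinsics) concatenated with image features, and a learnable task embedding that steers a shared denoiser towards either RGB or depth generation. This is a minimal PyTorch illustration under stated assumptions, not the authors' MVGD implementation; the module names, feature sizes, and the toy denoiser are hypothetical.

```python
# Minimal sketch (not the authors' code) of raymap conditioning and
# learnable task embeddings for a multi-task diffusion denoiser.
import torch
import torch.nn as nn


def build_raymap(K, cam_to_world, height, width):
    """Per-pixel ray origins and directions, shape (B, 6, H, W), in world frame."""
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)          # (H, W, 3) homogeneous pixels
    dirs_cam = torch.einsum("bij,hwj->bhwi", torch.inverse(K), pix)   # rays in camera frame
    R, t = cam_to_world[:, :3, :3], cam_to_world[:, :3, 3]
    dirs = torch.einsum("bij,bhwj->bhwi", R, dirs_cam)                # rotate into world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origins = t[:, None, None, :].expand_as(dirs)                     # camera center per pixel
    return torch.cat([origins, dirs], dim=-1).permute(0, 3, 1, 2)


class TaskConditionedDenoiser(nn.Module):
    """Toy denoiser: noisy features + raymap in, task-specific noise estimate out."""

    def __init__(self, feat_dim=64, num_tasks=2):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, feat_dim)           # 0 = RGB, 1 = depth (assumed)
        self.encode = nn.Conv2d(feat_dim + 6, feat_dim, 3, padding=1)
        self.denoise = nn.Sequential(
            nn.GroupNorm(8, feat_dim), nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )

    def forward(self, noisy_feats, raymap, task_id):
        x = self.encode(torch.cat([noisy_feats, raymap], dim=1))      # raymap conditioning
        x = x + self.task_embed(task_id)[:, :, None, None]            # steer towards RGB or depth
        return self.denoise(x)


if __name__ == "__main__":
    B, H, W = 2, 32, 32
    K = torch.eye(3).expand(B, 3, 3).clone()
    K[:, 0, 0] = K[:, 1, 1] = 32.0
    K[:, 0, 2], K[:, 1, 2] = W / 2, H / 2
    pose = torch.eye(4).expand(B, 4, 4).clone()                       # identity camera-to-world pose
    raymap = build_raymap(K, pose, H, W)
    model = TaskConditionedDenoiser()
    out = model(torch.randn(B, 64, H, W), raymap, torch.tensor([1, 1]))
    print(out.shape)  # torch.Size([2, 64, 32, 32])
```

In this reading, the raymap injects the novel viewpoint's geometry directly at the pixel level, while the task embedding lets a single set of denoiser weights serve both the image and depth modalities described in the abstract.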
