MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes
December 16, 2024
Authors: Ruijie Lu, Yixin Chen, Junfeng Ni, Baoxiong Jia, Yu Liu, Diwen Wan, Gang Zeng, Siyuan Huang
cs.AI
Abstract
Repurposing pre-trained diffusion models has been proven to be effective for
novel view synthesis (NVS). However, these methods are mostly limited to a
single object; directly
applying such methods to compositional multi-object scenarios yields inferior
results, especially incorrect object placement and inconsistent shape and
appearance under novel views. How to enhance and systematically evaluate the
cross-view consistency of such models remains under-explored. To address this
issue, we propose MOVIS to enhance the structural awareness of the
view-conditioned diffusion model for multi-object NVS in terms of model inputs,
auxiliary tasks, and training strategy. First, we inject structure-aware
features, including depth and object mask, into the denoising U-Net to enhance
the model's comprehension of object instances and their spatial relationships.
Second, we introduce an auxiliary task requiring the model to simultaneously
predict novel view object masks, further improving the model's capability in
differentiating and placing objects. Finally, we conduct an in-depth analysis
of the diffusion sampling process and carefully devise a structure-guided
timestep sampling scheduler during training, which balances the learning of
global object placement and fine-grained detail recovery. To systematically
evaluate the plausibility of synthesized images, we propose to assess
cross-view consistency and novel view object placement alongside existing
image-level NVS metrics. Extensive experiments on challenging synthetic and
realistic datasets demonstrate that our method exhibits strong generalization
capabilities and produces consistent novel view synthesis, highlighting its
potential to guide future 3D-aware multi-object NVS tasks.
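
The abstract describes two architectural ingredients: injecting depth and object-mask maps into the denoising U-Net, and an auxiliary head that predicts novel-view object masks. Below is a minimal sketch of how such conditioning and an auxiliary head could be wired up; the class name, channel layout, loss weighting, and the base_unet interface are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureAwareUNet(nn.Module):
    """Hypothetical wrapper: feeds depth and instance-mask maps into a
    denoising U-Net and adds an auxiliary head predicting novel-view
    object masks. Names and channel counts are illustrative only."""

    def __init__(self, base_unet: nn.Module, latent_ch: int = 4,
                 struct_ch: int = 2, num_instances: int = 16):
        super().__init__()
        self.base_unet = base_unet
        # Fuse [noisy latent, depth, instance mask] back to the channel
        # count the pre-trained U-Net expects.
        self.in_proj = nn.Conv2d(latent_ch + struct_ch, latent_ch, kernel_size=1)
        # Auxiliary head: per-instance mask logits in the novel view.
        self.mask_head = nn.Conv2d(latent_ch, num_instances, kernel_size=1)

    def forward(self, noisy_latent, timestep, cond_emb, depth, inst_mask):
        x = torch.cat([noisy_latent, depth, inst_mask], dim=1)
        x = self.in_proj(x)
        eps_pred = self.base_unet(x, timestep, cond_emb)   # predicted noise
        mask_logits = self.mask_head(eps_pred)             # auxiliary mask task
        return eps_pred, mask_logits


def training_loss(eps_pred, eps_gt, mask_logits, mask_gt, lambda_mask=0.5):
    """Sketch of a combined objective: standard noise regression plus an
    auxiliary mask term; lambda_mask is an assumed weighting."""
    noise_loss = F.mse_loss(eps_pred, eps_gt)
    mask_loss = F.cross_entropy(mask_logits, mask_gt)  # mask_gt: per-pixel class indices
    return noise_loss + lambda_mask * mask_loss
```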
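The abstract also mentions a structure-guided timestep sampling scheduler that emphasizes large (noisy) timesteps early in training, when the model learns global object placement, and shifts toward smaller timesteps later for fine-grained detail recovery. One rough way such a scheduler could look is sketched below; the linear drift and Gaussian shape are assumptions, not the paper's exact schedule.

```python
import numpy as np

def sample_timestep(step: int, total_steps: int,
                    num_train_timesteps: int = 1000,
                    rng: np.random.Generator | None = None) -> int:
    """Illustrative structure-guided timestep sampler: probability mass
    starts near large timesteps (global placement) and drifts toward
    small timesteps (detail recovery) as training progresses."""
    rng = rng or np.random.default_rng()
    progress = step / max(total_steps, 1)               # 0 -> 1 over training
    mean = (1.0 - progress) * 0.85 + progress * 0.35    # fraction of T (assumed)
    frac = float(np.clip(rng.normal(mean, 0.15), 0.0, 1.0))
    return int(frac * (num_train_timesteps - 1))
```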