

MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

December 4, 2024
Authors: Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng
cs.AI

Abstract

This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques, or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
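To make the core idea concrete, the sketch below shows one plausible reading of the described multi-instance attention: each 3D instance is represented by its own sequence of latent tokens, and the tokens of all instances in a scene are attended to jointly so that inter-object interactions and spatial coherence can be modeled during denoising. This is an illustrative assumption, not the authors' implementation; the class name `MultiInstanceAttention` and the tensor layout are hypothetical.

```python
# Minimal sketch of a multi-instance attention layer (illustrative, not MIDI's code).
# Assumption: each instance contributes a sequence of latent tokens, and cross-instance
# interaction is modeled by letting every token attend over the concatenation of all
# instances' tokens within the same scene.
import torch
import torch.nn as nn


class MultiInstanceAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (batch, num_instances, tokens_per_instance, dim)"""
        b, n, t, d = tokens.shape
        # Flatten the instance axis so queries from one instance can attend to
        # keys/values from every instance in the scene.
        scene_tokens = tokens.reshape(b, n * t, d)
        attended, _ = self.attn(scene_tokens, scene_tokens, scene_tokens)
        out = self.norm(scene_tokens + attended)  # residual connection + layer norm
        return out.reshape(b, n, t, d)


# Usage sketch: a scene with 4 instances, 256 latent tokens each, 512-dim features.
if __name__ == "__main__":
    layer = MultiInstanceAttention(dim=512)
    latents = torch.randn(2, 4, 256, 512)
    print(layer(latents).shape)  # torch.Size([2, 4, 256, 512])
```

In this reading, swapping per-instance self-attention for scene-wide attention is what lets a single denoising pass place all objects consistently, instead of generating them one by one and composing afterwards.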

