MV-Adapter: Multi-view Consistent Image Generation Made Easy
December 4, 2024
Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng
cs.AI
Abstract
Existing multi-view image generation methods often make invasive
modifications to pre-trained text-to-image (T2I) models and require full
fine-tuning, leading to (1) high computational costs, especially with large
base models and high-resolution images, and (2) degradation in image quality
due to optimization difficulties and scarce high-quality 3D data. In this
paper, we propose the first adapter-based solution for multi-view image
generation, and introduce MV-Adapter, a versatile plug-and-play adapter that
enhances T2I models and their derivatives without altering the original network
structure or feature space. By updating fewer parameters, MV-Adapter enables
efficient training and preserves the prior knowledge embedded in pre-trained
models, mitigating overfitting risks. To efficiently model the 3D geometric
knowledge within the adapter, we introduce designs that include duplicated
self-attention layers and a parallel attention architecture, enabling the
adapter to inherit the powerful priors of the pre-trained models when modeling
the new 3D knowledge. Moreover, we present a unified condition encoder that
seamlessly integrates camera parameters and geometric information, facilitating
applications such as text- and image-based 3D generation and texturing.
MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion
XL (SDXL), and demonstrates adaptability and versatility. It can also be
extended to arbitrary view generation, enabling broader applications. We
demonstrate that MV-Adapter sets a new quality standard for multi-view image
generation, and opens up new possibilities due to its efficiency, adaptability
and versatility.
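
The duplicated self-attention layers and parallel attention architecture mentioned in the abstract can be illustrated with a toy example. The following is a minimal sketch in pure Python, not the authors' implementation. It assumes single-head attention, a new multi-view branch whose weights are copied from the frozen self-attention, and a zero-initialized output projection on that branch, so that the block initially reproduces the frozen model's output. The zero initialization, the choice to attend over the tokens of all views, and every name here (`attention`, `duplicate_for_adapter`, `parallel_block`) are assumptions made for illustration; residual connections are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, ctx, w):
    """Single-head attention: queries from x, keys and values from ctx."""
    q, k, v = matmul(x, w["q"]), matmul(ctx, w["k"]), matmul(ctx, w["v"])
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, [list(col) for col in zip(*k)])  # q @ k^T
    probs = [softmax([s / scale for s in row]) for row in scores]
    return matmul(matmul(probs, v), w["o"])

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def duplicate_for_adapter(frozen):
    """Copy the frozen attention weights into a new trainable branch.

    Assumption: the copied output projection is zeroed, so the new
    branch contributes nothing until it is trained.
    """
    new = {name: [row[:] for row in mat] for name, mat in frozen.items()}
    new["o"] = [[0.0] * len(row) for row in new["o"]]
    return new

def parallel_block(views, frozen, adapter):
    """views: list of per-view token matrices, each [tokens][dim]."""
    all_tokens = [tok for view in views for tok in view]  # tokens of every view
    out = []
    for view in views:
        base = attention(view, view, frozen)          # original per-view self-attention
        extra = attention(view, all_tokens, adapter)  # new branch, fed the same input
        out.append([[b + e for b, e in zip(br, er)] for br, er in zip(base, extra)])
    return out

rng = random.Random(0)
dim = 4
frozen = {name: rand_matrix(dim, dim, rng) for name in ("q", "k", "v", "o")}
adapter = duplicate_for_adapter(frozen)
views = [rand_matrix(3, dim, rng) for _ in range(2)]  # 2 views, 3 tokens each
out = parallel_block(views, frozen, adapter)
baseline = [attention(v, v, frozen) for v in views]
```

In this sketch both branches read the same input and their outputs are summed, which is what makes the arrangement parallel rather than serial: the new branch never consumes the frozen branch's output, so the frozen layer keeps operating on the features it was trained with. Under the zero-initialization assumption, `out` matches `baseline` before any training.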