MV-Adapter: Multi-view Consistent Image Generation Made Easy

December 4, 2024
Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng
cs.AI

Abstract

Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.
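
The "duplicated self-attention layers and parallel attention architecture" mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes, for illustration only, that the duplicated layer starts as a copy of the pre-trained self-attention weights, that its output projection is zero-initialized, that it attends over the tokens of all views, and that its output is added alongside the frozen branch. The names `Attention` and `ParallelBlock` are hypothetical, and the code uses plain Python lists rather than a deep learning framework.

```python
import copy
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


class Attention:
    """Single-head attention with query, key, value and output projections."""

    def __init__(self, dim, rng):
        def init():
            return [[rng.gauss(0.0, 0.2) for _ in range(dim)] for _ in range(dim)]

        self.dim = dim
        self.wq, self.wk, self.wv, self.wo = init(), init(), init(), init()

    def __call__(self, x, context):
        q = matmul(x, self.wq)
        k = matmul(context, self.wk)
        v = matmul(context, self.wv)
        k_t = [list(col) for col in zip(*k)]  # transpose keys
        scale = 1.0 / math.sqrt(self.dim)
        scores = [[s * scale for s in row] for row in matmul(q, k_t)]
        weights = [softmax(row) for row in scores]
        return matmul(matmul(weights, v), self.wo)


class ParallelBlock:
    """Frozen self-attention plus a duplicated branch added in parallel."""

    def __init__(self, pretrained):
        self.self_attn = pretrained               # pre-trained branch, left untouched
        self.mv_attn = copy.deepcopy(pretrained)  # duplicate that inherits the weights
        # Assumption: a zero output projection makes the new branch a no-op at
        # the start, so the block initially reproduces the pre-trained features.
        dim = pretrained.dim
        self.mv_attn.wo = [[0.0] * dim for _ in range(dim)]

    def __call__(self, views):
        # Assumption: the duplicated branch attends over tokens from every view.
        all_tokens = [token for view in views for token in view]
        outputs = []
        for x in views:
            spatial = self.self_attn(x, x)
            cross_view = self.mv_attn(x, all_tokens)
            outputs.append(add(add(x, spatial), cross_view))
        return outputs


rng = random.Random(0)
base = Attention(dim=4, rng=rng)
# Two views, three tokens per view, four channels per token.
views = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)] for _ in range(2)]
block = ParallelBlock(base)
out = block(views)
```

Under these assumptions, `out` equals the residual output of the pre-trained layer alone until the duplicated branch's output projection is trained away from zero, which is one way an adapter could leave the original feature space intact while adding cross-view interaction.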
