MV-Adapter: Multi-view Consistent Image Generation Made Easy
December 4, 2024
Authors: Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, Lu Sheng
cs.AI
Abstract
Existing multi-view image generation methods often make invasive
modifications to pre-trained text-to-image (T2I) models and require full
fine-tuning, leading to (1) high computational costs, especially with large
base models and high-resolution images, and (2) degradation in image quality
due to optimization difficulties and scarce high-quality 3D data. In this
paper, we propose the first adapter-based solution for multi-view image
generation, and introduce MV-Adapter, a versatile plug-and-play adapter that
enhances T2I models and their derivatives without altering the original network
structure or feature space. By updating fewer parameters, MV-Adapter enables
efficient training and preserves the prior knowledge embedded in pre-trained
models, mitigating overfitting risks. To efficiently model the 3D geometric
knowledge within the adapter, we introduce innovative designs that include
duplicated self-attention layers and a parallel attention architecture, enabling
the adapter to inherit the powerful priors of the pre-trained models to model
the novel 3D knowledge. Moreover, we present a unified condition encoder that
seamlessly integrates camera parameters and geometric information, facilitating
applications such as text- and image-based 3D generation and texturing.
MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion
XL (SDXL), and demonstrates adaptability and versatility. It can also be
extended to arbitrary view generation, enabling broader applications. We
demonstrate that MV-Adapter sets a new quality standard for multi-view image
generation, and opens up new possibilities due to its efficiency, adaptability
and versatility.
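
The abstract names duplicated self-attention layers and a parallel attention architecture without giving details, so the following is only a minimal NumPy sketch of how such a design could look, not the authors' implementation. The `Attention` class, `duplicate_layer`, `parallel_block`, the residual sum of the two branches, and the zero-initialized output projection of the duplicated layer are all assumptions made for illustration.

```python
import copy

import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class Attention:
    """Toy single-head attention layer with Q/K/V/output projection matrices."""

    def __init__(self, dim, rng):
        self.wq, self.wk, self.wv, self.wo = (
            rng.normal(0.0, dim ** -0.5, (dim, dim)) for _ in range(4)
        )

    def __call__(self, x):
        # x has shape (batch, tokens, dim); attention runs within each batch item.
        q, k, v = x @ self.wq, x @ self.wk, x @ self.wv
        weights = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(x.shape[-1]))
        return weights @ v @ self.wo


def duplicate_layer(layer):
    """Copy a pre-trained layer so the new branch starts from its weights.

    Assumption: the copy's output projection is zeroed so that the new
    branch contributes nothing before any adapter training.
    """
    clone = copy.deepcopy(layer)
    clone.wo = np.zeros_like(clone.wo)
    return clone


def parallel_block(x, self_attn, mv_attn):
    # x has shape (views, tokens, dim).
    views, tokens, dim = x.shape
    per_view = self_attn(x)  # original branch: each view attends only to itself
    # Adapter branch (assumed): tokens from all views attend to each other.
    joint = mv_attn(x.reshape(1, views * tokens, dim)).reshape(views, tokens, dim)
    # Both branches read the same input and are summed, rather than chained.
    return x + per_view + joint


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))           # 4 views, 6 tokens each, 8 channels
self_attn = Attention(8, rng)            # stands in for a frozen pre-trained layer
mv_attn = duplicate_layer(self_attn)     # trainable multi-view copy
out_init = parallel_block(x, self_attn, mv_attn)
baseline = x + self_attn(x)              # output of the original block alone
```

Under these assumptions, `out_init` equals `baseline`, which illustrates how an adapter can be attached without changing the original network's behavior at initialization. Once the duplicated layer's output projection becomes nonzero during training, the second branch starts mixing information across views.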