EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
December 12, 2024
Authors: Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
cs.AI
Abstract
Personalization of diffusion models has achieved significant progress. Conventional
tuning-free methods mostly encode multiple reference images by averaging their
image embeddings to form the injection condition, but such an image-independent
operation permits no interaction among the images and thus cannot capture the
visual elements that remain consistent across multiple references. Although the
tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent
elements within multiple images through the training process, it necessitates
specific finetuning for each distinct image group. This paper introduces
EasyRef, a novel plug-and-play adaptation method that enables diffusion models
to be conditioned on multiple reference images and the text prompt. To
effectively exploit consistent visual elements within multiple images, we
leverage the multi-image comprehension and instruction-following capabilities
of the multimodal large language model (MLLM), prompting it to capture
consistent visual elements based on the instruction. In addition, injecting the
MLLM's representations into the diffusion process through adapters generalizes
easily to unseen domains, mining the consistent visual elements within
unseen data. To mitigate computational costs and enhance fine-grained detail
preservation, we introduce an efficient reference aggregation strategy and a
progressive training scheme. Finally, we introduce MRBench, a new
multi-reference image generation benchmark. Experimental results demonstrate that
EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based
methods like LoRA, achieving superior aesthetic quality and robust zero-shot
generalization across diverse domains.
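
To make the conditioning contrast concrete, the following is a minimal PyTorch sketch (not the authors' released code). It contrasts the tuning-free baseline of averaging per-image embeddings with an adapter that injects aggregated MLLM reference tokens into a diffusion block through an extra cross-attention path. The module names, dimensions, and the IP-Adapter-style residual attention layout are illustrative assumptions, not EasyRef's actual architecture.

```python
# Illustrative sketch only: module names, dimensions, and the decoupled
# cross-attention layout are assumptions, not EasyRef's released design.
import torch
import torch.nn as nn


def average_embedding_condition(image_embeds: torch.Tensor) -> torch.Tensor:
    """Tuning-free baseline: average per-image embeddings into one condition.

    image_embeds: (num_refs, num_tokens, dim) tokens from an image encoder.
    The references never interact, so shared visual elements are not modeled.
    """
    return image_embeds.mean(dim=0, keepdim=True)  # (1, num_tokens, dim)


class ReferenceAdapter(nn.Module):
    """Projects aggregated MLLM reference tokens and adds an extra cross-attention
    path into a (stand-in) diffusion block, keeping pretrained weights untouched."""

    def __init__(self, mllm_dim: int = 4096, unet_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, unet_dim)      # map MLLM space -> UNet space
        self.ref_attn = nn.MultiheadAttention(unet_dim, num_heads=8, batch_first=True)

    def forward(self, latent_tokens: torch.Tensor, mllm_ref_tokens: torch.Tensor) -> torch.Tensor:
        """latent_tokens: (B, L, unet_dim) hidden states of a diffusion block.
        mllm_ref_tokens: (B, T, mllm_dim) tokens produced by an MLLM after jointly
        reading all reference images plus an instruction."""
        ref = self.proj(mllm_ref_tokens)                 # (B, T, unet_dim)
        out, _ = self.ref_attn(latent_tokens, ref, ref)  # latents attend to references
        return latent_tokens + out                       # residual injection


if __name__ == "__main__":
    # Baseline: 4 reference images, each encoded to 16 tokens of width 768.
    img_embeds = torch.randn(4, 16, 768)
    print(average_embedding_condition(img_embeds).shape)  # torch.Size([1, 16, 768])

    # Adapter path: one aggregated group of 64 MLLM tokens conditions the latents.
    adapter = ReferenceAdapter()
    latents = torch.randn(2, 77, 768)
    mllm_tokens = torch.randn(2, 64, 4096)
    print(adapter(latents, mllm_tokens).shape)  # torch.Size([2, 77, 768])
```

In this sketch, the key difference is that the MLLM path supplies a single group-level set of tokens derived from joint reading of all references, so cross-image interaction happens before injection, whereas the averaging baseline mixes embeddings without any interaction.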