EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
December 12, 2024
Authors: Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
cs.AI
Abstract
Significant achievements have been made in the personalization of diffusion models. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot model interactions among images to capture the visual elements that are consistent across multiple references. Although tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements from multiple images through training, it requires specific finetuning for each distinct group of images. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and a text prompt. To effectively exploit consistent visual elements across multiple images, we leverage the multi-image comprehension and instruction-following capabilities of a multimodal large language model (MLLM), prompting it to capture consistent visual elements according to an instruction. Moreover, injecting the MLLM's representations into the diffusion process through adapters generalizes easily to unseen domains, mining consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we present MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate that EasyRef surpasses both tuning-free methods such as IP-Adapter and tuning-based methods such as LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
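To make the contrast in the abstract concrete, below is a minimal PyTorch sketch of the two conditioning strategies it describes: naive averaging of per-image embeddings versus jointly aggregating an MLLM's multi-image hidden states into a compact set of reference tokens that an adapter could inject into the diffusion model. The module names, dimensions, and the learnable-query aggregator are illustrative assumptions, not EasyRef's actual implementation.

```python
import torch
import torch.nn as nn

# Assumed interface: some MLLM produces hidden states of shape
# (batch, seq_len, dim) covering all reference images plus the instruction.
# The aggregator below is a generic learnable-query design, used here only
# to illustrate cross-image interaction; it is not the paper's architecture.

class ReferenceAggregator(nn.Module):
    """Compress MLLM hidden states into a small, fixed set of reference tokens."""

    def __init__(self, num_queries: int = 64, dim: int = 4096, out_dim: int = 2048):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, mllm_hidden: torch.Tensor) -> torch.Tensor:
        # mllm_hidden: (batch, seq_len, dim) tokens spanning every reference image,
        # so the queries can attend across images and pick up shared visual elements.
        q = self.queries.unsqueeze(0).expand(mllm_hidden.size(0), -1, -1)
        agg, _ = self.attn(q, mllm_hidden, mllm_hidden)
        return self.proj(agg)  # (batch, num_queries, out_dim) reference tokens


def naive_average_condition(clip_embeds: torch.Tensor) -> torch.Tensor:
    """Tuning-free baseline: average per-image embeddings (batch, n_refs, dim).

    Each image is encoded independently, so there is no cross-image interaction
    and no mechanism to isolate the elements shared by the references.
    """
    return clip_embeds.mean(dim=1, keepdim=True)
```

In a full pipeline, one plausible wiring (an assumption, not a detail given in the abstract) is to feed the aggregated reference tokens, alongside the text embeddings, to trainable cross-attention adapters inside the diffusion U-Net while keeping the MLLM and base diffusion weights frozen, which is what makes the method plug-and-play.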