EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM
December 12, 2024
Authors: Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
cs.AI
Abstract
Personalization of diffusion models has achieved significant progress. Conventional
tuning-free methods mostly encode multiple reference images by averaging their
image embeddings to form the injection condition, but such an image-independent
operation permits no interaction among the images and thus cannot capture the
visual elements that remain consistent across multiple references. Although the
tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent
elements within multiple images through the training process, it necessitates
specific finetuning for each distinct image group. This paper introduces
EasyRef, a novel plug-and-play adaptation method that enables diffusion models
to be conditioned on multiple reference images and the text prompt. To
effectively exploit consistent visual elements within multiple images, we
leverage the multi-image comprehension and instruction-following capabilities
of the multimodal large language model (MLLM), prompting it to capture
consistent visual elements based on the instruction. In addition, injecting the
MLLM's representations into the diffusion process through adapters generalizes
easily to unseen domains, mining the consistent visual elements within
unseen data. To mitigate computational costs and enhance fine-grained detail
preservation, we introduce an efficient reference aggregation strategy and a
progressive training scheme. Finally, we introduce MRBench, a new
multi-reference image generation benchmark. Experimental results demonstrate that
EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based
methods like LoRA, achieving superior aesthetic quality and robust zero-shot
generalization across diverse domains.
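
To make the conditioning contrast concrete, the following is a minimal PyTorch sketch (not the authors' released code). It contrasts the tuning-free baseline of averaging per-image embeddings with an adapter that injects aggregated MLLM reference tokens into a diffusion block through an extra cross-attention path. The module names, dimensions, and the IP-Adapter-style residual attention layout are illustrative assumptions, not EasyRef's actual architecture.

```python
# Illustrative sketch only: module names, dimensions, and the decoupled
# cross-attention layout are assumptions, not EasyRef's released design.
import torch
import torch.nn as nn


def average_embedding_condition(image_embeds: torch.Tensor) -> torch.Tensor:
    """Tuning-free baseline: average per-image embeddings into one condition.

    image_embeds: (num_refs, num_tokens, dim) tokens from an image encoder.
    The references never interact, so shared visual elements are not modeled.
    """
    return image_embeds.mean(dim=0, keepdim=True)  # (1, num_tokens, dim)


class ReferenceAdapter(nn.Module):
    """Projects aggregated MLLM reference tokens and adds an extra cross-attention
    path into a (stand-in) diffusion block, keeping pretrained weights untouched."""

    def __init__(self, mllm_dim: int = 4096, unet_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, unet_dim)      # map MLLM space -> UNet space
        self.ref_attn = nn.MultiheadAttention(unet_dim, num_heads=8, batch_first=True)

    def forward(self, latent_tokens: torch.Tensor, mllm_ref_tokens: torch.Tensor) -> torch.Tensor:
        """latent_tokens: (B, L, unet_dim) hidden states of a diffusion block.
        mllm_ref_tokens: (B, T, mllm_dim) tokens produced by an MLLM after jointly
        reading all reference images plus an instruction."""
        ref = self.proj(mllm_ref_tokens)                 # (B, T, unet_dim)
        out, _ = self.ref_attn(latent_tokens, ref, ref)  # latents attend to references
        return latent_tokens + out                       # residual injection


if __name__ == "__main__":
    # Baseline: 4 reference images, each encoded to 16 tokens of width 768.
    img_embeds = torch.randn(4, 16, 768)
    print(average_embedding_condition(img_embeds).shape)  # torch.Size([1, 16, 768])

    # Adapter path: one aggregated group of 64 MLLM tokens conditions the latents.
    adapter = ReferenceAdapter()
    latents = torch.randn(2, 77, 768)
    mllm_tokens = torch.randn(2, 64, 4096)
    print(adapter(latents, mllm_tokens).shape)  # torch.Size([2, 77, 768])
```

In this sketch, the key difference is that the MLLM path supplies a single group-level set of tokens derived from joint reading of all references, so cross-image interaction happens before injection, whereas the averaging baseline mixes embeddings without any interaction.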