VMix: Improving Text-to-Image Diffusion Model with Cross-Attention Mixing Control

Abstract

Although diffusion models show extraordinary capability in text-to-image generation, they may still fail to generate highly aesthetic images. More specifically, a gap remains between generated images and real-world aesthetic images in finer-grained dimensions such as color, lighting, and composition. In this paper, we propose the Cross-Attention Value Mixing Control (VMix) Adapter, a plug-and-play aesthetics adapter that improves the quality of generated images while maintaining generality across visual concepts by (1) disentangling the input text prompt into a content description and an aesthetic description through the initialization of an aesthetic embedding, and (2) integrating aesthetic conditions into the denoising process through value-mixed cross-attention, with the network connected by zero-initialized linear layers. Our key insight is to enhance the aesthetic presentation of existing diffusion models by designing a superior condition control method while preserving image-text alignment. By design, VMix is flexible enough to be applied to community models for better visual performance without retraining. To validate the effectiveness of our method, we conducted extensive experiments, showing that VMix outperforms other state-of-the-art methods and is compatible with other community modules (e.g., LoRA, ControlNet, and IPAdapter) for image generation. The project page is https://vmix-diffusion.github.io/VMix/.
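To make the role of the zero-initialized linear layer concrete, the following is a minimal sketch in plain Python and not the authors' implementation. The abstract does not say how the aesthetic condition is mixed into the values, so the sketch assumes a single pooled aesthetic vector added to every value token, with the attention weights shared between the two branches. All function and variable names here are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention_weights(queries, keys):
    """Scaled dot-product attention weights from image queries to text keys."""
    dim = len(queries[0])
    keys_t = [list(col) for col in zip(*keys)]
    scores = matmul(queries, keys_t)
    return softmax_rows([[s / math.sqrt(dim) for s in row] for row in scores])

def value_mixed_cross_attention(queries, keys, values, aes_vector, w_out):
    """Base cross-attention plus an aesthetic branch gated by w_out.

    Assumption (not from the source): the aesthetic condition is one
    vector added to every value token, and both branches share the same
    attention weights. w_out stands in for the zero-initialized linear layer.
    """
    attn = cross_attention_weights(queries, keys)
    base_out = matmul(attn, values)
    mixed_values = [[v + a for v, a in zip(row, aes_vector)] for row in values]
    aes_out = matmul(matmul(attn, mixed_values), w_out)
    return [[b + a for b, a in zip(b_row, a_row)]
            for b_row, a_row in zip(base_out, aes_out)]

rng = random.Random(0)
dim, n_img, n_txt = 4, 3, 5

def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

queries, keys, values = rand_matrix(n_img, dim), rand_matrix(n_txt, dim), rand_matrix(n_txt, dim)
aes_vector = [rng.uniform(-1.0, 1.0) for _ in range(dim)]

# With a zero-initialized output layer the adapter is a no-op.
zero_w = [[0.0] * dim for _ in range(dim)]
base = matmul(cross_attention_weights(queries, keys), values)
at_init = value_mixed_cross_attention(queries, keys, values, aes_vector, zero_w)

# With a non-zero (here identity) layer the aesthetic branch contributes.
identity_w = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
after_training = value_mixed_cross_attention(queries, keys, values, aes_vector, identity_w)
```

Under these assumptions, `at_init` equals the output of the unmodified cross-attention, which is one way to read the claim that the adapter can be attached without disturbing the base model at the start of training. The aesthetic branch only changes the result once the connecting layer has non-zero weights.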
