Training-free Regional Prompting for Diffusion Transformers
November 4, 2024
Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang
cs.AI
Abstract
Diffusion models have demonstrated excellent capabilities in text-to-image
generation. Their semantic understanding (i.e., prompt following) ability has
also been greatly improved with large language models (e.g., T5, Llama).
However, existing models cannot perfectly handle long and complex text prompts,
especially when the text prompts contain multiple objects with numerous
attributes and interrelated spatial relationships. While many regional
prompting methods have been proposed for UNet-based models (e.g., SD1.5, SDXL),
there are still no implementations based on the recent Diffusion Transformer
(DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and
implement regional prompting for FLUX.1 based on attention manipulation, which
enables DiT with fine-grained compositional text-to-image generation
capability in a training-free manner. Code is available at
https://github.com/antonioo-c/Regional-Prompting-FLUX.
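
The paper's exact attention-manipulation scheme is detailed in the report and repository. Purely to illustrate the general idea of training-free regional control via attention masking in a joint text-image transformer (as in FLUX.1/SD3), here is a minimal sketch; the function name and all details below are hypothetical illustrations, not the authors' implementation:

```python
import torch

def build_regional_attention_mask(region_masks, prompt_lens):
    """Joint attention mask for a FLUX.1/SD3-style DiT block, where text
    and image tokens share one sequence: [txt_region_0, ..., txt_region_k,
    img tokens in row-major latent order].

    region_masks: list of (H, W) boolean tensors over the latent grid,
                  one per regional prompt.
    prompt_lens:  token count of each regional prompt, concatenated in
                  the same order along the text part of the sequence.
    Returns an (S, S) boolean mask; True means attention is allowed.
    """
    n_txt = sum(prompt_lens)
    n_img = region_masks[0].numel()
    s = n_txt + n_img
    mask = torch.zeros(s, s, dtype=torch.bool)

    # Image tokens may always attend to each other, so global layout
    # and lighting stay coherent across region boundaries.
    mask[n_txt:, n_txt:] = True

    offset = 0
    for r_mask, p_len in zip(region_masks, prompt_lens):
        img_idx = n_txt + r_mask.flatten().nonzero(as_tuple=True)[0]
        # Each regional prompt attends to its own tokens ...
        mask[offset:offset + p_len, offset:offset + p_len] = True
        # ... and exchanges information only with image tokens that
        # fall inside its region (both directions).
        mask[offset:offset + p_len, img_idx] = True
        mask[img_idx, offset:offset + p_len] = True
        offset += p_len
    return mask

# Example: left half follows a 24-token prompt, right half a 30-token one.
H, W = 64, 64
left = torch.zeros(H, W, dtype=torch.bool)
left[:, : W // 2] = True
mask = build_regional_attention_mask([left, ~left], [24, 30])
# The mask can then be passed as attn_mask (True = keep) to
# torch.nn.functional.scaled_dot_product_attention inside each block.
```

Because the mask only gates which tokens attend to each other, the pretrained weights are untouched, which is what makes such a scheme training-free; how text-to-text and image-to-image interactions across regions are handled is a design choice that the paper addresses in full.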