Training-free Regional Prompting for Diffusion Transformers
November 4, 2024
Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang
cs.AI
Abstract
Diffusion models have demonstrated excellent capabilities in text-to-image
generation. Their semantic understanding (i.e., prompt-following) ability has
also been greatly improved with large language models (e.g., T5, Llama).
However, existing models cannot perfectly handle long and complex text prompts,
especially when the prompts contain multiple objects with numerous attributes
and interrelated spatial relationships. While many regional prompting methods
have been proposed for UNet-based models (SD1.5, SDXL), there are still no
implementations based on the recent Diffusion Transformer (DiT) architecture,
such as SD3 and FLUX.1. In this report, we propose and implement regional
prompting for FLUX.1 based on attention manipulation, which equips DiT with
fine-grained compositional text-to-image generation capability in a
training-free manner. Code is available at
https://github.com/antonioo-c/Regional-Prompting-FLUX.
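The abstract describes regional prompting via attention manipulation: image tokens belonging to a spatial region are restricted to attend only to the text tokens of that region's prompt. The sketch below shows one common way such a cross-attention mask can be constructed; it is a minimal illustration of the general idea, not the paper's actual implementation, and all function and variable names are assumptions.

```python
import numpy as np

def build_regional_mask(num_image_tokens, region_of_token, prompt_spans):
    """Build a boolean cross-attention mask for regional prompting.

    region_of_token: int array of length num_image_tokens giving the region
                     index of each image token (e.g., from a layout grid).
    prompt_spans:    list of (start, end) index ranges, one per region, into
                     the concatenated text-token sequence.
    Returns mask of shape (num_image_tokens, num_text_tokens) where
    mask[i, j] is True iff image token i may attend to text token j.
    """
    num_text_tokens = max(end for _, end in prompt_spans)
    mask = np.zeros((num_image_tokens, num_text_tokens), dtype=bool)
    for region, (start, end) in enumerate(prompt_spans):
        rows = np.flatnonzero(region_of_token == region)
        # Allow this region's image tokens to see only its own prompt tokens.
        mask[np.ix_(rows, np.arange(start, end))] = True
    return mask

# Toy example: 4 image tokens split into 2 regions; region 0's prompt
# occupies text tokens [0, 3), region 1's occupies [3, 5).
region_of_token = np.array([0, 0, 1, 1])
mask = build_regional_mask(4, region_of_token, [(0, 3), (3, 5)])
```

In a training-free setting, a mask like this would be applied inside the attention operator (e.g., as an additive `-inf` bias before the softmax) without modifying any model weights; how the regions and base prompt are blended in practice is specific to the paper's method.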