Training-free Regional Prompting for Diffusion Transformers
November 4, 2024
Authors: Anthony Chen, Jianjin Xu, Wenzhao Zheng, Gaole Dai, Yida Wang, Renrui Zhang, Haofan Wang, Shanghang Zhang
cs.AI
Abstract
Diffusion models have demonstrated excellent capabilities in text-to-image
generation. Their semantic understanding (i.e., prompt-following) ability has
also been greatly improved with large language models (e.g., T5, Llama).
However, existing models cannot perfectly handle long and complex text prompts,
especially when the prompts contain multiple objects with numerous attributes
and interrelated spatial relationships. While many regional prompting methods
have been proposed for UNet-based models (SD1.5, SDXL), there are still no
implementations based on the recent Diffusion Transformer (DiT) architecture,
such as SD3 and FLUX.1. In this report, we propose and implement regional
prompting for FLUX.1 based on attention manipulation, which equips DiT with
fine-grained compositional text-to-image generation capability in a
training-free manner. Code is available at
https://github.com/antonioo-c/Regional-Prompting-FLUX.
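The abstract describes regional prompting via attention manipulation: image tokens belonging to a spatial region are restricted to attend only to the text tokens of that region's prompt. The sketch below shows one common way such a cross-attention mask can be constructed; it is a minimal illustration of the general idea, not the paper's actual implementation, and all function and variable names are assumptions.

```python
import numpy as np

def build_regional_mask(num_image_tokens, region_of_token, prompt_spans):
    """Build a boolean cross-attention mask for regional prompting.

    region_of_token: int array of length num_image_tokens giving the region
                     index of each image token (e.g., from a layout grid).
    prompt_spans:    list of (start, end) index ranges, one per region, into
                     the concatenated text-token sequence.
    Returns mask of shape (num_image_tokens, num_text_tokens) where
    mask[i, j] is True iff image token i may attend to text token j.
    """
    num_text_tokens = max(end for _, end in prompt_spans)
    mask = np.zeros((num_image_tokens, num_text_tokens), dtype=bool)
    for region, (start, end) in enumerate(prompt_spans):
        rows = np.flatnonzero(region_of_token == region)
        # Allow this region's image tokens to see only its own prompt tokens.
        mask[np.ix_(rows, np.arange(start, end))] = True
    return mask

# Toy example: 4 image tokens split into 2 regions; region 0's prompt
# occupies text tokens [0, 3), region 1's occupies [3, 5).
region_of_token = np.array([0, 0, 1, 1])
mask = build_regional_mask(4, region_of_token, [(0, 3), (3, 5)])
```

In a training-free setting, a mask like this would be applied inside the attention operator (e.g., as an additive `-inf` bias before the softmax) without modifying any model weights; how the regions and base prompt are blended in practice is specific to the paper's method.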