Training-free Regional Prompting for Diffusion Transformers

Abstract

Diffusion models have demonstrated excellent capabilities in text-to-image generation. Their semantic understanding (i.e., prompt following) ability has also been greatly improved with large language models (e.g., T5, Llama). However, existing models cannot perfectly handle long and complex text prompts, especially when the prompts contain various objects with numerous attributes and interrelated spatial relationships. While many regional prompting methods have been proposed for UNet-based models (e.g., SD1.5, SDXL), there are still no implementations based on the recent Diffusion Transformer (DiT) architecture, such as SD3 and FLUX.1. In this report, we propose and implement regional prompting for FLUX.1 based on attention manipulation, which gives DiT fine-grained compositional text-to-image generation capability in a training-free manner. Code is available at https://github.com/antonioo-c/Regional-Prompting-FLUX.
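To make "attention manipulation" concrete, here is a minimal NumPy sketch of one way a regional attention mask could be built for a joint text-and-image token sequence. It is an illustration under stated assumptions, not the report's implementation: the function names, the box format for regions, the fixed number of tokens per regional prompt, and the choice to leave image-to-image attention unrestricted are all hypothetical.

```python
import numpy as np

def region_attention_mask(grid_h, grid_w, regions, tokens_per_prompt):
    """Boolean mask over [text tokens | image tokens]; True means attention is allowed.

    regions: one (top, left, bottom, right) box per regional prompt, in
    latent-grid cells, half-open (hypothetical format chosen for this sketch).
    """
    n_txt = len(regions) * tokens_per_prompt
    n_img = grid_h * grid_w
    mask = np.zeros((n_txt + n_img, n_txt + n_img), dtype=bool)

    # Assumption: image tokens may always attend to each other.
    mask[n_txt:, n_txt:] = True

    for i, (top, left, bottom, right) in enumerate(regions):
        txt_idx = np.arange(i * tokens_per_prompt, (i + 1) * tokens_per_prompt)
        in_region = np.zeros((grid_h, grid_w), dtype=bool)
        in_region[top:bottom, left:right] = True
        img_idx = n_txt + np.flatnonzero(in_region)

        mask[np.ix_(txt_idx, txt_idx)] = True   # prompt i attends to itself
        mask[np.ix_(img_idx, txt_idx)] = True   # region i image tokens -> prompt i
        mask[np.ix_(txt_idx, img_idx)] = True   # prompt i -> region i image tokens
    return mask

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with disallowed pairs masked out."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Two regional prompts (3 tokens each) on a 2x4 latent grid: left half, right half.
mask = region_attention_mask(2, 4, [(0, 0, 2, 2), (0, 2, 2, 4)], tokens_per_prompt=3)
rng = np.random.default_rng(0)
q = rng.normal(size=(mask.shape[0], 8))
k = rng.normal(size=(mask.shape[0], 8))
v = rng.normal(size=(mask.shape[0], 8))
out = masked_attention(q, k, v, mask)
```

With this mask, an image token in the left half receives no contribution from the second prompt's tokens, which is the basic property a regional prompting scheme relies on. In a real DiT the same idea would be applied inside the model's attention layers rather than to standalone arrays.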
