Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement
November 10, 2024
Authors: Zhennan Chen, Yajie Li, Haofan Wang, Zhibo Chen, Zhengkai Jiang, Jun Li, Qian Wang, Jian Yang, Ying Tai
cs.AI
Abstract
In this paper, we present RAG, a Regional-Aware text-to-image Generation
method conditioned on regional descriptions for precise layout composition.
Regional prompting, or compositional generation, which enables fine-grained
spatial control, has gained increasing attention for its practicality in
real-world applications. However, previous methods either introduce additional
trainable modules, and are thus applicable only to specific models, or manipulate
score maps within cross-attention layers using attention masks, resulting in
limited control strength when the number of regions increases. To handle these
limitations, we decouple multi-region generation into two sub-tasks: the
construction of individual regions (Regional Hard Binding), which ensures that
each regional prompt is properly executed, and overall detail refinement across
regions (Regional Soft Refinement), which dismisses visual boundaries and
enhances interactions between adjacent regions. Furthermore, RAG introduces a
repainting capability: users can modify specific unsatisfactory regions of the
previous generation while keeping all other regions unchanged, without relying
on additional inpainting models. Our approach is tuning-free and applicable to
other frameworks as an enhancement to their prompt-following ability.
Quantitative and qualitative experiments demonstrate that RAG achieves superior
performance in attribute binding and object relationships compared with previous
tuning-free methods.