Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement

Abstract

In this paper, we present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition. Regional prompting, or compositional generation, enables fine-grained spatial control and has gained increasing attention for its practicality in real-world applications. However, previous methods either introduce additional trainable modules, and are thus applicable only to specific models, or manipulate score maps within cross-attention layers using attention masks, which results in limited control strength as the number of regions increases. To address these limitations, we decouple multi-region generation into two sub-tasks: the construction of individual regions (Regional Hard Binding), which ensures that each regional prompt is properly executed, and overall detail refinement across regions (Regional Soft Refinement), which removes visual boundaries and enhances interactions between adjacent regions. Furthermore, RAG introduces a novel repainting capability: users can modify specific unsatisfactory regions of the previous generation while keeping all other regions unchanged, without relying on additional inpainting models. Our approach is tuning-free and can be applied to other frameworks to enhance their prompt-following ability. Quantitative and qualitative experiments demonstrate that RAG achieves superior performance on attribute binding and object relationships compared with previous tuning-free methods.
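The two-stage decomposition described in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so the `Region` type, the box-overwrite rule in `hard_bind`, the linear blend in `soft_refine`, and the weight `delta` are all hypothetical stand-ins, and small grids of floats take the place of real diffusion latents.

```python
from dataclasses import dataclass
from typing import List

Grid = List[List[float]]  # toy stand-in for a 2D latent


@dataclass
class Region:
    # Hypothetical region spec: half-open box [top, bottom) x [left, right).
    top: int
    left: int
    bottom: int
    right: int
    latent: Grid  # stand-in for a latent produced from this region's own prompt


def hard_bind(base: Grid, regions: List[Region]) -> Grid:
    """Assumed 'hard binding': overwrite each box in a copy of `base`
    with the corresponding region's latent values."""
    out = [row[:] for row in base]
    for r in regions:
        for i in range(r.top, r.bottom):
            for j in range(r.left, r.right):
                out[i][j] = r.latent[i - r.top][j - r.left]
    return out


def soft_refine(global_latent: Grid, bound: Grid, delta: float) -> Grid:
    """Assumed 'soft refinement': blend the bound latent with a global one,
    (1 - delta) * global + delta * bound, to soften region boundaries."""
    return [
        [(1.0 - delta) * g + delta * b for g, b in zip(g_row, b_row)]
        for g_row, b_row in zip(global_latent, bound)
    ]


# Two side-by-side regions on a 4x4 grid.
base = [[0.0] * 4 for _ in range(4)]
left = Region(0, 0, 4, 2, [[1.0] * 2 for _ in range(4)])
right = Region(0, 2, 4, 4, [[2.0] * 2 for _ in range(4)])

bound = hard_bind(base, [left, right])
refined = soft_refine(base, bound, delta=0.5)
```

In this sketch, `hard_bind` guarantees that each region's content lands exactly in its box, while `soft_refine` trades some of that strictness for consistency with the global latent. How the actual method balances the two stages is not specified in the abstract.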
