Compass Control: Multi Object Orientation Control for Text-to-Image Generation

Abstract

Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models, enabling the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model on a set of orientation-aware compass tokens, one per object, along with the text tokens. A lightweight encoder network predicts these compass tokens from each object's orientation. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, training this framework directly results in poor orientation control and in entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention map of each compass token to its corresponding object region. The trained model achieves precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization. Further, when combined with personalization methods, our method precisely controls the orientation of a new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, as quantified by extensive evaluations and a user study.
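The abstract does not say how the cross-attention maps are constrained. As a rough illustration only, and not the authors' implementation, the sketch below shows one plausible reading: for each pixel, the logit of a compass token is masked out before the softmax whenever the pixel lies outside that token's object region. The helper names `box_mask` and `masked_cross_attention`, the use of axis-aligned bounding boxes as object regions, and the toy 2x2 layout are all assumptions made for this example.

```python
import math


def box_mask(height, width, box):
    """Flat 0/1 mask over pixels in row-major order.

    box = (top, left, bottom, right), with bottom and right exclusive.
    Using a bounding box as the object region is an assumption.
    """
    top, left, bottom, right = box
    return [
        1 if top <= r < bottom and left <= c < right else 0
        for r in range(height)
        for c in range(width)
    ]


def masked_cross_attention(logits, token_masks):
    """Per-pixel softmax over tokens, with some tokens limited to a region.

    logits[p][t] is the attention logit of pixel p for token t.
    token_masks maps a token index to its flat 0/1 pixel mask; tokens
    that are absent from the dict (e.g. ordinary text tokens) are left
    unrestricted. Every pixel needs at least one unmasked token.
    """
    weights = []
    for p, row in enumerate(logits):
        masked = [
            -math.inf if (t in token_masks and token_masks[t][p] == 0) else v
            for t, v in enumerate(row)
        ]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy example: a 2x2 image and three tokens [text, compass_a, compass_b].
# Object A occupies the left column, object B the right column.
mask_a = box_mask(2, 2, (0, 0, 2, 1))
mask_b = box_mask(2, 2, (0, 1, 2, 2))
uniform_logits = [[0.0, 0.0, 0.0] for _ in range(4)]
attention = masked_cross_attention(uniform_logits, {1: mask_a, 2: mask_b})
for pixel, row in enumerate(attention):
    print(pixel, row)
```

In this toy setup each compass token receives zero attention weight outside its own column, so an orientation signal carried by one token cannot leak into the other object's pixels. A real diffusion model would apply the same idea to batched tensors inside its cross-attention layers.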

