Compass Control: Multi Object Orientation Control for Text-to-Image Generation

Abstract

Existing approaches for controlling text-to-image diffusion models, while powerful, do not allow for explicit 3D object-centric control, such as precise control of object orientation. In this work, we address the problem of multi-object orientation control in text-to-image diffusion models. This enables the generation of diverse multi-object scenes with precise orientation control for each object. The key idea is to condition the diffusion model on a set of orientation-aware compass tokens, one for each object, along with the text tokens. A lightweight encoder network predicts these compass tokens, taking each object's orientation as input. The model is trained on a synthetic dataset of procedurally generated scenes, each containing one or two 3D assets on a plain background. However, directly training this framework results in poor orientation control and in entanglement among objects. To mitigate this, we intervene in the generation process and constrain the cross-attention maps of each compass token to its corresponding object regions. The trained model achieves precise orientation control for a) complex objects not seen during training and b) multi-object scenes with more than two objects, indicating strong generalization capabilities. Further, when combined with personalization methods, our method precisely controls the orientation of the new object in diverse contexts. Our method achieves state-of-the-art orientation control and text alignment, as quantified by extensive evaluations and a user study.
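To make the two mechanisms in the abstract more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It illustrates (1) one plausible way to turn an orientation angle into periodic input features for an encoder, and (2) one plausible way to measure how much of a token's cross-attention falls outside its object region. The function names, the sinusoidal featurization, and the specific penalty form are all hypothetical; the abstract only states that a lightweight encoder takes the orientation as input and that cross-attention maps are constrained to object regions.

```python
import math


def orientation_features(theta_rad, num_freqs=4):
    """Map an angle to [sin(k*theta), cos(k*theta)] for k = 1..num_freqs.

    Periodic features give theta and theta + 2*pi the same encoding.
    ASSUMPTION: this featurization is illustrative, not taken from the paper.
    """
    feats = []
    for k in range(1, num_freqs + 1):
        feats.append(math.sin(k * theta_rad))
        feats.append(math.cos(k * theta_rad))
    return feats


def attention_outside_mask(attn, mask):
    """Fraction of a token's attention mass that lies outside its object mask.

    attn: 2D list of non-negative attention weights (H x W).
    mask: 2D list of 0/1 values with the same shape (1 = object region).
    Returns a value in [0, 1]; 0 means all attention is inside the region.
    ASSUMPTION: a hypothetical penalty; the paper's constraint may differ.
    """
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.0
    inside = sum(
        a * m
        for attn_row, mask_row in zip(attn, mask)
        for a, m in zip(attn_row, mask_row)
    )
    return 1.0 - inside / total


# Example: a 2x2 attention map whose mass mostly overlaps the object mask.
features = orientation_features(math.radians(30))
penalty = attention_outside_mask(
    [[0.1, 0.4], [0.4, 0.1]],
    [[0, 1], [1, 0]],
)
```

In a training loop, a quantity like `penalty` could be driven toward zero for each compass token so that its attention stays on its own object, which is one way to discourage the entanglement the abstract describes.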
