Compass Control: Multi Object Orientation Control for Text-to-Image Generation
April 9, 2025
Authors: Rishubh Parihar, Vaibhav Agrawal, Sachidanand VS, R. Venkatesh Babu
cs.AI
Abstract
Existing approaches for controlling text-to-image diffusion models, while
powerful, do not allow for explicit 3D object-centric control, such as precise
control of object orientation. In this work, we address the problem of
multi-object orientation control in text-to-image diffusion models. This
enables the generation of diverse multi-object scenes with precise orientation
control for each object. The key idea is to condition the diffusion model with
a set of orientation-aware compass tokens, one for each object, along
with text tokens. A lightweight encoder network predicts these compass tokens,
taking object orientation as input. The model is trained on a synthetic
dataset of procedurally generated scenes, each containing one or two 3D assets
on a plain background. However, directly training this framework results in poor
orientation control and leads to entanglement among objects. To mitigate
this, we intervene in the generation process and constrain the cross-attention
maps of each compass token to its corresponding object regions. The trained
model is able to achieve precise orientation control for a) complex objects not
seen during training and b) multi-object scenes with more than two objects,
indicating strong generalization capabilities. Further, when combined with
personalization methods, our method precisely controls the orientation of the
new object in diverse contexts. Our method achieves state-of-the-art
orientation control and text alignment, quantified with extensive evaluations
and a user study.
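
The abstract does not say how the cross-attention maps are constrained, so the following is only a minimal sketch of one plausible mechanism: masking attention logits so that a compass token can receive attention only from pixels inside its object's region. The function name `masked_cross_attention`, the data layout, and the use of hard masking are all assumptions made for illustration, not the authors' implementation.

```python
import math


def masked_cross_attention(logits, token_masks):
    """Softmax over tokens per pixel, with region-constrained tokens.

    logits[p][t]  : attention score of pixel p for token t.
    token_masks   : hypothetical dict {token_index: [0/1 per pixel]},
                    where 1 means the pixel lies inside that token's
                    object region. Tokens absent from the dict (for
                    example ordinary text tokens) are unconstrained.

    Assumes every pixel has at least one allowed token; otherwise the
    softmax below would be undefined.
    """
    weights = []
    for p, row in enumerate(logits):
        # Block a constrained token wherever the pixel is outside its region.
        masked = [
            -math.inf if (t in token_masks and not token_masks[t][p]) else score
            for t, score in enumerate(row)
        ]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy scene: 4 pixels, token 0 is a text token, tokens 1 and 2 are
# compass tokens for two objects covering pixels {0, 1} and {2, 3}.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
token_masks = {1: [1, 1, 0, 0], 2: [0, 0, 1, 1]}
attention = masked_cross_attention(logits, token_masks)
```

In this toy scene each compass token ends up with zero attention weight outside its own object's pixels, which is the kind of separation the abstract describes as reducing entanglement among objects.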