OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Abstract

We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For each instance, the instruction can be given as either a text prompt or an image reference. Given a set of user-defined masks and the associated text or image guidance, the objective is to generate an image in which multiple objects are positioned at the specified coordinates and their attributes are precisely aligned with the corresponding guidance. This significantly expands the scope of text-to-image generation and makes it more versatile and practical in terms of controllability. Our core contribution is the proposed latent control signal, a high-dimensional spatial feature that provides a unified representation integrating the spatial, textual, and image conditions. The text condition extends ControlNet to provide instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, the method gives users more flexibility in controllable generation, since they can choose multi-modal conditions from text or images as needed. Thorough experiments demonstrate improved image synthesis fidelity and alignment across different tasks and datasets. Project page: https://len-li.github.io/omnibooth-web/
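To make the idea of a unified spatial feature more concrete, the following is a minimal sketch of one way such a signal could be assembled: each instance mask is filled with the embedding of its text or image guidance, yielding a single H x W x C feature map. The abstract does not describe how the latent control signal is actually constructed, so the function name `paint_latent_control`, the zero-valued background, the overwrite rule for overlapping masks, and the toy embeddings are all illustrative assumptions rather than the authors' method.

```python
from typing import List, Sequence

Mask = List[List[int]]        # H x W binary mask, 1 inside the instance
Embedding = Sequence[float]   # C-dimensional text or image embedding


def paint_latent_control(
    masks: Sequence[Mask],
    embeddings: Sequence[Embedding],
    height: int,
    width: int,
) -> List[List[List[float]]]:
    """Build an H x W x C map where each masked pixel holds its instance embedding.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not masks or len(masks) != len(embeddings):
        raise ValueError("need at least one mask and exactly one embedding per mask")
    channels = len(embeddings[0])
    # Assumption: background pixels keep an all-zero feature vector.
    control = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for mask, emb in zip(masks, embeddings):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # Assumption: later instances overwrite earlier ones on overlap.
                    control[y][x] = list(emb)
    return control


# Toy inputs: the vectors stand in for real text-encoder and image-encoder outputs.
text_emb = [1.0, 0.0, 0.0]
image_emb = [0.0, 0.5, 0.5]
mask_a = [[1, 1, 0, 0], [1, 1, 0, 0]]   # instance described by text
mask_b = [[0, 0, 0, 1], [0, 0, 0, 1]]   # instance described by a reference image
control = paint_latent_control([mask_a, mask_b], [text_emb, image_emb], 2, 4)
```

In this sketch the text-described and image-described instances end up in the same feature map, which is the sense in which one representation can carry spatial, textual, and image conditions together. A real system would use learned encoders and much higher-dimensional embeddings.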
