3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
January 9, 2025
Authors: Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
cs.AI
Abstract
The growing demand for controllable outputs in text-to-image generation has
driven significant advancements in multi-instance generation (MIG), enabling
users to define both instance layouts and attributes. Currently, the
state-of-the-art methods in MIG are primarily adapter-based. However, these
methods necessitate retraining a new adapter each time a more advanced model is
released, resulting in significant resource consumption. A methodology named
Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which
decouples MIG into two distinct phases: 1) depth-based scene construction and
2) detail rendering with widely pre-trained depth control models. The 3DIS
method requires adapter training solely during the scene construction phase,
while enabling various models to perform training-free detail rendering.
Initially, 3DIS focused on rendering techniques utilizing U-Net architectures
such as SD1.5, SD2, and SDXL, without exploring the potential of recent
DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension
of the 3DIS framework that integrates the FLUX model for enhanced rendering
capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map
controlled image generation and introduce a detail renderer that manipulates
the Attention Mask in FLUX's Joint Attention mechanism based on layout
information. This approach allows for the precise rendering of fine-grained
attributes of each instance. Our experimental results indicate that 3DIS-FLUX,
leveraging the FLUX model, outperforms the original 3DIS method, which utilized
SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in
terms of both performance and image quality. Project Page:
https://limuloo.github.io/3DIS/.
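The abstract describes restricting attention by layout: image tokens inside an instance's bounding box are paired with that instance's text tokens via an attention mask in the joint-attention layers. A minimal sketch of this idea is below; the function name, token ordering (global text tokens, then per-instance text tokens, then flattened image tokens), and all shapes are illustrative assumptions, not the actual FLUX or 3DIS-FLUX implementation.

```python
import numpy as np

def build_layout_attention_mask(boxes, img_hw, tokens_per_instance, global_tokens):
    """Hypothetical sketch: build a boolean joint-attention mask so that
    image tokens inside each instance's bounding box may attend to that
    instance's text tokens (plus shared global tokens), and vice versa.
    boxes: list of (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    img_hw: (H, W) size of the latent token grid."""
    H, W = img_hw
    n_img = H * W
    n_txt = global_tokens + tokens_per_instance * len(boxes)
    n = n_txt + n_img
    mask = np.zeros((n, n), dtype=bool)  # True = attention allowed

    # Every token may attend to the shared global text tokens.
    mask[:, :global_tokens] = True
    # Image-to-image attention is left open (spatial self-attention).
    mask[n_txt:, n_txt:] = True

    for i, (x0, y0, x1, y1) in enumerate(boxes):
        t0 = global_tokens + i * tokens_per_instance
        t1 = t0 + tokens_per_instance
        # Instance text tokens attend among themselves.
        mask[t0:t1, t0:t1] = True
        # Mark image tokens whose grid cell falls inside the box.
        ys = slice(int(y0 * H), max(int(y0 * H) + 1, int(y1 * H)))
        xs = slice(int(x0 * W), max(int(x0 * W) + 1, int(x1 * W)))
        grid = np.zeros((H, W), dtype=bool)
        grid[ys, xs] = True
        img_idx = n_txt + np.flatnonzero(grid.ravel())
        # Pair those image tokens with this instance's text tokens.
        mask[np.ix_(img_idx, np.arange(t0, t1))] = True
        mask[np.ix_(np.arange(t0, t1), img_idx)] = True
    return mask

# Two instances splitting a 4x4 latent grid into left and right halves.
mask = build_layout_attention_mask(
    boxes=[(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)],
    img_hw=(4, 4), tokens_per_instance=3, global_tokens=2,
)
print(mask.shape)  # (24, 24): 2 + 3*2 text tokens + 16 image tokens
```

In this toy layout, an image token in the left half can attend to the first instance's text tokens but not the second's, which is the mechanism the paper credits with precise rendering of each instance's fine-grained attributes.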