3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering

January 9, 2025
Authors: Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
cs.AI

Abstract

The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely available pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth-map-controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
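For the rendering phase, FLUX.1-Depth-dev consumes the depth map produced by phase-1 scene construction. Below is a minimal sketch of that step, assuming the diffusers `FluxControlPipeline` interface; the file name and prompt are placeholders, and this plain pipeline omits the paper's detail renderer (see the masking sketch that follows).

```python
import torch
from diffusers import FluxControlPipeline
from diffusers.utils import load_image

# Phase 2 of 3DIS: depth-driven rendering with FLUX.1-Depth-dev.
pipe = FluxControlPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Depth-dev", torch_dtype=torch.bfloat16
).to("cuda")

# The depth map would come from 3DIS's phase-1 scene construction;
# "scene_depth.png" is a hypothetical placeholder here.
depth_map = load_image("scene_depth.png")

image = pipe(
    prompt="a photo of the composed scene",  # illustrative global prompt
    control_image=depth_map,
    height=1024,
    width=1024,
    num_inference_steps=30,
    guidance_scale=10.0,
).images[0]
image.save("rendered.png")
```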
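The detail renderer constrains FLUX's Joint Attention so that each instance's text tokens interact only with the image tokens inside that instance's box. The sketch below shows one plausible way to build such a boolean mask from layout information; it is an illustration under stated assumptions (image tokens flattened row-major after the text tokens, boxes in normalized coordinates), not the paper's exact masking rule.

```python
import torch

def build_joint_attention_mask(txt_spans, boxes, txt_len, h, w):
    """Build a boolean mask for joint (text + image) attention.

    txt_spans: list of (start, end) token ranges, one per instance prompt
    boxes:     list of (x0, y0, x1, y1) boxes in [0, 1], one per instance
    txt_len:   total number of text tokens
    h, w:      latent grid size (image tokens = h * w)
    True = attention allowed.
    """
    img_len = h * w
    n = txt_len + img_len
    # Start fully permissive, then carve out per-instance constraints.
    mask = torch.ones(n, n, dtype=torch.bool)

    # Per-cell instance membership on the latent grid (cell centers).
    ys = (torch.arange(h).float() + 0.5) / h
    xs = (torch.arange(w).float() + 0.5) / w
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")

    for (t0, t1), (x0, y0, x1, y1) in zip(txt_spans, boxes):
        inside = (grid_x >= x0) & (grid_x < x1) & (grid_y >= y0) & (grid_y < y1)
        in_idx = txt_len + inside.flatten().nonzero(as_tuple=True)[0]
        out_idx = txt_len + (~inside).flatten().nonzero(as_tuple=True)[0]
        # Instance text tokens may only attend to image tokens in the box...
        mask[t0:t1, txt_len:] = False
        mask[t0:t1, in_idx] = True
        # ...and image tokens outside the box may not attend to them.
        mask[out_idx, t0:t1] = False

    return mask
```

A mask like this can then be passed as the attn_mask argument to torch.nn.functional.scaled_dot_product_attention (where True marks allowed positions) inside each joint-attention block during the rendering denoising steps.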
