3DIS-FLUX: Simple and Efficient Multi-Instance Generation with DiT Rendering

Abstract

The growing demand for controllable outputs in text-to-image generation has driven significant advances in multi-instance generation (MIG), which lets users define both instance layouts and attributes. The current state-of-the-art MIG methods are primarily adapter-based. However, these methods require retraining a new adapter each time a more advanced model is released, which consumes significant resources. Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, decoupling MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. 3DIS requires adapter training only in the scene construction phase, while allowing various models to perform training-free detail rendering. Initially, 3DIS focused on rendering with U-Net architectures such as SD1.5, SD2, and SDXL, and did not explore the potential of recent DiT-based models such as FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth-map-controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach enables precise rendering of the fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX outperforms the original 3DIS method, which used SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
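The abstract does not spell out how the Attention Mask is built from the layout, so the following is only a minimal sketch of one plausible scheme, not the authors' implementation. It assumes a joint token sequence that places each instance's text tokens first, followed by the image tokens in row-major order. It also assumes a simple rule: image tokens see all image tokens, each instance's text tokens see only each other, and an image token is tied to an instance's text when its patch centre falls inside that instance's box. The function name `build_layout_attention_mask` and all of its parameters are hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]; a hypothetical layout format.
Box = Tuple[float, float, float, float]


def build_layout_attention_mask(
    boxes: List[Box],
    text_lens: List[int],
    grid_h: int,
    grid_w: int,
) -> List[List[bool]]:
    """Return a square boolean mask; mask[q][k] is True when query q may attend to key k.

    Assumed token order: text tokens of instance 0, instance 1, ...,
    followed by grid_h * grid_w image tokens in row-major order.
    """
    if len(boxes) != len(text_lens):
        raise ValueError("boxes and text_lens must have the same length")

    n_text = sum(text_lens)
    n = n_text + grid_h * grid_w
    mask = [[False] * n for _ in range(n)]

    # Assumption: image tokens attend to every image token for global coherence.
    for q in range(n_text, n):
        for k in range(n_text, n):
            mask[q][k] = True

    start = 0
    for box, length in zip(boxes, text_lens):
        text_ids = range(start, start + length)
        start += length

        # Text tokens of one instance only see each other.
        for q in text_ids:
            for k in text_ids:
                mask[q][k] = True

        # Tie image tokens whose patch centre lies in the box to this instance's text.
        x0, y0, x1, y1 = box
        for row in range(grid_h):
            for col in range(grid_w):
                cx = (col + 0.5) / grid_w
                cy = (row + 0.5) / grid_h
                if x0 <= cx <= x1 and y0 <= cy <= y1:
                    img_id = n_text + row * grid_w + col
                    for t in text_ids:
                        mask[img_id][t] = True
                        mask[t][img_id] = True
    return mask
```

For example, `build_layout_attention_mask([(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)], [2, 3], 2, 4)` yields a 13 x 13 mask in which the left half of the image is bound to the first instance's two text tokens and the right half to the second instance's three. A boolean mask of this shape follows the convention used by attention routines such as PyTorch's `scaled_dot_product_attention`, where `True` marks a position that takes part in attention. How the mask is actually wired into FLUX's Joint Attention is not covered here.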
