Generating Compositional Scenes via Text-to-image RGBA Instance Generation

Abstract

Text-to-image diffusion generative models can generate high-quality images, but at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability and fine-grained control over object attributes. The concept of multi-layer generation holds great potential to address these limitations; however, generating image instances concurrently with scene composition limits control over fine-grained object attributes, relative positioning in 3D space, and scene manipulation. In this work, we propose a novel multi-stage generation paradigm designed for fine-grained control, flexibility, and interactivity. To ensure control over instance attributes, we devise a novel training paradigm that adapts a diffusion model to generate isolated scene components as RGBA images with transparency information. To build complex images, we employ these pre-generated instances and introduce a multi-layer composite generation process that smoothly assembles the components into realistic scenes. Our experiments show that our RGBA diffusion model can generate diverse, high-quality instances with precise control over object attributes. Through multi-layer composition, we demonstrate that our approach makes it possible to build and manipulate images from highly complex prompts with fine-grained control over object appearance and location, granting a higher degree of control than competing methods.
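To make the role of the transparency channel concrete, the sketch below stacks RGBA instances onto a canvas with the classic Porter-Duff "over" operator. It is an illustration only: the abstract describes a multi-layer composite generation process inside a diffusion model, and this plain alpha compositing is not that method. All names (`over`, `paste`, `compose`) and the straight-alpha pixel representation are assumptions made for the example.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: straight (non-premultiplied) RGBA, channels in [0, 1].
Pixel = Tuple[float, float, float, float]
Layer = List[List[Pixel]]  # rows of pixels


def over(top: Pixel, bottom: Pixel) -> Pixel:
    """Porter-Duff 'over': place `top` in front of `bottom`."""
    tr, tg, tb, ta = top
    br, bg, bb, ba = bottom
    out_a = ta + ba * (1.0 - ta)
    if out_a == 0.0:
        # Both pixels fully transparent: colour is undefined, return clear black.
        return (0.0, 0.0, 0.0, 0.0)

    def blend(t: float, b: float) -> float:
        return (t * ta + b * ba * (1.0 - ta)) / out_a

    return (blend(tr, br), blend(tg, bg), blend(tb, bb), out_a)


def paste(canvas: Layer, instance: Layer, x: int, y: int) -> Layer:
    """Return a copy of `canvas` with `instance` composited at (x, y), clipped to bounds."""
    out = [row[:] for row in canvas]
    for j, row in enumerate(instance):
        for i, px in enumerate(row):
            cy, cx = y + j, x + i
            if 0 <= cy < len(out) and 0 <= cx < len(out[0]):
                out[cy][cx] = over(px, out[cy][cx])
    return out


def compose(background: Layer, placed: Sequence[Tuple[Layer, int, int]]) -> Layer:
    """Composite instances back to front; each item is (instance, x, y)."""
    canvas = background
    for instance, x, y in placed:
        canvas = paste(canvas, instance, x, y)
    return canvas


if __name__ == "__main__":
    white = [[(1.0, 1.0, 1.0, 1.0)] * 2 for _ in range(2)]
    half_red = [[(1.0, 0.0, 0.0, 0.5)]]
    print(compose(white, [(half_red, 1, 0)])[0][1])  # (1.0, 0.5, 0.5, 1.0)
```

Because each instance carries its own alpha channel, changing the order of `placed` or an instance's `(x, y)` offset changes occlusion and position without touching the other layers, which is the kind of layout editing that a layered representation makes convenient.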
