FashionComposer: Compositional Fashion Image Generation

Abstract

We present FashionComposer for compositional fashion image generation. Unlike previous methods, FashionComposer is highly flexible. It takes multi-modal input (i.e., text prompt, parametric human model, garment image, and face image) and supports personalizing the appearance, pose, and figure of the human and assigning multiple garments in one pass. To achieve this, we first develop a universal framework capable of handling diverse input modalities. We then construct scaled training data to make the model's compositional capabilities more robust. To accommodate multiple reference images (garments and faces) seamlessly, we organize these references in a single image as an "asset library" and employ a reference UNet to extract appearance features. To inject the appearance features into the correct pixels of the generated result, we propose subject-binding attention, which binds the appearance features from different "assets" with the corresponding text features. In this way, the model can understand each asset according to its semantics, supporting arbitrary numbers and types of reference images. As a comprehensive solution, FashionComposer also supports many other applications, such as human album generation and diverse virtual try-on tasks.
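The "asset library" idea, placing several reference images on one canvas so that a single reference network can process them together, can be illustrated with a small sketch. The abstract does not specify the layout or the data format, so everything here is an assumption made for illustration: the side-by-side arrangement, the function name `build_asset_library`, the use of nested lists as stand-in grayscale images, and the returned per-asset boxes. It is not the authors' implementation.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # stand-in grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, height, width) on the canvas


def build_asset_library(
    assets: Dict[str, Image], pad_value: int = 255
) -> Tuple[Image, Dict[str, Box]]:
    """Place named reference images side by side on one canvas.

    Hypothetical helper: returns the canvas plus the box each asset
    occupies, so a later stage could pair each region with its text label.
    Assumes every image is rectangular and non-empty.
    """
    if not assets:
        raise ValueError("at least one reference image is required")

    height = max(len(img) for img in assets.values())
    canvas: Image = [[] for _ in range(height)]
    boxes: Dict[str, Box] = {}
    left = 0

    for name, img in assets.items():
        h, w = len(img), len(img[0])
        for row in range(height):
            # Copy the asset's row, or pad below assets shorter than the canvas.
            canvas[row].extend(img[row] if row < h else [pad_value] * w)
        boxes[name] = (0, left, h, w)
        left += w

    return canvas, boxes


if __name__ == "__main__":
    face = [[1, 1], [1, 1]]
    shirt = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
    canvas, boxes = build_asset_library({"face": face, "shirt": shirt})
    print(boxes)
```

Keeping a named box per asset is one plausible way to preserve the link between a region of the canvas and the text that describes it, which is the kind of correspondence subject-binding attention is said to exploit.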
