FashionComposer: Compositional Fashion Image Generation
December 18, 2024
Authors: Sihui Ji, Yiyang Wang, Xi Chen, Xiaogang Xu, Hao Luo, Hengshuang Zhao
cs.AI
Abstract
We present FashionComposer for compositional fashion image generation. Unlike
previous methods, FashionComposer is highly flexible. It takes multi-modal
input (i.e., text prompt, parametric human model, garment image, and face
image) and supports personalizing the appearance, pose, and figure of the human
and assigning multiple garments in one pass. To achieve this, we first develop
a universal framework capable of handling diverse input modalities. We
construct scaled training data to enhance the model's robust compositional
capabilities. To accommodate multiple reference images (garments and faces)
seamlessly, we organize these references in a single image as an "asset
library" and employ a reference UNet to extract appearance features. To inject
the appearance features into the correct pixels in the generated result, we
propose subject-binding attention. It binds the appearance features from
different "assets" to the corresponding text features. In this way, the model
can understand each asset according to its semantics, supporting arbitrary
numbers and types of reference images. As a comprehensive solution,
FashionComposer also supports many other applications such as human album
generation and diverse virtual try-on tasks.
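The binding step described above can be sketched roughly as follows. This is a minimal, hypothetical NumPy illustration, not the paper's implementation: each reference "asset" contributes appearance features that are bound to the text-token feature of its subject (here, simply by addition), and the generated-image features then cross-attend over all bound features at once. The function name, the additive binding, and all shapes are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def subject_binding_attention(gen_feats, asset_feats, text_feats):
    """Hypothetical sketch of subject-binding cross-attention.

    gen_feats:   (m, d) features of the image being generated (queries)
    asset_feats: list of (n_i, d) appearance features, one array per asset
    text_feats:  (num_assets, d) one text-token feature per asset
    """
    # Bind each asset's appearance features to its text feature
    # (additive binding is an assumption made for this sketch).
    bound = [a + t for a, t in zip(asset_feats, text_feats)]
    kv = np.concatenate(bound, axis=0)          # keys/values: (sum n_i, d)
    d = gen_feats.shape[-1]
    attn = softmax(gen_feats @ kv.T / np.sqrt(d))  # (m, sum n_i)
    return attn @ kv                             # (m, d) injected features
```

Because every asset is keyed by its own text feature, the same attention call handles any number and type of references, matching the abstract's claim that assets are understood "according to their semantics".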