Learning Flow Fields in Attention for Controllable Person Image Generation
December 11, 2024
Authors: Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He
cs.AI
Abstract
Controllable person image generation aims to generate a person image
conditioned on reference images, allowing precise control over the person's
appearance or pose. However, prior methods often distort fine-grained textural
details from the reference image, despite achieving high overall image quality.
We attribute these distortions to inadequate attention to corresponding regions
in the reference image. To address this, we propose Learning Flow Fields in
Attention (Leffa), which explicitly guides the target query to attend to the
correct reference key in the attention layer during training.
Specifically, it is realized via a regularization loss on top of the attention
map within a diffusion-based baseline. Our extensive experiments show that
Leffa achieves state-of-the-art performance in controlling appearance (virtual
try-on) and pose (pose transfer), significantly reducing fine-grained detail
distortion while maintaining high image quality. Additionally, we show that our
loss is model-agnostic and can be used to improve the performance of other
diffusion models.
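To make the idea above concrete, here is a minimal, hypothetical sketch of an attention-map regularization of this kind (not the authors' exact loss): an attention map between target queries and reference keys is converted into a flow field by taking the attention-weighted average of the reference spatial coordinates, and that predicted flow is pushed toward a ground-truth correspondence. The function name, shapes, and the MSE form are illustrative assumptions.

```python
import numpy as np

def attention_flow_loss(q, k, ref_coords, target_flow):
    """Hypothetical sketch of a flow-field regularization on an attention map.

    q:            (Nq, d) target query features
    k:            (Nk, d) reference key features
    ref_coords:   (Nk, 2) spatial coordinates of each reference key
    target_flow:  (Nq, 2) ground-truth correspondence for each query
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                 # (Nq, Nk) scaled dot-product scores
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over reference keys

    # Flow field: attention-weighted average of reference key coordinates,
    # i.e. where each target query "looks" in the reference image.
    flow = attn @ ref_coords                      # (Nq, 2)

    # Regularize the predicted flow toward the true correspondence.
    return float(np.mean((flow - target_flow) ** 2))
```

In this toy form, when the attention map concentrates on the correct reference key for each query, the induced flow matches the ground-truth correspondence and the loss vanishes; a diffuse or misplaced attention map is penalized, which is the mechanism the abstract describes for reducing fine-grained detail distortion.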