Learning Flow Fields in Attention for Controllable Person Image Generation

December 11, 2024
Authors: Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He
cs.AI

Abstract

Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
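The core idea in the abstract, a regularization loss on the attention map that pushes each target query to attend to its corresponding reference location, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `flow_field_loss`, the soft-argmax reading of the attention map as a predicted flow field, and the mean-squared-error form are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the reference-key axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def flow_field_loss(attn_logits, ref_coords, gt_flow):
    """Hypothetical attention-regularization loss (illustrative only).

    attn_logits: (Q, K) raw attention scores from target queries to
                 reference keys.
    ref_coords:  (K, 2) 2D spatial coordinates of each reference key.
    gt_flow:     (Q, 2) ground-truth reference coordinate that each
                 target query should attend to.
    """
    attn = softmax(attn_logits, axis=-1)   # (Q, K) attention map
    # Soft-argmax: attention-weighted average of reference coordinates
    # gives a predicted "flow field" from target to reference space.
    pred_flow = attn @ ref_coords          # (Q, 2)
    # Penalize attention that points away from the correct region.
    return float(np.mean((pred_flow - gt_flow) ** 2))

# Toy example: 4 reference keys on a 2x2 grid, identity correspondence.
ref_coords = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
gt_flow = ref_coords.copy()
sharp_logits = 50.0 * np.eye(4)        # each query attends to the right key
uniform_logits = np.zeros((4, 4))      # each query attends everywhere equally
print(flow_field_loss(sharp_logits, ref_coords, gt_flow))    # near 0
print(flow_field_loss(uniform_logits, ref_coords, gt_flow))  # larger
```

In this toy setup, sharp attention on the correct keys yields a near-zero loss, while uniform attention collapses every predicted flow to the grid centroid and is penalized; during training such a term would be added to the diffusion objective.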

