Learning Flow Fields in Attention for Controllable Person Image Generation

December 11, 2024
Authors: Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He
cs.AI

Abstract

Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to the corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, this is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
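The abstract only says that the regularization loss sits on top of the attention map, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes one plausible reading: the attention map is turned into a soft correspondence (the attention-weighted mean reference position for each target query), the reference is sampled at those positions, and the result is compared with the target. The 1D signals, the mean-squared-error comparison, and all function names here are hypothetical choices made for brevity.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(queries, keys):
    # Row i holds the attention of target query i over all reference keys
    # (scaled dot-product attention).
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(q * k for q, k in zip(qv, kv)) for kv in keys])
        for qv in queries
    ]

def expected_positions(attn, ref_positions):
    # Assumption: the "flow" is the attention-weighted mean reference position.
    return [sum(w * p for w, p in zip(row, ref_positions)) for row in attn]

def sample_linear(signal, pos):
    # Linearly interpolate a 1D signal at a fractional position (clamped).
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(signal) - 1)
    t = pos - lo
    return (1 - t) * signal[lo] + t * signal[hi]

def flow_regularization_loss(attn, reference, target):
    # Warp the reference with the attention-derived positions, then
    # penalize the mean squared difference from the target (hypothetical loss).
    positions = expected_positions(attn, [float(i) for i in range(len(reference))])
    warped = [sample_linear(reference, p) for p in positions]
    return sum((w - t) ** 2 for w, t in zip(warped, target)) / len(target)

def one_hot(index, size, value=10.0):
    return [value if i == index else 0.0 for i in range(size)]

# Toy setup: the target is the reference with its positions permuted.
reference = [0.0, 1.0, 2.0, 3.0]
perm = [1, 2, 3, 0]
target = [reference[j] for j in perm]
keys = [one_hot(j, 4) for j in range(4)]

# Queries that attend to the correct reference key versus the wrong one.
aligned_queries = [one_hot(j, 4) for j in perm]
misaligned_queries = [one_hot(i, 4) for i in range(4)]

aligned_loss = flow_regularization_loss(attention_map(aligned_queries, keys), reference, target)
misaligned_loss = flow_regularization_loss(attention_map(misaligned_queries, keys), reference, target)
print(aligned_loss, misaligned_loss)
```

In this toy setting the loss is close to zero when each target query attends to its corresponding reference key and grows when attention lands on the wrong key, which is the kind of training signal the abstract describes.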
