Learning Flow Fields in Attention for Controllable Person Image Generation

Abstract

Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to the corresponding regions in the reference image. To address this, we propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.
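The abstract only states that the method adds a regularization loss on top of the attention map, so the following is a minimal illustrative sketch of what such a loss could look like, not the authors' implementation. It assumes a single attention map between target queries and reference keys on the same H x W grid, converts it into a flow field by taking the attention-weighted expected reference coordinate for each target position, warps the reference image with that flow, and penalizes the squared difference to the target. All function names (`attention_to_flow`, `bilinear_sample`, `flow_regularization_loss`) are hypothetical, and NumPy stands in for a real training framework.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax over the reference-key axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention_to_flow(attn, h, w):
    # attn: (h*w, h*w), each row sums to 1 over reference positions.
    # Returns (h, w, 2): expected (y, x) reference coordinate per target pixel.
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(np.float64)
    return (attn @ coords).reshape(h, w, 2)


def bilinear_sample(img, flow):
    # img: (h, w, c); flow: (h, w, 2) holding (y, x) sampling coordinates.
    h, w, _ = img.shape
    y = np.clip(flow[..., 0], 0, h - 1)
    x = np.clip(flow[..., 1], 0, w - 1)
    y0 = np.floor(y).astype(int)
    x0 = np.floor(x).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (y - y0)[..., None]
    wx = (x - x0)[..., None]
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bottom = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def flow_regularization_loss(logits, reference, target):
    # Assumed form of the loss: warp the reference with the flow derived
    # from attention, then take the mean squared error against the target.
    h, w, _ = reference.shape
    attn = softmax(logits, axis=-1)
    flow = attention_to_flow(attn, h, w)
    warped = bilinear_sample(reference, flow)
    return float(np.mean((warped - target) ** 2))


# Toy check: when the target equals the reference, attention that points each
# query at its own position should give a lower loss than uniform attention.
h, w = 4, 4
reference = np.arange(h * w * 3, dtype=np.float64).reshape(h, w, 3)
sharp_logits = 50.0 * np.eye(h * w)
uniform_logits = np.zeros((h * w, h * w))
loss_sharp = flow_regularization_loss(sharp_logits, reference, reference)
loss_uniform = flow_regularization_loss(uniform_logits, reference, reference)
```

In this toy setup, attention that is concentrated on the correct reference position yields a much smaller loss than diffuse attention, which is the behavior the abstract describes the regularizer as encouraging. A real implementation would operate on batched, multi-head attention inside a diffusion model and use a differentiable framework.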
