DisPose: Disentangling Pose Guidance for Controllable Human Image Animation

Abstract

Controllable human image animation aims to generate videos from reference images using driving videos. Due to the limited control signals provided by sparse guidance (e.g., skeleton pose), recent works have attempted to introduce additional dense conditions (e.g., depth map) to ensure motion alignment. However, such strict dense guidance impairs the quality of the generated video when the body shape of the reference character differs significantly from that of the driving video. In this paper, we present DisPose to mine more generalizable and effective control signals without additional dense input, which disentangles the sparse skeleton pose in human image animation into motion field guidance and keypoint correspondence. Specifically, we generate a dense motion field from a sparse motion field and the reference image, which provides region-level dense guidance while maintaining the generalization of the sparse pose control. We also extract diffusion features corresponding to pose keypoints from the reference image, and then these point features are transferred to the target pose to provide distinct identity information. To seamlessly integrate into existing models, we propose a plug-and-play hybrid ControlNet that improves the quality and consistency of generated videos while freezing the existing model parameters. Extensive qualitative and quantitative experiments demonstrate the superiority of DisPose compared to current methods. Code: https://github.com/lihxxx/DisPose.
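The two control signals named in the abstract can be illustrated with a toy NumPy sketch. It rests on assumptions that do not come from the paper or its repository: the function names `sparse_motion`, `densify_motion`, and `transfer_point_features` are hypothetical, Gaussian-weighted averaging stands in for the reference-conditioned densification the abstract describes, and a plain array stands in for diffusion features. It only shows the shape of the idea: keypoint displacements become a dense field, and per-keypoint features move from reference locations to target locations.

```python
import numpy as np

def sparse_motion(ref_kpts, tgt_kpts):
    """Per-keypoint displacement (dx, dy) from the reference pose to the target pose."""
    return np.asarray(tgt_kpts, dtype=float) - np.asarray(ref_kpts, dtype=float)

def densify_motion(ref_kpts, flow, height, width, sigma=8.0):
    """Spread sparse keypoint displacements over the whole image grid.

    Assumption: Gaussian-weighted averaging is used purely for illustration;
    it is not the reference-conditioned propagation described in the abstract.
    """
    ref_kpts = np.asarray(ref_kpts, dtype=float)
    flow = np.asarray(flow, dtype=float)
    ys, xs = np.mgrid[0:height, 0:width]
    grid = np.stack([xs, ys], axis=-1).astype(float)           # (H, W, 2), (x, y) order
    # Squared distance from every pixel to every keypoint: (H, W, K)
    d2 = ((grid[:, :, None, :] - ref_kpts[None, None, :, :]) ** 2).sum(-1)
    weights = np.exp(-d2 / (2.0 * sigma ** 2))
    weighted = (weights[..., None] * flow[None, None]).sum(2)  # (H, W, 2)
    return weighted / (weights.sum(2, keepdims=True) + 1e-8)

def transfer_point_features(feat, ref_kpts, tgt_kpts):
    """Copy the feature vector at each reference keypoint to its target location."""
    height, width, _ = feat.shape
    out = np.zeros_like(feat)
    for (rx, ry), (tx, ty) in zip(ref_kpts, tgt_kpts):
        rx, ry, tx, ty = (int(round(v)) for v in (rx, ry, tx, ty))
        inside = 0 <= rx < width and 0 <= ry < height and 0 <= tx < width and 0 <= ty < height
        if inside:
            out[ty, tx] = feat[ry, rx]
    return out

# Toy usage: two keypoints given as (x, y) on a 16x16 grid.
ref = [(4, 4), (12, 10)]
tgt = [(6, 4), (12, 13)]
flow = sparse_motion(ref, tgt)
dense = densify_motion(ref, flow, 16, 16, sigma=2.0)
feat = np.random.default_rng(0).random((16, 16, 3))
moved = transfer_point_features(feat, ref, tgt)
```

In this sketch the dense field keeps the sparse pose as its only input, which mirrors the abstract's point that region-level guidance can be derived without an extra dense condition such as a depth map.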
