

X-Dancer: Expressive Music to Human Dance Video Generation

February 24, 2025
作者: Zeyuan Chen, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xin Chen, Chao Wang, Di Chang, Linjie Luo
cs.AI

Abstract

We present X-Dancer, a novel zero-shot music-driven image animation pipeline that creates diverse and long-range lifelike human dance videos from a single static image. At its core, we introduce a unified transformer-diffusion framework featuring an autoregressive transformer model that synthesizes extended, music-synchronized token sequences for 2D body, head, and hand poses, which then guide a diffusion model to produce coherent and realistic dance video frames. Unlike traditional methods that primarily generate human motion in 3D, X-Dancer addresses data limitations and enhances scalability by modeling a wide spectrum of 2D dance motions, capturing their nuanced alignment with musical beats through readily available monocular videos. To achieve this, we first build a spatially compositional token representation from 2D human pose labels associated with keypoint confidences, encoding both large articulated body movements (e.g., upper and lower body) and fine-grained motions (e.g., head and hands). We then design a music-to-motion transformer model that autoregressively generates music-aligned dance pose token sequences, incorporating global attention to both musical style and prior motion context. Finally, we leverage a diffusion backbone to animate the reference image with these synthesized pose tokens through AdaIN, forming a fully differentiable end-to-end framework. Experimental results demonstrate that X-Dancer produces both diverse and characterized dance videos, substantially outperforming state-of-the-art methods in terms of diversity, expressiveness, and realism. Code and models will be available for research purposes.
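The abstract mentions that synthesized pose tokens condition the diffusion backbone through AdaIN (adaptive instance normalization). As a point of reference, the standard AdaIN operation re-aligns the per-channel statistics of a content feature map to those of a conditioning signal. The sketch below is a minimal, framework-free illustration of that general operation; the array names, shapes, and the `adain` helper are illustrative assumptions, not taken from the X-Dancer implementation.

```python
import numpy as np

def adain(content: np.ndarray, style: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Shift `content` so its per-channel mean/std match those of `style`.

    content, style: arrays of shape (channels, features).
    """
    c_mean = content.mean(axis=-1, keepdims=True)
    c_std = content.std(axis=-1, keepdims=True)
    s_mean = style.mean(axis=-1, keepdims=True)
    s_std = style.std(axis=-1, keepdims=True)
    # Normalize the content channels, then rescale with the style statistics.
    return s_std * (content - c_mean) / (c_std + eps) + s_mean

# Hypothetical usage: diffusion features modulated by an embedded pose signal.
x = np.random.randn(4, 16)           # stand-in for intermediate diffusion features
y = np.random.randn(4, 16) * 2 + 3   # stand-in for embedded pose-token conditioning
out = adain(x, y)                    # out's channel stats now match y's
```

After the call, each channel of `out` carries the mean and standard deviation of the corresponding channel in `y`, which is how AdaIN injects a conditioning signal without learned per-example parameters.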

