FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
April 7, 2025
Authors: Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu
cs.AI
Abstract
Creating a realistic animatable avatar from a single static portrait remains
challenging. Existing approaches often struggle to capture subtle facial
expressions, the associated global body movements, and the dynamic background.
To address these limitations, we propose a novel framework that leverages a
pretrained video diffusion transformer model to generate high-fidelity,
coherent talking portraits with controllable motion dynamics. At the core of
our work is a dual-stage audio-visual alignment strategy. In the first stage,
we employ a clip-level training scheme to establish coherent global motion by
aligning audio-driven dynamics across the entire scene, including the reference
portrait, contextual objects, and background. In the second stage, we refine
lip movements at the frame level using a lip-tracing mask, ensuring precise
synchronization with audio signals. To preserve identity without compromising
motion flexibility, we replace the commonly used reference network with a
facial-focused cross-attention module that effectively maintains facial
consistency throughout the video. Furthermore, we integrate a motion intensity
modulation module that explicitly controls expression and body motion
intensity, enabling controllable manipulation of portrait movements beyond mere
lip motion. Extensive experimental results show that our proposed approach
achieves higher quality with better realism, coherence, motion intensity, and
identity preservation. Project page:
https://fantasy-amap.github.io/fantasy-talking/
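
The abstract does not say how the lip-tracing mask enters the frame-level stage. One plausible reading, shown here purely as an illustration and not as the authors' implementation, is a per-frame reconstruction error in which pixels inside the lip mask count more than the rest of the frame. The function name `masked_frame_loss`, the `lip_weight` parameter, and the nested-list frame representation are all assumptions made for this sketch.

```python
def masked_frame_loss(pred, target, lip_mask, lip_weight=2.0):
    """Weighted mean squared error over a single frame.

    pred, target: 2D lists of floats with the same shape (hypothetical frames).
    lip_mask:     2D list of 0/1 flags marking the lip region (assumed input).
    lip_weight:   hypothetical factor by which lip pixels are up-weighted.
    """
    total = 0.0
    weight_sum = 0.0
    for p_row, t_row, m_row in zip(pred, target, lip_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            w = lip_weight if m else 1.0  # lip pixels count more
            total += w * (p - t) ** 2
            weight_sum += w
    # Normalize by the total weight so the result stays a weighted average.
    return total / weight_sum


# Toy 2x2 frame with a single wrong pixel at the top-left corner.
pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 0.0]]

error_inside_lips = masked_frame_loss(pred, target, [[1, 0], [0, 0]])
error_outside_lips = masked_frame_loss(pred, target, [[0, 1], [0, 0]])
```

Under this weighting the same pixel error costs more when it falls inside the masked lip region, which is the kind of emphasis a frame-level lip refinement stage would need.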