FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
April 7, 2025
Authors: Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu
cs.AI
Abstract
Creating a realistic animatable avatar from a single static portrait remains
challenging. Existing approaches often struggle to capture subtle facial
expressions, the associated global body movements, and the dynamic background.
To address these limitations, we propose a novel framework that leverages a
pretrained video diffusion transformer model to generate high-fidelity,
coherent talking portraits with controllable motion dynamics. At the core of
our work is a dual-stage audio-visual alignment strategy. In the first stage,
we employ a clip-level training scheme to establish coherent global motion by
aligning audio-driven dynamics across the entire scene, including the reference
portrait, contextual objects, and background. In the second stage, we refine
lip movements at the frame level using a lip-tracing mask, ensuring precise
synchronization with audio signals. To preserve identity without compromising
motion flexibility, we replace the commonly used reference network with a
facial-focused cross-attention module that effectively maintains facial
consistency throughout the video. Furthermore, we integrate a motion intensity
modulation module that explicitly controls expression and body motion
intensity, enabling controllable manipulation of portrait movements beyond mere
lip motion. Extensive experimental results show that our proposed approach
achieves higher quality with better realism, coherence, motion intensity, and
identity preservation. Our project page:
https://fantasy-amap.github.io/fantasy-talking/.
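The abstract's second alignment stage refines lip motion at the frame level using a lip mask. The paper's exact objective is not given here; the following is a minimal, hypothetical sketch of one way such a mask could be used, namely as a spatial up-weighting of the diffusion reconstruction loss around the mouth region. The function name, tensor shapes, and weighting factor are assumptions, not the authors' implementation.

```python
import torch


def lip_weighted_diffusion_loss(pred_noise: torch.Tensor,
                                true_noise: torch.Tensor,
                                lip_mask: torch.Tensor,
                                lip_weight: float = 5.0) -> torch.Tensor:
    """Mean-squared diffusion loss with extra weight inside the lip-mask region."""
    # pred_noise, true_noise: (B, C, T, H, W) predicted / target noise in latent space
    # lip_mask: (B, 1, T, H, W) binary mouth-region mask, broadcast over channels
    weights = 1.0 + (lip_weight - 1.0) * lip_mask
    return (weights * (pred_noise - true_noise) ** 2).mean()
```

Under this reading, the clip-level first stage would use the unweighted loss over the whole scene, and the frame-level second stage would add the mask to sharpen audio-lip synchronization.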
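The abstract also states that a facial-focused cross-attention module replaces the commonly used reference network for identity preservation. As an illustration only, a block of this kind might inject embeddings of the reference face crop into the diffusion transformer's latent tokens via cross-attention with a residual connection; all module and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn


class FacialCrossAttention(nn.Module):
    """Sketch of a facial-focused cross-attention block (assumed design, not the paper's code)."""

    def __init__(self, hidden_dim: int = 1024, face_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project face-region embeddings (e.g. from a frozen image encoder) to the DiT width.
        self.face_proj = nn.Linear(face_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, face_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, hidden_dim) latent tokens inside the diffusion transformer
        # face_tokens:  (B, N_face, face_dim) embeddings of the reference face crop
        keys_values = self.face_proj(face_tokens)
        attended, _ = self.attn(self.norm(video_tokens), keys_values, keys_values)
        # Residual injection conditions appearance on the reference identity while
        # leaving the backbone free to model global motion.
        return video_tokens + attended
```

Compared with a full reference network, conditioning only on face-region features is consistent with the stated goal of keeping body and background motion flexible.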
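Finally, the motion intensity modulation module is described as explicitly controlling expression and body-motion intensity. One plausible realization, shown here purely as an assumed sketch, is FiLM-style conditioning: scalar intensities are mapped to per-channel scale and shift parameters applied to the transformer's hidden states. Names and the two-scalar interface are assumptions.

```python
import torch
import torch.nn as nn


class MotionIntensityModulation(nn.Module):
    """Sketch of scalar-intensity conditioning via per-channel scale and shift."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        # Two scalars: expression intensity and body-motion intensity.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),  # per-channel scale and shift
        )

    def forward(self,
                tokens: torch.Tensor,
                expr_intensity: torch.Tensor,
                body_intensity: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, hidden_dim); intensities: (B,) floats, e.g. in [0, 1]
        cond = torch.stack([expr_intensity, body_intensity], dim=-1)   # (B, 2)
        scale, shift = self.mlp(cond).chunk(2, dim=-1)                 # each (B, hidden_dim)
        return tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

At inference time, varying the two intensities independently would give the kind of control over expression and body movement, beyond lip motion alone, that the abstract claims.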