FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
April 7, 2025
Authors: Mengchao Wang, Qiang Wang, Fan Jiang, Yaqi Fan, Yunpeng Zhang, Yonggang Qi, Kun Zhao, Mu Xu
cs.AI
Abstract
Creating a realistic animatable avatar from a single static portrait remains
challenging. Existing approaches often struggle to capture subtle facial
expressions, the associated global body movements, and the dynamic background.
To address these limitations, we propose a novel framework that leverages a
pretrained video diffusion transformer model to generate high-fidelity,
coherent talking portraits with controllable motion dynamics. At the core of
our work is a dual-stage audio-visual alignment strategy. In the first stage,
we employ a clip-level training scheme to establish coherent global motion by
aligning audio-driven dynamics across the entire scene, including the reference
portrait, contextual objects, and background. In the second stage, we refine
lip movements at the frame level using a lip-tracing mask, ensuring precise
synchronization with audio signals. To preserve identity without compromising
motion flexibility, we replace the commonly used reference network with a
facial-focused cross-attention module that effectively maintains facial
consistency throughout the video. Furthermore, we integrate a motion intensity
modulation module that explicitly controls expression and body motion
intensity, enabling controllable manipulation of portrait movements beyond mere
lip motion. Extensive experimental results show that our proposed approach
achieves higher quality with better realism, coherence, motion intensity, and
identity preservation. Our project page:
https://fantasy-amap.github.io/fantasy-talking/.
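The abstract's second alignment stage refines lip motion at the frame level using a lip mask. The paper's exact objective is not given here; the following is a minimal, hypothetical sketch of one way such a mask could be used, namely as a spatial up-weighting of the diffusion reconstruction loss around the mouth region. The function name, tensor shapes, and weighting factor are assumptions, not the authors' implementation.

```python
import torch


def lip_weighted_diffusion_loss(pred_noise: torch.Tensor,
                                true_noise: torch.Tensor,
                                lip_mask: torch.Tensor,
                                lip_weight: float = 5.0) -> torch.Tensor:
    """Mean-squared diffusion loss with extra weight inside the lip-mask region."""
    # pred_noise, true_noise: (B, C, T, H, W) predicted / target noise in latent space
    # lip_mask: (B, 1, T, H, W) binary mouth-region mask, broadcast over channels
    weights = 1.0 + (lip_weight - 1.0) * lip_mask
    return (weights * (pred_noise - true_noise) ** 2).mean()
```

Under this reading, the clip-level first stage would use the unweighted loss over the whole scene, and the frame-level second stage would add the mask to sharpen audio-lip synchronization.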
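The abstract also states that a facial-focused cross-attention module replaces the commonly used reference network for identity preservation. As an illustration only, a block of this kind might inject embeddings of the reference face crop into the diffusion transformer's latent tokens via cross-attention with a residual connection; all module and parameter names below are hypothetical.

```python
import torch
import torch.nn as nn


class FacialCrossAttention(nn.Module):
    """Sketch of a facial-focused cross-attention block (assumed design, not the paper's code)."""

    def __init__(self, hidden_dim: int = 1024, face_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Project face-region embeddings (e.g. from a frozen image encoder) to the DiT width.
        self.face_proj = nn.Linear(face_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, video_tokens: torch.Tensor, face_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (B, N_video, hidden_dim) latent tokens inside the diffusion transformer
        # face_tokens:  (B, N_face, face_dim) embeddings of the reference face crop
        keys_values = self.face_proj(face_tokens)
        attended, _ = self.attn(self.norm(video_tokens), keys_values, keys_values)
        # Residual injection conditions appearance on the reference identity while
        # leaving the backbone free to model global motion.
        return video_tokens + attended
```

Compared with a full reference network, conditioning only on face-region features is consistent with the stated goal of keeping body and background motion flexible.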
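Finally, the motion intensity modulation module is described as explicitly controlling expression and body-motion intensity. One plausible realization, shown here purely as an assumed sketch, is FiLM-style conditioning: scalar intensities are mapped to per-channel scale and shift parameters applied to the transformer's hidden states. Names and the two-scalar interface are assumptions.

```python
import torch
import torch.nn as nn


class MotionIntensityModulation(nn.Module):
    """Sketch of scalar-intensity conditioning via per-channel scale and shift."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        # Two scalars: expression intensity and body-motion intensity.
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, 2 * hidden_dim),  # per-channel scale and shift
        )

    def forward(self,
                tokens: torch.Tensor,
                expr_intensity: torch.Tensor,
                body_intensity: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, hidden_dim); intensities: (B,) floats, e.g. in [0, 1]
        cond = torch.stack([expr_intensity, body_intensity], dim=-1)   # (B, 2)
        scale, shift = self.mlp(cond).chunk(2, dim=-1)                 # each (B, hidden_dim)
        return tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```

At inference time, varying the two intensities independently would give the kind of control over expression and body movement, beyond lip motion alone, that the abstract claims.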