MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
December 5, 2024
Authors: Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan
cs.AI
Abstract
Recent advances in video diffusion models have unlocked new potential for
realistic audio-driven talking video generation. However, achieving seamless
audio-lip synchronization, maintaining long-term identity consistency, and
producing natural, audio-aligned expressions in generated talking videos remain
significant challenges. To address these challenges, we propose Memory-guided
EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation
approach to generate identity-consistent and expressive talking videos. Our
approach is built around two key modules: (1) a memory-guided temporal module,
which enhances long-term identity consistency and motion smoothness by
developing memory states to store information from a longer past context to
guide temporal modeling via linear attention; and (2) an emotion-aware audio
module, which replaces traditional cross attention with multi-modal attention
to enhance audio-video interaction, while detecting emotions from audio to
refine facial expressions via emotion adaptive layer norm. Extensive
quantitative and qualitative results demonstrate that MEMO generates more
realistic talking videos across diverse image and audio types, outperforming
state-of-the-art methods in overall quality, audio-lip synchronization,
identity consistency, and expression-emotion alignment.
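The abstract names two mechanisms, a memory state read through linear attention and an emotion adaptive layer norm, without giving their equations. As a rough illustration only, the NumPy sketch below shows generic textbook forms of both ideas. Every name in it (`LinearAttentionMemory`, `emotion_adaptive_layer_norm`, `w_scale`, `w_shift`) is hypothetical and not taken from the MEMO paper or its code; the `elu(x) + 1` feature map and the `1 + scale` modulation are assumptions borrowed from common practice, not the authors' confirmed design.

```python
import numpy as np


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumption; a common choice for linear attention)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


class LinearAttentionMemory:
    """Hypothetical fixed-size memory summarizing past frames for linear attention."""

    def __init__(self, dim_k, dim_v):
        self.kv = np.zeros((dim_k, dim_v))  # running sum of phi(k) v^T
        self.k_sum = np.zeros(dim_k)        # running sum of phi(k)

    def update(self, keys, values):
        """Fold a finished clip's keys and values into the memory state."""
        phi_k = feature_map(keys)
        self.kv += phi_k.T @ values
        self.k_sum += phi_k.sum(axis=0)

    def attend(self, queries, keys, values):
        """Attend over the current clip plus everything stored in memory."""
        phi_q = feature_map(queries)
        phi_k = feature_map(keys)
        kv = self.kv + phi_k.T @ values
        k_sum = self.k_sum + phi_k.sum(axis=0)
        numerator = phi_q @ kv        # shape (num_queries, dim_v)
        denominator = phi_q @ k_sum   # shape (num_queries,), always positive
        return numerator / denominator[:, None]


def emotion_adaptive_layer_norm(x, emotion_emb, w_scale, w_shift, eps=1e-5):
    """Layer norm whose scale and shift are predicted from an emotion embedding.

    w_scale and w_shift are hypothetical projection matrices of shape
    (emotion_dim, feature_dim); a real model would learn them.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    scale = emotion_emb @ w_scale
    shift = emotion_emb @ w_shift
    return x_hat * (1.0 + scale) + shift


# Usage: process two clips in sequence, carrying the memory forward.
rng = np.random.default_rng(0)
memory = LinearAttentionMemory(dim_k=4, dim_v=3)
clip1_k, clip1_v = rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
memory.update(clip1_k, clip1_v)
clip2_q = rng.normal(size=(2, 4))
clip2_k, clip2_v = rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
context = memory.attend(clip2_q, clip2_k, clip2_v)  # shape (2, 3)
```

The point of the memory form is that its cost does not grow with the length of the past context: the stored state has a fixed size of `dim_k × dim_v`, yet attending with it gives the same result as linear attention over all past and current frames concatenated.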