
MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

December 5, 2024
Authors: Longtao Zheng, Yifan Zhang, Hanzhong Guo, Jiachun Pan, Zhenxiong Tan, Jiahao Lu, Chuanxin Tang, Bo An, Shuicheng Yan
cs.AI

Abstract

Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
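To make the second module more concrete, below is a minimal sketch of how an emotion-adaptive layer norm could condition video features on an emotion embedding detected from audio: the normalized hidden states are modulated by a scale and shift predicted from that embedding. This is an illustrative assumption about the mechanism, not the authors' implementation; the class name, tensor shapes, and dimensions are hypothetical.

```python
# Illustrative sketch only: a possible form of emotion-adaptive layer norm,
# where an emotion embedding (e.g., detected from audio) predicts the scale
# and shift applied to normalized hidden states. Not MEMO's actual code.
import torch
import torch.nn as nn


class EmotionAdaptiveLayerNorm(nn.Module):
    def __init__(self, hidden_dim: int, emotion_dim: int):
        super().__init__()
        # LayerNorm without learnable affine; the affine parameters are
        # instead produced per sample from the emotion embedding.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, emotion_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_dim); emotion_emb: (batch, emotion_dim)
        scale, shift = self.to_scale_shift(emotion_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


# Example usage with dummy tensors (shapes are placeholders).
layer = EmotionAdaptiveLayerNorm(hidden_dim=320, emotion_dim=64)
hidden = torch.randn(2, 77, 320)   # video latent tokens
emotion = torch.randn(2, 64)       # emotion embedding from the audio module
out = layer(hidden, emotion)       # (2, 77, 320)
```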
