DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking Head Video Generation

Abstract

Talking head generation aims to produce vivid and realistic talking head videos from a single portrait and a speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited use of context beyond the current generation step, error accumulation, and slow generation speed. To address these challenges, we present DAWN (Dynamic frame Avatar With Non-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions and natural pose/blink movements. In addition to its high generation speed, DAWN has strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available at https://github.com/Hanbo-Cheng/DAWN-pytorch.

