FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
December 2, 2024
Authors: Taekyung Ki, Dongchan Min, Gyoungsu Chae
cs.AI
Abstract
With the rapid advancement of diffusion-based generative models, portrait
image animation has achieved remarkable results. However, it still faces
challenges in temporally consistent video generation and fast sampling due to
its iterative sampling nature. This paper presents FLOAT, an audio-driven
talking portrait video generation method based on a flow matching generative
model. We shift the generative modeling from the pixel-based latent space to a
learned motion latent space, enabling efficient design of temporally consistent
motion. To achieve this, we introduce a transformer-based vector field
predictor with a simple yet effective frame-wise conditioning mechanism.
Additionally, our method supports speech-driven emotion enhancement, enabling a
natural incorporation of expressive motions. Extensive experiments demonstrate
that our method outperforms state-of-the-art audio-driven talking portrait
methods in terms of visual quality, motion fidelity, and efficiency.
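To make the abstract's core idea concrete: flow matching learns a vector field whose ODE transports noise to samples, here in a motion latent space conditioned frame-wise on audio. Below is a minimal, hypothetical sketch of the sampling step only. The shapes, the stand-in linear "predictor" (a placeholder for the paper's transformer), and the plain Euler integrator are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
NUM_FRAMES, MOTION_DIM, AUDIO_DIM = 8, 16, 4

rng = np.random.default_rng(0)
# Stand-in for the transformer-based vector field predictor v_theta(x_t, t, audio):
# a fixed random linear map, NOT the paper's model.
W = rng.normal(scale=0.1, size=(MOTION_DIM + AUDIO_DIM + 1, MOTION_DIM))

def vector_field(x, t, audio):
    """Predict dx/dt for each frame, with simple frame-wise conditioning:
    each frame's latent is concatenated with its own audio feature and t."""
    t_col = np.full((NUM_FRAMES, 1), t)
    inp = np.concatenate([x, audio, t_col], axis=1)
    return inp @ W

def sample_motion(audio, steps=10):
    """Integrate the ODE from noise (t=0) toward motion latents (t=1)
    with a fixed-step Euler scheme -- far fewer steps than iterative
    diffusion sampling typically needs."""
    x = rng.normal(size=(NUM_FRAMES, MOTION_DIM))  # initial noise
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * vector_field(x, i * dt, audio)  # Euler update
    return x

audio_feats = rng.normal(size=(NUM_FRAMES, AUDIO_DIM))
motion = sample_motion(audio_feats)
print(motion.shape)  # (8, 16): one motion latent per frame
```

The generated motion latents would then be decoded to video frames by a separate decoder; operating in this low-dimensional motion space, rather than a pixel-based latent space, is what the abstract credits for efficient, temporally consistent generation.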