VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
December 15, 2024
Authors: Hao Shao, Shulun Wang, Yang Zhou, Guanglu Song, Dailan He, Shuo Qin, Zhuofan Zong, Bingqi Ma, Yu Liu, Hongsheng Li
cs.AI
Abstract
Video face swapping is becoming increasingly popular across various
applications, yet existing methods primarily focus on static images and
struggle with video face swapping, where temporal consistency and complex
scenarios pose additional difficulties. In this paper, we present the first diffusion-based framework
specifically designed for video face swapping. Our approach introduces a novel
image-video hybrid training framework that leverages both abundant static image
data and temporal video sequences, addressing the inherent limitations of
video-only training. The framework incorporates a specially designed diffusion
model coupled with a VidFaceVAE that effectively processes both types of data
to better maintain temporal coherence of the generated videos. To further
disentangle identity and pose features, we construct the Attribute-Identity
Disentanglement Triplet (AIDT) Dataset, where each triplet has three face
images, with two images sharing the same pose and two sharing the same
identity. Enhanced with a comprehensive occlusion augmentation, this dataset
also improves robustness against occlusions. Additionally, we integrate 3D
reconstruction techniques as input conditioning to our network for handling
large pose variations. Extensive experiments demonstrate that our framework
achieves superior performance in identity preservation, temporal consistency,
and visual quality compared to existing methods, while requiring fewer
inference steps. Our approach effectively mitigates key challenges in video
face swapping, including temporal flickering, identity preservation, and
robustness to occlusions and pose variations.