VividFace: A Diffusion-Based Hybrid Framework for High-Fidelity Video Face Swapping
December 15, 2024
Authors: Hao Shao, Shulun Wang, Yang Zhou, Guanglu Song, Dailan He, Shuo Qin, Zhuofan Zong, Bingqi Ma, Yu Liu, Hongsheng Li
cs.AI
Abstract
Video face swapping is becoming increasingly popular across various
applications, yet existing methods primarily focus on static images and
struggle with video face swapping because of the added demands of temporal
consistency and complex scenarios. In this paper, we present the first diffusion-based framework
specifically designed for video face swapping. Our approach introduces a novel
image-video hybrid training framework that leverages both abundant static image
data and temporal video sequences, addressing the inherent limitations of
video-only training. The framework incorporates a specially designed diffusion
model coupled with a VidFaceVAE that effectively processes both types of data
to better maintain temporal coherence of the generated videos. To further
disentangle identity and pose features, we construct the Attribute-Identity
Disentanglement Triplet (AIDT) Dataset, where each triplet has three face
images, with two images sharing the same pose and two sharing the same
identity. Enhanced with a comprehensive occlusion augmentation, this dataset
also improves robustness against occlusions. Additionally, we integrate 3D
reconstruction techniques as input conditioning to our network for handling
large pose variations. Extensive experiments demonstrate that our framework
achieves superior performance in identity preservation, temporal consistency,
and visual quality compared to existing methods, while requiring fewer
inference steps. Our approach effectively mitigates key challenges in video
face swapping, including temporal flickering, identity preservation, and
robustness to occlusions and pose variations.
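
The triplet structure described for the AIDT dataset can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names, field names, and the `is_valid_triplet` helper are hypothetical, and the sketch assumes that one image (called `anchor` here) is the one that shares its identity with the second image and its pose with the third, which the abstract does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceSample:
    """Hypothetical record for one face image (fields are illustrative)."""
    image_path: str
    identity_id: str  # label for who the person is
    pose_id: str      # label for the head pose / attribute configuration


@dataclass(frozen=True)
class Triplet:
    """Three face images; `anchor` is assumed to be the shared image."""
    anchor: FaceSample
    same_identity: FaceSample  # same person as anchor, different pose
    same_pose: FaceSample      # same pose as anchor, different person


def is_valid_triplet(t: Triplet) -> bool:
    """Check the pairing constraints assumed in this sketch."""
    shares_identity = t.anchor.identity_id == t.same_identity.identity_id
    shares_pose = t.anchor.pose_id == t.same_pose.pose_id
    # Assumption: the remaining attribute differs, so that identity and
    # pose can be told apart within the triplet.
    differs_pose = t.anchor.pose_id != t.same_identity.pose_id
    differs_identity = t.anchor.identity_id != t.same_pose.identity_id
    return shares_identity and shares_pose and differs_pose and differs_identity
```

With such a layout, a training loop could draw identity information from one pair and pose information from the other, which is the kind of separation the abstract attributes to the dataset.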