FantasyID: Face Knowledge Enhanced ID-Preserving Video Generation

February 19, 2025
Authors: Yunpeng Zhang, Qiang Wang, Fan Jiang, Yaqi Fan, Mu Xu, Yonggang Qi
cs.AI

Abstract

Tuning-free approaches that adapt large-scale pre-trained video diffusion models for identity-preserving text-to-video generation (IPT2V) have gained popularity recently due to their efficacy and scalability. However, achieving satisfactory facial dynamics while keeping the identity unchanged remains a significant challenge. In this work, we present a novel tuning-free IPT2V framework, dubbed FantasyID, that enhances the face knowledge of a pre-trained video model built on diffusion transformers (DiT). Essentially, a 3D facial geometry prior is incorporated to ensure plausible facial structures during video synthesis. To prevent the model from learning copy-paste shortcuts that simply replicate the reference face across frames, a multi-view face augmentation strategy is devised to capture diverse 2D facial appearance features, thereby increasing the dynamism of facial expressions and head poses. Additionally, after blending the 2D and 3D features as guidance, instead of naively employing cross-attention to inject guidance cues into the DiT layers, a learnable layer-aware adaptive mechanism selectively injects the fused features into each individual DiT layer, facilitating balanced modeling of identity preservation and motion dynamics. Experimental results validate our model's superiority over current tuning-free IPT2V methods.
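The layer-aware adaptive injection described above can be illustrated with a minimal sketch: each DiT layer gets its own learnable scalar gate that scales how strongly the fused 2D+3D face guidance (attended to via cross-attention) is added to that layer's hidden states. The class, function names, and single-head attention below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(h, ctx, Wq, Wk, Wv):
    # Single-head cross-attention: video tokens h attend to fused face features ctx.
    q, k, v = h @ Wq, ctx @ Wk, ctx @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v

class LayerAwareInjection:
    """Hypothetical sketch: one learnable gate per DiT layer decides how
    strongly the fused 2D+3D face guidance is injected into that layer."""
    def __init__(self, n_layers, dim, rng):
        self.gates = np.zeros(n_layers)  # learnable scalars, one per layer
        self.Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)

    def inject(self, h, fused, layer_idx):
        # Sigmoid keeps the gate in (0, 1); at init (gate = 0) the
        # guidance is injected at half strength.
        gate = 1.0 / (1.0 + np.exp(-self.gates[layer_idx]))
        return h + gate * cross_attention(h, fused, self.Wq, self.Wk, self.Wv)

rng = np.random.default_rng(0)
inj = LayerAwareInjection(n_layers=4, dim=8, rng=rng)
h = rng.standard_normal((16, 8))     # 16 video tokens, hidden dim 8
fused = rng.standard_normal((4, 8))  # 4 fused 2D+3D face-guidance tokens
out = inj.inject(h, fused, layer_idx=0)
print(out.shape)  # (16, 8)
```

Because the gates are trained jointly with the model, layers whose features matter more for identity can learn larger gates, while layers responsible for motion can learn to suppress the guidance, which is one plausible way to realize the balance between identity preservation and motion dynamics described in the abstract.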
