
Identity-Preserving Text-to-Video Generation by Frequency Decomposition

November 26, 2024
Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan
cs.AI

Abstract

Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) a tuning-free pipeline without tedious case-by-case finetuning, and (2) a frequency-aware heuristic identity-preserving DiT-based control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model that keeps human identity consistent in the generated video. Inspired by prior findings in frequency analysis of diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features and high-frequency intrinsic features. First, from a low-frequency perspective, we introduce a global facial extractor, which encodes reference images and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate the training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into transformer blocks, enhancing the model's ability to preserve fine-grained features. We further propose a hierarchical training strategy that leverages frequency information for identity preservation, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our ConsisID generates high-quality, identity-preserving videos, making strides towards more effective IPT2V.
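The low-/high-frequency split at the heart of the abstract can be pictured with a minimal sketch. The following is not the authors' implementation: the Gaussian low-pass filter, the `frequency_decompose` helper, and the kernel settings are illustrative assumptions, chosen as one simple way to separate a face image into a low-frequency component (global structure) and a high-frequency residual (fine-grained identity detail).

```python
# A minimal sketch (not the authors' code) of the frequency decomposition
# described in the abstract: split a face image into a low-frequency part
# carrying global features and a high-frequency residual carrying intrinsic
# detail. Gaussian low-pass filtering is an assumed, illustrative choice.
import torch
import torch.nn.functional as F

def gaussian_kernel(size: int = 21, sigma: float = 5.0) -> torch.Tensor:
    """Build a normalized 2D Gaussian kernel for low-pass filtering."""
    coords = torch.arange(size, dtype=torch.float32) - size // 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g)                      # shape: (size, size)

def frequency_decompose(face: torch.Tensor, size: int = 21, sigma: float = 5.0):
    """Split a (B, C, H, W) face tensor into low- and high-frequency parts.

    low  : Gaussian-blurred image -> global structure (pose, coarse layout)
    high : residual (face - low)  -> high-frequency identity detail
    """
    c = face.shape[1]
    kernel = gaussian_kernel(size, sigma).to(face)
    kernel = kernel.expand(c, 1, size, size)      # depthwise filter weights
    low = F.conv2d(face, kernel, padding=size // 2, groups=c)
    return low, face - low

# Example: decompose a dummy 224x224 RGB face crop.
low, high = frequency_decompose(torch.randn(1, 3, 224, 224))
print(low.shape, high.shape)                      # both (1, 3, 224, 224)
```

Per the abstract, ConsisID then routes these two streams differently: the low-frequency signal (reference image plus facial key points, via the global facial extractor) conditions the shallow layers of the DiT, while the high-frequency signal (via the local facial extractor) is injected into the transformer blocks. The sketch above only illustrates the signal split itself, not that conditioning.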
