Identity-Preserving Text-to-Video Generation by Frequency Decomposition
November 26, 2024
Authors: Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, Li Yuan
cs.AI
Abstract
Identity-preserving text-to-video (IPT2V) generation aims to create
high-fidelity videos with consistent human identity. It is an important task in
video generation but remains an open problem for generative models. This paper
pushes the technical frontier of IPT2V in two directions that have not been
resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case
finetuning, and (2) A frequency-aware heuristic identity-preserving DiT-based
control scheme. We propose ConsisID, a tuning-free DiT-based controllable IPT2V
model to keep human identity consistent in the generated video. Inspired by
prior findings in frequency analysis of diffusion transformers, it employs
identity-control signals in the frequency domain, where facial features can be
decomposed into low-frequency global features and high-frequency intrinsic
features. First, from a low-frequency perspective, we introduce a global facial
extractor, which encodes reference images and facial key points into a latent
space, generating features enriched with low-frequency information. These
features are then integrated into shallow layers of the network to alleviate
training challenges associated with DiT. Second, from a high-frequency
perspective, we design a local facial extractor to capture high-frequency
details and inject them into transformer blocks, enhancing the model's ability
to preserve fine-grained features. We propose a hierarchical training strategy
to leverage frequency information for identity preservation, transforming a
vanilla pre-trained video generation model into an IPT2V model. Extensive
experiments demonstrate that our frequency-aware heuristic scheme provides an
optimal control solution for DiT-based models. Thanks to this scheme, our
ConsisID generates high-quality, identity-preserving videos, making strides
towards more effective IPT2V.
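The frequency-decomposition intuition behind the two extractors can be illustrated with a simple low-/high-pass split of a face image. The sketch below is a minimal illustration under assumed choices, not the paper's implementation: it uses a Gaussian blur as the low-pass filter and the residual as the high-frequency component, and all function names and filter parameters are assumptions for exposition.

```python
# Minimal sketch of the low-/high-frequency split motivating ConsisID's
# global (low-frequency) and local (high-frequency) facial extractors.
# Illustrative only, NOT the paper's implementation: the Gaussian low-pass
# filter, kernel size, and function names are assumptions.
import torch
import torch.nn.functional as F

def gaussian_kernel2d(size: int = 15, sigma: float = 3.0) -> torch.Tensor:
    """Build a normalized 2D Gaussian kernel for low-pass filtering."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    return kernel / kernel.sum()

def frequency_decompose(face: torch.Tensor, size: int = 15, sigma: float = 3.0):
    """Split a face image (B, C, H, W) into low- and high-frequency parts.

    Low frequency ~ global identity cues (overall shape, proportions);
    high frequency ~ fine-grained intrinsic details (edges, texture).
    """
    kernel = gaussian_kernel2d(size, sigma).to(face)
    kernel = kernel.expand(face.shape[1], 1, size, size)  # depthwise filter
    low = F.conv2d(face, kernel, padding=size // 2, groups=face.shape[1])
    high = face - low  # residual carries the high-frequency detail
    return low, high

# Usage: decompose a dummy reference face. In ConsisID, the low-frequency
# stream (with facial keypoints) feeds shallow layers, while high-frequency
# features are injected into transformer blocks.
face = torch.rand(1, 3, 224, 224)
low_freq, high_freq = frequency_decompose(face)
print(low_freq.shape, high_freq.shape)
```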
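How identity features might be injected into a transformer block can likewise be sketched as an added cross-attention path over identity tokens. This is a generic injection pattern, not ConsisID's exact architecture: the module name, dimensions, and placement are assumptions.

```python
# Generic sketch of injecting identity tokens into a DiT-style block via
# cross-attention. A schematic stand-in, NOT ConsisID's actual block design.
import torch
import torch.nn as nn

class IdentityInjectedBlock(nn.Module):
    """Transformer block with an extra cross-attention path for identity
    features (hypothetical module, for illustration only)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.id_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, id_tokens: torch.Tensor) -> torch.Tensor:
        # Standard self-attention over video tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        # Cross-attend to high-frequency identity tokens (the injection step).
        h = self.norm2(x)
        x = x + self.id_attn(h, id_tokens, id_tokens)[0]
        return x + self.mlp(self.norm3(x))

# Usage with dummy video tokens and identity tokens.
block = IdentityInjectedBlock()
video_tokens = torch.rand(2, 1024, 512)  # (batch, tokens, dim)
id_tokens = torch.rand(2, 16, 512)       # (batch, identity tokens, dim)
out = block(video_tokens, id_tokens)
print(out.shape)  # torch.Size([2, 1024, 512])
```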