EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion

Abstract

Recent advancements in video generation have significantly impacted various downstream applications, particularly identity-preserving video generation (IPT2V). However, existing methods struggle with "copy-paste" artifacts and low similarity, primarily due to their reliance on low-level facial image information. This dependence can result in rigid facial appearances and artifacts that reflect irrelevant details. To address these challenges, we propose EchoVideo, which employs two key strategies: (1) an Identity Image-Text Fusion Module (IITF) that integrates high-level semantic features from text, capturing clean facial identity representations while discarding occlusions, poses, and lighting variations to avoid introducing artifacts; (2) a two-stage training strategy whose second stage uses a stochastic method to randomly utilize shallow facial information. The objective is to retain the fidelity gains that shallow features provide while mitigating excessive reliance on them. This strategy encourages the model to utilize high-level features during training, ultimately fostering a more robust representation of facial identities. EchoVideo effectively preserves facial identities and maintains full-body integrity. Extensive experiments demonstrate that it achieves excellent results in generating videos with high quality, controllability, and fidelity.
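The stochastic use of shallow facial information in the second training stage can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function name select_identity_conditions, the dictionary layout, the placeholder feature values, and the default probability of 0.5 are hypothetical and are not taken from EchoVideo itself. The sketch only shows the general idea of always keeping the high-level identity feature while passing the shallow feature on a random subset of training steps.

```python
import random


def select_identity_conditions(high_level_feat, shallow_feat,
                               use_shallow_prob=0.5, rng=random):
    """Pick the identity conditions for one training step (illustrative only).

    The high-level identity feature is always kept. The shallow facial
    feature is kept with probability `use_shallow_prob` (a hypothetical
    value, not one reported in the abstract) and dropped otherwise.
    """
    if rng.random() < use_shallow_prob:
        return {"high_level": high_level_feat, "shallow": shallow_feat}
    # Shallow feature withheld: the model must rely on high-level features.
    return {"high_level": high_level_feat, "shallow": None}


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator for a repeatable demonstration
    steps = [select_identity_conditions("identity_embedding", "face_crop", 0.5, rng)
             for _ in range(10)]
    kept = sum(step["shallow"] is not None for step in steps)
    print(f"shallow feature used in {kept} of {len(steps)} steps")
```

Under these assumptions, steps where the shallow feature is withheld force the model to reconstruct identity from the high-level representation alone, which is the kind of pressure the abstract describes as reducing over-reliance on shallow features.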
