Ingredients: Blending Custom Photos with Video Diffusion Transformers
January 3, 2025
Authors: Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
cs.AI
Abstract
This paper presents a powerful framework, referred to as Ingredients, for customizing video creation by incorporating multiple specific identity (ID) photos into video diffusion Transformers. Overall, our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image queries in video diffusion Transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to their corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing approaches, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients.
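The abstract does not spell out how the ID router works, but its core idea (assigning each space-time latent token to one identity embedding) can be illustrated with a minimal sketch. The PyTorch code below is an assumption-based illustration, not the authors' implementation: the class and parameter names (IDRouter, to_query, to_key) are hypothetical, and a hard argmax assignment is used purely for clarity; a trained router would likely use a soft or straight-through assignment to remain differentiable. See the linked repository for the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IDRouter(nn.Module):
    """Hypothetical sketch: route each space-time token to one ID embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_query = nn.Linear(dim, dim)  # video tokens -> routing queries
        self.to_key = nn.Linear(dim, dim)    # ID embeddings -> routing keys

    def forward(self, tokens: torch.Tensor, id_embeds: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, L, D) flattened space-time latent tokens
        # id_embeds: (B, N, D) one embedding per reference identity
        q = self.to_query(tokens)                 # (B, L, D)
        k = self.to_key(id_embeds)                # (B, N, D)
        logits = q @ k.transpose(-1, -2)          # (B, L, N) token-to-ID affinity
        # Hard assignment for illustration only; not differentiable as written.
        route = F.one_hot(logits.argmax(dim=-1), num_classes=id_embeds.size(1)).float()
        routed = route @ id_embeds                # (B, L, D): selected ID per token
        return tokens + routed                    # inject identity into its region

# Toy usage: two identities, 16 latent tokens, 64-dim features.
router = IDRouter(dim=64)
out = router(torch.randn(1, 16, 64), torch.randn(1, 2, 64))
print(out.shape)  # torch.Size([1, 16, 64])
```

Tying each token to a single identity keeps embeddings from blending across faces in multi-person scenes, which is the property the abstract attributes to the router.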