Ingredients: Blending Custom Photos with Video Diffusion Transformers

January 3, 2025
Authors: Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
cs.AI

Abstract

This paper presents a powerful framework, referred to as Ingredients, for customizing video creation by incorporating multiple specific identity (ID) photos into video diffusion Transformers. Our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of the image queries in video diffusion Transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic, personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing approaches, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients.
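To make the three-module design concrete, below is a minimal PyTorch sketch of how a facial extractor, multi-scale projector, and ID router could fit together. All class names, dimensions, and internals are illustrative assumptions for exposition, not the paper's implementation; see the linked repository for the actual code.

```python
# Illustrative sketch of the three-module pipeline described in the abstract.
# Every dimension and design detail here is an assumption, not the real model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FacialExtractor(nn.Module):
    """Fuses global and local facial features into one embedding per ID."""
    def __init__(self, global_dim=512, local_dim=512, out_dim=768):
        super().__init__()
        self.fuse = nn.Linear(global_dim + local_dim, out_dim)

    def forward(self, global_feat, local_feat):
        # global_feat: (num_ids, global_dim), e.g. from a CLIP-style encoder
        # local_feat:  (num_ids, local_dim),  e.g. from a face-recognition model
        return self.fuse(torch.cat([global_feat, local_feat], dim=-1))


class MultiScaleProjector(nn.Module):
    """Projects each face embedding into context tokens at several scales."""
    def __init__(self, in_dim=768, ctx_dim=1024, tokens_per_scale=(1, 4, 16)):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(in_dim, n * ctx_dim) for n in tokens_per_scale
        )
        self.ctx_dim = ctx_dim

    def forward(self, id_emb):
        # id_emb: (num_ids, in_dim) -> (num_ids, total_tokens, ctx_dim)
        toks = [h(id_emb).view(id_emb.size(0), -1, self.ctx_dim) for h in self.heads]
        return torch.cat(toks, dim=1)


class IDRouter(nn.Module):
    """Soft-assigns each space-time latent token to one of the ID embeddings."""
    def __init__(self, ctx_dim=1024):
        super().__init__()
        self.q = nn.Linear(ctx_dim, ctx_dim)
        self.k = nn.Linear(ctx_dim, ctx_dim)

    def forward(self, latent_tokens, id_tokens):
        # latent_tokens: (batch, num_tokens, ctx_dim)  video DiT latents
        # id_tokens:     (num_ids, ctx_dim)            pooled per-ID context
        logits = self.q(latent_tokens) @ self.k(id_tokens).T  # (b, t, num_ids)
        routes = F.softmax(logits, dim=-1)  # per-token ID assignment weights
        # Mix each ID's context into the latent tokens it was routed to.
        return latent_tokens + routes @ id_tokens


# Toy forward pass with two identities and random features.
extractor, projector, router = FacialExtractor(), MultiScaleProjector(), IDRouter()
ids = extractor(torch.randn(2, 512), torch.randn(2, 512))      # (2, 768)
ctx = projector(ids)                                           # (2, 21, 1024)
out = router(torch.randn(1, 128, 1024), ctx.mean(dim=1))       # (1, 128, 1024)
print(out.shape)
```

In this reading, routing amounts to a softmax attention over the ID tokens, so each space-time region of the video latent ends up dominated by one identity's features rather than an indiscriminate blend of all of them.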
