Ingredients: Blending Custom Photos with Video Diffusion Transformers

January 3, 2025
Authors: Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
cs.AI

Abstract

This paper presents a powerful framework, referred to as Ingredients, for customizing video creation by incorporating multiple specific identity (ID) photos into video diffusion Transformers. Our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of the image queries in video diffusion Transformers; (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic, personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing approaches, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients.
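To make the three-module design concrete, below is a minimal PyTorch sketch of how a facial extractor, multi-scale projector, and ID router could fit together. All class names, dimensions, and internals are illustrative assumptions for exposition, not the paper's implementation; see the linked repository for the actual code.

```python
# Illustrative sketch of the three-module pipeline described in the abstract.
# Every dimension and design detail here is an assumption, not the real model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FacialExtractor(nn.Module):
    """Fuses global and local facial features into one embedding per ID."""
    def __init__(self, global_dim=512, local_dim=512, out_dim=768):
        super().__init__()
        self.fuse = nn.Linear(global_dim + local_dim, out_dim)

    def forward(self, global_feat, local_feat):
        # global_feat: (num_ids, global_dim), e.g. from a CLIP-style encoder
        # local_feat:  (num_ids, local_dim),  e.g. from a face-recognition model
        return self.fuse(torch.cat([global_feat, local_feat], dim=-1))


class MultiScaleProjector(nn.Module):
    """Projects each face embedding into context tokens at several scales."""
    def __init__(self, in_dim=768, ctx_dim=1024, tokens_per_scale=(1, 4, 16)):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(in_dim, n * ctx_dim) for n in tokens_per_scale
        )
        self.ctx_dim = ctx_dim

    def forward(self, id_emb):
        # id_emb: (num_ids, in_dim) -> (num_ids, total_tokens, ctx_dim)
        toks = [h(id_emb).view(id_emb.size(0), -1, self.ctx_dim) for h in self.heads]
        return torch.cat(toks, dim=1)


class IDRouter(nn.Module):
    """Soft-assigns each space-time latent token to one of the ID embeddings."""
    def __init__(self, ctx_dim=1024):
        super().__init__()
        self.q = nn.Linear(ctx_dim, ctx_dim)
        self.k = nn.Linear(ctx_dim, ctx_dim)

    def forward(self, latent_tokens, id_tokens):
        # latent_tokens: (batch, num_tokens, ctx_dim)  video DiT latents
        # id_tokens:     (num_ids, ctx_dim)            pooled per-ID context
        logits = self.q(latent_tokens) @ self.k(id_tokens).T  # (b, t, num_ids)
        routes = F.softmax(logits, dim=-1)  # per-token ID assignment weights
        # Mix each ID's context into the latent tokens it was routed to.
        return latent_tokens + routes @ id_tokens


# Toy forward pass with two identities and random features.
extractor, projector, router = FacialExtractor(), MultiScaleProjector(), IDRouter()
ids = extractor(torch.randn(2, 512), torch.randn(2, 512))      # (2, 768)
ctx = projector(ids)                                           # (2, 21, 1024)
out = router(torch.randn(1, 128, 1024), ctx.mean(dim=1))       # (1, 128, 1024)
print(out.shape)
```

In this reading, routing amounts to a softmax attention over the ID tokens, so each space-time region of the video latent ends up dominated by one identity's features rather than an indiscriminate blend of all of them.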
