Ingredients: Blending Custom Photos with Video Diffusion Transformers
January 3, 2025
Authors: Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
cs.AI
Abstract
This paper presents Ingredients, a powerful framework that customizes video
creations by incorporating multiple specific identity (ID) photos into video
diffusion Transformers. Our method consists of three primary modules: (i) a
facial extractor that captures versatile and precise facial features for each
human ID from both global and local perspectives; (ii) a multi-scale projector
that maps face embeddings into the contextual space of image queries in video
diffusion Transformers; (iii) an ID router that dynamically combines multiple
ID embeddings and allocates them to the corresponding space-time regions.
Leveraging a meticulously curated text-video dataset and a multi-stage
training protocol, Ingredients demonstrates superior performance in turning
custom photos into dynamic and personalized video content. Qualitative
evaluations highlight the advantages of the proposed method over existing
approaches, positioning it as a significant advancement toward more effective
generative video control tools in Transformer-based architectures. The data,
code, and model weights are publicly available at:
https://github.com/feizc/Ingredients.
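To make the three-module pipeline in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of a facial extractor, multi-scale projector, and ID router wired together. All class names, feature dimensions, and the softmax routing rule are illustrative assumptions, not the authors' released implementation; see https://github.com/feizc/Ingredients for the actual code.

```python
# Hypothetical sketch of the Ingredients-style pipeline described in the
# abstract. Shapes and module internals are assumptions for illustration.
import torch
import torch.nn as nn

class FacialExtractor(nn.Module):
    """Produces a global and a local face embedding for each ID photo."""
    def __init__(self, in_dim=512, dim=768):
        super().__init__()
        self.global_enc = nn.Linear(in_dim, dim)  # stand-in for a face-recognition backbone
        self.local_enc = nn.Linear(in_dim, dim)   # stand-in for a patch-level face encoder

    def forward(self, face_feats):                # (num_ids, in_dim)
        return self.global_enc(face_feats), self.local_enc(face_feats)

class MultiScaleProjector(nn.Module):
    """Fuses global/local embeddings and maps them into the image-query context space."""
    def __init__(self, dim=768, ctx_dim=1024):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2 * dim, ctx_dim), nn.GELU(), nn.Linear(ctx_dim, ctx_dim)
        )

    def forward(self, g, l):                      # each (num_ids, dim)
        return self.proj(torch.cat([g, l], dim=-1))  # (num_ids, ctx_dim)

class IDRouter(nn.Module):
    """Softly assigns each space-time token to the ID embedding it matches best."""
    def __init__(self, ctx_dim=1024):
        super().__init__()
        self.score = nn.Linear(ctx_dim, ctx_dim)

    def forward(self, tokens, id_embeds):         # (T*H*W, ctx_dim), (num_ids, ctx_dim)
        logits = self.score(tokens) @ id_embeds.t()  # per-token affinity to each ID
        route = logits.softmax(dim=-1)               # soft space-time routing weights
        return route @ id_embeds                     # routed ID context per token

# Toy forward pass with two IDs and a small flattened latent video.
extractor, projector, router = FacialExtractor(), MultiScaleProjector(), IDRouter()
faces = torch.randn(2, 512)                       # raw face features for two ID photos
id_embeds = projector(*extractor(faces))          # (2, 1024)
tokens = torch.randn(4 * 8 * 8, 1024)             # space-time latent tokens of the DiT
id_context = router(tokens, id_embeds)            # per-token ID conditioning
print(id_context.shape)                           # torch.Size([256, 1024])
```

The routed `id_context` would then condition the video diffusion Transformer's denoising steps, so each space-time region is steered by the identity assigned to it.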