ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
January 8, 2025
Authors: Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai
cs.AI
Abstract
Text-to-video generation has made remarkable advancements through diffusion
models. However, Multi-Concept Video Customization (MCVC) remains a significant
challenge. We identify two key challenges in this task: 1) the identity
decoupling problem, where directly adopting existing customization methods
inevitably mixes attributes when handling multiple concepts simultaneously, and
2) the scarcity of high-quality video-entity pairs, which are crucial for
training a model that represents and decouples various concepts well. To
address these challenges, we introduce ConceptMaster, an innovative framework
that effectively tackles the critical issues of identity decoupling while
maintaining concept fidelity in customized videos. Specifically, we introduce a
novel strategy of learning decoupled multi-concept embeddings that are injected
into the diffusion models in a standalone manner, which effectively guarantees
the quality of customized videos with multiple identities, even for highly
similar visual concepts. To further overcome the scarcity of high-quality MCVC
data, we carefully establish a data construction pipeline, which enables
systematic collection of precise multi-concept video-entity data across diverse
concepts. A comprehensive benchmark is designed to validate the effectiveness
of our model from three critical dimensions: concept fidelity, identity
decoupling ability, and video generation quality across six different concept
composition scenarios. Extensive experiments demonstrate that our ConceptMaster
significantly outperforms previous approaches for this task, paving the way for
generating personalized and semantically accurate videos across multiple
concepts.