ConceptMaster: Multi-Concept Video Customization on Diffusion Transformer Models Without Test-Time Tuning
January 8, 2025
Authors: Yuzhou Huang, Ziyang Yuan, Quande Liu, Qiulin Wang, Xintao Wang, Ruimao Zhang, Pengfei Wan, Di Zhang, Kun Gai
cs.AI
Abstract
Text-to-video generation has made remarkable advancements through diffusion
models. However, Multi-Concept Video Customization (MCVC) remains a significant
challenge. We identify two key challenges in this task: 1) the identity
decoupling problem, where directly adopting existing customization methods
inevitably mixes attributes when handling multiple concepts simultaneously, and
2) the scarcity of high-quality video-entity pairs, which are crucial for
training a model that represents and decouples various concepts well. To
address these challenges, we introduce ConceptMaster, an innovative framework
that effectively tackles the critical issues of identity decoupling while
maintaining concept fidelity in customized videos. Specifically, we introduce a
novel strategy of learning decoupled multi-concept embeddings that are injected
into the diffusion models in a standalone manner, which effectively guarantees
the quality of customized videos with multiple identities, even for highly
similar visual concepts. To further overcome the scarcity of high-quality MCVC
data, we carefully establish a data construction pipeline, which enables
systematic collection of precise multi-concept video-entity data across diverse
concepts. A comprehensive benchmark is designed to validate the effectiveness
of our model from three critical dimensions: concept fidelity, identity
decoupling ability, and video generation quality across six different concept
composition scenarios. Extensive experiments demonstrate that our ConceptMaster
significantly outperforms previous approaches for this task, paving the way for
generating personalized and semantically accurate videos across multiple
concepts.
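The abstract does not give implementation details, but as a rough illustration of the "standalone" injection idea, below is a minimal PyTorch-style sketch of a dedicated cross-attention branch that conditions DiT video tokens on per-concept embeddings separately from the text pathway. All class names, arguments, and tensor shapes here are assumptions made for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class DecoupledConceptInjection(nn.Module):
    """Hypothetical standalone cross-attention branch for concept embeddings."""

    def __init__(self, hidden_dim: int, concept_dim: int, num_heads: int = 8):
        super().__init__()
        # Project reference-concept features into the DiT hidden space.
        self.concept_proj = nn.Linear(concept_dim, hidden_dim)
        # Cross-attention used only for concept conditioning, kept separate
        # ("standalone") from the usual text cross-attention.
        self.concept_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, video_tokens: torch.Tensor, concept_embeds: torch.Tensor) -> torch.Tensor:
        # video_tokens:   (B, T, hidden_dim)      DiT video tokens
        # concept_embeds: (B, N, L, concept_dim)  N concepts, L tokens each
        b, n, l, _ = concept_embeds.shape
        kv = self.concept_proj(concept_embeds).reshape(b, n * l, -1)
        attn_out, _ = self.concept_attn(query=self.norm(video_tokens), key=kv, value=kv)
        # Residual injection leaves the text-conditioning pathway untouched.
        return video_tokens + attn_out


if __name__ == "__main__":
    # Toy shapes just to show the module runs end to end.
    block = DecoupledConceptInjection(hidden_dim=128, concept_dim=64)
    tokens = torch.randn(2, 256, 128)      # batch of 2, 256 video tokens
    concepts = torch.randn(2, 3, 16, 64)   # 3 concepts, 16 tokens each
    print(block(tokens, concepts).shape)   # torch.Size([2, 256, 128])
```

This sketch only shows where decoupled concept embeddings could enter the transformer; how those embeddings are learned so that distinct, even visually similar, identities remain separated is the contribution the paper describes and is not reproduced here.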