DreamRelation: Relation-Centric Video Customization
March 10, 2025
Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan
cs.AI
Abstract
Relational video customization refers to the creation of personalized videos
that depict user-specified relations between two subjects, a crucial task for
comprehending real-world visual content. While existing methods can personalize
subject appearances and motions, they still struggle with complex relational
video customization, where precise relational modeling and high generalization
across subject categories are essential. The primary challenge arises from the
intricate spatial arrangements, layout variations, and nuanced temporal
dynamics inherent in relations; consequently, current models tend to
overemphasize irrelevant visual details rather than capturing meaningful
interactions. To address these challenges, we propose DreamRelation, a novel
approach that personalizes relations through a small set of exemplar videos,
leveraging two key components: Relational Decoupling Learning and Relational
Dynamics Enhancement. First, in Relational Decoupling Learning, we disentangle
relations from subject appearances using a relation LoRA triplet and a hybrid mask
training strategy, ensuring better generalization across diverse relationships.
Furthermore, we determine the optimal design of the relation LoRA triplet by
analyzing the distinct roles of the query, key, and value features within
MM-DiT's attention mechanism, making DreamRelation the first relational video
generation framework with explainable components. Second, in Relational
Dynamics Enhancement, we introduce a space-time relational contrastive loss,
which prioritizes relational dynamics while minimizing the reliance on detailed
subject appearances. Extensive experiments demonstrate that DreamRelation
outperforms state-of-the-art methods in relational video customization. Code
and models will be made publicly available.
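
To make the idea of attaching LoRA adapters to the query, key, and value projections concrete, the following is a minimal sketch of generic low-rank adaptation on a single-head self-attention layer. It is not the authors' implementation: the abstract does not specify how the relation LoRA triplet is laid out, and this toy layer is not MM-DiT. The names `LoRALinear`, `attention`, the rank, and the scaling factor are all hypothetical choices made for illustration.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a low-rank update scale * B @ A (generic LoRA)."""

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                           # frozen base projection
        self.A = rng.normal(0.0, 0.01, (rank, d_in))   # down-projection
        self.B = np.zeros((d_out, rank))               # up-projection, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (tokens, d_in) -> (tokens, d_out)
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def attention(x, q_proj, k_proj, v_proj):
    """Single-head scaled dot-product self-attention over the tokens in x."""
    q, k, v = q_proj(x), k_proj(x), v_proj(x)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


# Toy setup: 5 tokens with 8 features each (assumed sizes, illustration only).
rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=(5, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# One adapter per projection: query, key, and value.
q_lora, k_lora, v_lora = (
    LoRALinear(W, seed=i) for i, W in enumerate((Wq, Wk, Wv))
)

out_adapted = attention(x, q_lora, k_lora, v_lora)
out_base = attention(
    x, lambda t: t @ Wq.T, lambda t: t @ Wk.T, lambda t: t @ Wv.T
)
```

Because each `B` matrix starts at zero, the adapted layer initially reproduces the frozen base layer exactly; only training the small `A` and `B` matrices moves the output. Keeping separate adapters per projection is what would allow different projections to be specialized for different roles, which is the kind of design question the abstract says the authors analyze.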