Dynamic Concepts Personalization from Single Videos
February 20, 2025
Authors: Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
cs.AI
Abstract
Personalizing generative text-to-image models has seen remarkable progress,
but extending this personalization to text-to-video models presents unique
challenges. Unlike static concepts, personalizing text-to-video models has the
potential to capture dynamic concepts, i.e., entities defined not only by their
appearance but also by their motion. In this paper, we introduce
Set-and-Sequence, a novel framework for personalizing Diffusion Transformer
(DiT)-based generative video models with dynamic concepts. Our approach
imposes a spatio-temporal weight space within an architecture that does not
explicitly separate spatial and temporal features. This is achieved in two key
stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an
unordered set of frames from the video to learn an identity LoRA basis that
represents the appearance, free from temporal interference. In the second
stage, with the identity LoRAs frozen, we augment their coefficients with
Motion Residuals and fine-tune them on the full video sequence, capturing
motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal
weight space that effectively embeds dynamic concepts into the video model's
output domain, enabling unprecedented editability and compositionality while
setting a new benchmark for personalizing dynamic concepts.
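
The two stages described above can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the abstract does not give the exact parameterization, so the sketch assumes a single linear layer whose weight update is A @ B, treats B as the identity basis and A as its coefficients, and models the Motion Residual as an additive term on A. All names (W0, A_id, B_id, A_motion, y_identity, y_sequence) and the synthetic data are hypothetical. In this toy, the unordered set is mimicked by permuting the frames at every step, which discards temporal order.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, d_in, d_out, rank = 16, 8, 8, 2

# Frozen base weight: a stand-in for one projection matrix of the video model.
W0 = 0.1 * rng.normal(size=(d_in, d_out))

def loss_and_grad(x, y, A, B):
    """MSE of x @ (W0 + A @ B) against y, plus gradients w.r.t. A and B."""
    err = x @ (W0 + A @ B) - y
    g_delta = 2.0 * x.T @ err / err.size      # gradient w.r.t. the update A @ B
    return float(np.mean(err ** 2)), g_delta @ B.T, A.T @ g_delta

# Synthetic data (assumption): "appearance" targets and "appearance + motion" targets.
frames = rng.normal(size=(n_frames, d_in))
A_true = 0.5 * rng.normal(size=(d_in, rank))
B_true = 0.5 * rng.normal(size=(rank, d_out))
dA_true = 0.5 * rng.normal(size=(d_in, rank))
y_identity = frames @ (W0 + A_true @ B_true)
y_sequence = frames @ (W0 + (A_true + dA_true) @ B_true)

lr = 0.05

# Stage 1: fit the identity LoRA on the frames treated as an unordered set.
A_id = np.zeros((d_in, rank))                  # coefficients, zero-init so A @ B starts at 0
B_id = 0.1 * rng.normal(size=(rank, d_out))    # basis
for _ in range(500):
    order = rng.permutation(n_frames)          # frame order is deliberately discarded
    _, gA, gB = loss_and_grad(frames[order], y_identity[order], A_id, B_id)
    A_id -= lr * gA
    B_id -= lr * gB

# Stage 2: freeze A_id and B_id; train only a residual on the coefficients,
# this time on the frames in their original temporal order.
A_id_frozen, B_id_frozen = A_id.copy(), B_id.copy()
A_motion = np.zeros_like(A_id)
loss_before, _, _ = loss_and_grad(frames, y_sequence, A_id + A_motion, B_id)
for _ in range(500):
    _, gA, _ = loss_and_grad(frames, y_sequence, A_id + A_motion, B_id)
    A_motion -= lr * gA                        # A_id and B_id receive no update
loss_after, _, _ = loss_and_grad(frames, y_sequence, A_id + A_motion, B_id)

print(f"stage-2 loss: {loss_before:.4f} -> {loss_after:.4f}")
```

Because the basis B_id stays fixed in the second stage, the residual A_motion can only move the weights within the subspace learned from appearance. This is one plausible reading of how a shared spatio-temporal weight space could arise; the actual method operates on DiT layers with a diffusion training objective rather than a linear regression loss.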