

Dynamic Concepts Personalization from Single Videos

February 20, 2025
作者: Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman
cs.AI

Abstract

Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.
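The two-stage recipe described above can be illustrated with a minimal sketch. The snippet below is not the authors' code: it uses a toy linear layer as a stand-in for a DiT projection and a squared-activation loss as a stand-in for the diffusion objective, and all class and variable names (`LoRALinear`, `coeff`, `motion_residual`) are hypothetical. It only shows the training-flow idea: learn a LoRA appearance basis on an unordered frame set, then freeze it and fine-tune a residual on the ordered sequence.

```python
# Minimal sketch of the two-stage Set-and-Sequence idea (hypothetical names;
# a toy linear layer stands in for a DiT attention projection).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank residual."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                               # base stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                            # start as a no-op
        self.coeff = nn.Parameter(torch.ones(rank))               # identity coefficients
        self.motion_residual = nn.Parameter(torch.zeros(rank))    # stage-2 residual

    def forward(self, x):
        h = self.down(x) * (self.coeff + self.motion_residual)
        return self.base(x) + self.up(h)

layer = LoRALinear(nn.Linear(64, 64), rank=4)

# Stage 1 ("Set"): fit the identity LoRA basis on an unordered set of frames,
# with the motion residual disabled so only appearance is learned.
layer.motion_residual.requires_grad = False
stage1_params = [layer.down.weight, layer.up.weight, layer.coeff]
opt1 = torch.optim.AdamW(stage1_params, lr=1e-4)
frames = torch.randn(8, 64)             # stand-in for encoded, shuffled frames
loss = layer(frames).pow(2).mean()      # stand-in for the diffusion loss
loss.backward()
opt1.step()

# Stage 2 ("Sequence"): freeze the identity basis and fine-tune only the
# motion residual on the full, ordered video sequence.
for p in stage1_params:
    p.requires_grad = False
layer.motion_residual.requires_grad = True
opt2 = torch.optim.AdamW([layer.motion_residual], lr=1e-4)
sequence = torch.randn(16, 64)          # stand-in for the ordered sequence
loss = layer(sequence).pow(2).mean()
loss.backward()
opt2.step()
```

Because the appearance basis is frozen in stage 2, motion is forced into a small residual on the coefficients, which is what gives the resulting spatio-temporal weight space its editability and compositionality under the assumptions of this sketch.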

