Dynamic Concepts Personalization from Single Videos

Abstract

Personalization of generative text-to-image models has made remarkable progress, but extending it to text-to-video models presents unique challenges. Whereas text-to-image personalization deals with static concepts, text-to-video personalization can capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformer (DiT)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. In the first stage, we fine-tune Low-Rank Adaptation (LoRA) layers on an unordered set of frames from the video to learn an identity LoRA basis that represents appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence to capture motion dynamics. The resulting spatio-temporal weight space effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.
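The two stages described above can be illustrated with a toy weight parameterization. The sketch below is not the authors' implementation: the class name `SetAndSequenceLinear`, the helper `shuffled_frames`, and the specific form W = W0 + (A_id + dA_motion) B_id are assumptions chosen to match the abstract's wording (a frozen identity basis whose coefficients are augmented by a residual). It contains no training loop and only shows which parameters would be trainable in each stage.

```python
"""Toy sketch of a two-stage LoRA parameterization (illustrative assumptions only)."""
import random


def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two same-shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def zeros(rows, cols):
    """Matrix of zeros with independent rows."""
    return [[0.0] * cols for _ in range(rows)]


def rand(rows, cols, rng):
    """Matrix of small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class SetAndSequenceLinear:
    """Hypothetical layer with W = W0 + (A_id + dA_motion) @ B_id (assumed form)."""

    def __init__(self, d_out, d_in, rank, seed=0):
        rng = random.Random(seed)
        self.w0 = rand(d_out, d_in, rng)      # pretrained weight, always frozen
        self.a_id = zeros(d_out, rank)        # identity coefficients (stage 1)
        self.b_id = rand(rank, d_in, rng)     # identity basis (stage 1)
        self.da_motion = zeros(d_out, rank)   # motion residual (stage 2), starts at zero
        self.stage = 1

    def trainable(self):
        """Names of the parameters that would be updated in the current stage."""
        return ["a_id", "b_id"] if self.stage == 1 else ["da_motion"]

    def start_stage_two(self):
        """Freeze the identity LoRA; only the motion residual stays trainable."""
        self.stage = 2

    def effective_weight(self):
        """Compose the frozen weight with the low-rank update."""
        coeff = add(self.a_id, self.da_motion)
        return add(self.w0, matmul(coeff, self.b_id))


def shuffled_frames(video, seed=0):
    """Stage 1 input (assumed): the video's frames with their order destroyed."""
    frames = list(video)
    random.Random(seed).shuffle(frames)
    return frames


layer = SetAndSequenceLinear(d_out=4, d_in=3, rank=2)
stage_one_batch = shuffled_frames(range(8))   # frames as an unordered set
layer.start_stage_two()                       # stage 2 would use the ordered sequence
```

Because the residual starts at zero in this sketch, switching to the second stage leaves the effective weight unchanged until `da_motion` is updated, so the appearance learned in the first stage is the starting point for learning motion.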
