Subject-driven Video Generation via Disentangled Identity and Motion
April 23, 2025
Authors: Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo
cs.AI
Abstract
We propose to train a subject-driven, customized video generation model by decoupling
subject-specific learning from temporal dynamics, enabling zero-shot generation without
additional tuning. Traditional tuning-free methods for video customization rely on large,
annotated video datasets, which are computationally expensive to build and require extensive
annotation. In contrast, we use an image customization dataset directly to train video
customization models, factorizing video customization into two parts: (1) identity injection
through the image customization dataset, and (2) preservation of temporal modeling using a
small set of unannotated videos via image-to-video training. Additionally, we employ random
image token dropping together with randomized image initialization during image-to-video
fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce
stochastic switching during the joint optimization of subject-specific and temporal features,
which mitigates catastrophic forgetting. Our method achieves strong subject consistency and
scalability, outperforming existing video customization models in zero-shot settings and
demonstrating the effectiveness of our framework.
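The copy-and-paste mitigation mentioned in the abstract (random image token dropping combined with randomized image initialization during image-to-video fine-tuning) can be illustrated with a minimal sketch. The code below is a hypothetical PyTorch-style illustration, not the authors' implementation; the tensor shapes, the `drop_prob` and `noise_std` parameters, and the function name are assumptions made for the example.

```python
import torch

def prepare_conditioning_image_tokens(image_tokens: torch.Tensor,
                                       drop_prob: float = 0.3,
                                       noise_std: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch of the copy-and-paste mitigation described in the abstract.

    image_tokens: (batch, num_tokens, dim) tokens of the reference/conditioning image.
    - Randomly drops a subset of image tokens so the model cannot simply copy the
      reference image verbatim into every generated frame.
    - Perturbs the remaining tokens with noise ("randomized image initialization"
      is approximated here by additive Gaussian noise; an assumption, not the paper's recipe).
    """
    batch, num_tokens, _ = image_tokens.shape

    # Keep each token with probability (1 - drop_prob); dropped tokens are zeroed out.
    keep_mask = (torch.rand(batch, num_tokens, 1, device=image_tokens.device)
                 > drop_prob).float()
    tokens = image_tokens * keep_mask

    # Randomize the conditioning slightly so the first frame is not an exact
    # copy of the reference image.
    tokens = tokens + noise_std * torch.randn_like(tokens)
    return tokens
```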
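Likewise, the stochastic switching between the identity-injection objective (image customization data) and the temporal objective (unannotated videos) can be sketched as a training loop that randomly picks which branch to optimize at each step. All names and the switching probability `p_identity` below are hypothetical; the sketch only conveys the alternation that the abstract credits with reducing catastrophic forgetting, not the paper's actual code.

```python
import random
import torch

def joint_training_step(model, optimizer,
                        image_custom_batch, video_batch,
                        identity_loss_fn, video_loss_fn,
                        p_identity: float = 0.5):
    """One hypothetical step of stochastic switching between the two objectives.

    With probability p_identity, the step optimizes identity injection on an
    image-customization batch; otherwise it optimizes temporal modeling on an
    unannotated video batch (image-to-video objective).
    """
    optimizer.zero_grad()

    if random.random() < p_identity:
        # Identity-injection branch: learn the subject from image customization data.
        loss = identity_loss_fn(model, image_custom_batch)
    else:
        # Temporal branch: preserve motion modeling via image-to-video training.
        loss = video_loss_fn(model, video_batch)

    loss.backward()
    optimizer.step()
    return loss.detach()
```

Randomly interleaving the two objectives, rather than training them in fixed phases, is a common way to keep one capability from overwriting the other; the exact switching schedule used by the authors is not specified in the abstract.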