SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
December 13, 2024
Authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun
cs.AI
Abstract
We present SUGAR, a zero-shot method for subject-driven video customization.
Given an input image, SUGAR is capable of generating videos for the subject
contained in the image and aligning the generation with arbitrary visual
attributes such as style and motion specified by user-input text. Unlike
previous methods, which require test-time fine-tuning or fail to generate
text-aligned videos, SUGAR achieves superior results without the need for extra
cost at test-time. To enable zero-shot capability, we introduce a scalable
pipeline for constructing a synthetic dataset specifically designed for
subject-driven customization, yielding 2.5 million image-video-text
triplets. Additionally, we propose several methods to enhance our model,
including special attention designs, improved training strategies, and a
refined sampling algorithm. Extensive experiments show that, compared to
previous methods, SUGAR achieves state-of-the-art results in identity
preservation, video dynamics, and video-text alignment for subject-driven video
customization, demonstrating the effectiveness of our proposed method.