SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
December 13, 2024
Authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun
cs.AI
Abstract
We present SUGAR, a zero-shot method for subject-driven video customization.
Given an input image, SUGAR is capable of generating videos for the subject
contained in the image and aligning the generation with arbitrary visual
attributes such as style and motion specified by user-input text. Unlike
previous methods, which require test-time fine-tuning or fail to generate
text-aligned videos, SUGAR achieves superior results without the need for extra
cost at test-time. To enable zero-shot capability, we introduce a scalable
pipeline for constructing a synthetic dataset specifically designed for
subject-driven customization, yielding 2.5 million image-video-text
triplets. Additionally, we propose several methods to enhance our model,
including special attention designs, improved training strategies, and a
refined sampling algorithm. Extensive experiments show that, compared to
previous methods, SUGAR achieves state-of-the-art results in identity
preservation, video dynamics, and video-text alignment for subject-driven video
customization, demonstrating the effectiveness of our proposed method.