SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner
December 13, 2024
Authors: Yufan Zhou, Ruiyi Zhang, Jiuxiang Gu, Nanxuan Zhao, Jing Shi, Tong Sun
cs.AI
Abstract
We present SUGAR, a zero-shot method for subject-driven video customization.
Given an input image, SUGAR is capable of generating videos for the subject
contained in the image and aligning the generation with arbitrary visual
attributes such as style and motion specified by user-input text. Unlike
previous methods, which require test-time fine-tuning or fail to generate
text-aligned videos, SUGAR achieves superior results without extra cost at
test time. To enable zero-shot capability, we introduce a scalable pipeline
to construct a synthetic dataset specifically designed for subject-driven
customization, yielding 2.5 million image-video-text
triplets. Additionally, we propose several methods to enhance our model,
including special attention designs, improved training strategies, and a
refined sampling algorithm. Extensive experiments are conducted. Compared to
previous methods, SUGAR achieves state-of-the-art results in identity
preservation, video dynamics, and video-text alignment for subject-driven video
customization, demonstrating the effectiveness of our proposed method.
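To make the workflow described above concrete, the sketch below is a hypothetical illustration only, not the authors' released code or API: it shows the kind of image-video-text triplet record the synthetic dataset consists of, and a zero-shot inference call that conditions generation on a subject image and a user text prompt with no test-time fine-tuning. All names (`Triplet`, `SubjectDrivenVideoModel`, `customize_video`) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Protocol

# Hypothetical training record: one entry of the 2.5M synthetic
# image-video-text triplets described in the abstract.
@dataclass
class Triplet:
    subject_image: str   # path to the image containing the subject
    video: str           # path to the target video clip
    text: str            # caption specifying attributes such as style and motion

# Hypothetical interface for any subject-driven video generator that
# supports zero-shot customization (no per-subject optimization).
class SubjectDrivenVideoModel(Protocol):
    def generate(self, subject_image: str, prompt: str, num_frames: int) -> list: ...

def customize_video(model: SubjectDrivenVideoModel,
                    subject_image: str,
                    prompt: str,
                    num_frames: int = 16) -> list:
    """Zero-shot customization: a single generation call conditioned on the
    subject image and the text prompt; no test-time fine-tuning is performed."""
    return model.generate(subject_image, prompt, num_frames=num_frames)

# Example usage (hypothetical):
# frames = customize_video(model, "corgi.png", "a corgi surfing, watercolor style")
```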