SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Abstract

We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR generates videos of the subject contained in the image and aligns the generation with arbitrary visual attributes, such as style and motion, specified by user-input text. Unlike previous methods, which either require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results with no extra cost at test time. To enable zero-shot capability, we introduce a scalable pipeline for constructing a synthetic dataset designed specifically for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. Extensive experiments show that, compared to previous methods, SUGAR achieves state-of-the-art results in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of the proposed method.
