SUGAR: Subject-Driven Video Customization in a Zero-Shot Manner

Abstract

We present SUGAR, a zero-shot method for subject-driven video customization. Given an input image, SUGAR generates videos of the subject contained in the image and aligns the generation with arbitrary visual attributes, such as style and motion, specified by user-input text. Unlike previous methods, which either require test-time fine-tuning or fail to generate text-aligned videos, SUGAR achieves superior results at no extra test-time cost. To enable zero-shot capability, we introduce a scalable pipeline for constructing a synthetic dataset specifically designed for subject-driven customization, yielding 2.5 million image-video-text triplets. Additionally, we propose several methods to enhance our model, including special attention designs, improved training strategies, and a refined sampling algorithm. In extensive experiments, SUGAR achieves state-of-the-art results compared to previous methods in identity preservation, video dynamics, and video-text alignment for subject-driven video customization, demonstrating the effectiveness of the proposed method.

