VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Abstract

Zero-shot customized video generation has attracted significant attention because of its broad application potential. Existing methods rely on additional models to extract and inject reference subject features, on the assumption that a Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain a consistent subject appearance because their feature extraction and injection techniques are suboptimal. In this paper, we show that the VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages the VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we input reference images directly into the VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also aligns closely with the VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within the VDM, so that the VDM achieves better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.
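
The injection idea described in the abstract, in which subject features and generated content interact in both directions through spatial self-attention, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. It assumes, for illustration only, that the reference tokens are concatenated with each frame's spatial tokens before a single-head self-attention layer, so that frame tokens attend to reference tokens and vice versa. The function name `joint_spatial_self_attention`, the projection matrices, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_spatial_self_attention(frame_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head self-attention over frame tokens joined with reference tokens.

    Assumption (not taken from the paper): the reference tokens are appended
    to the spatial tokens of every frame, and attention runs over the joint
    sequence, so information flows in both directions.

    frame_tokens: (F, N, D) spatial tokens for each of F frames
    ref_tokens:   (M, D)    tokens of the reference image
    w_q, w_k, w_v: (D, D_h) hypothetical projection matrices
    """
    num_frames, n, _ = frame_tokens.shape
    # Share the same reference tokens across all frames.
    ref = np.broadcast_to(ref_tokens, (num_frames,) + ref_tokens.shape)
    joint = np.concatenate([frame_tokens, ref], axis=1)  # (F, N + M, D)

    q, k, v = joint @ w_q, joint @ w_k, joint @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # (F, N + M, N + M)
    out = weights @ v

    # Split the joint output back into frame and reference parts.
    return out[:, :n], out[:, n:], weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(4, 6, 8))  # 4 frames, 6 tokens each, width 8
    reference = rng.normal(size=(3, 8))  # 3 reference tokens
    w_q, w_k, w_v = (0.3 * rng.normal(size=(8, 8)) for _ in range(3))
    frame_out, ref_out, attn = joint_spatial_self_attention(
        frames, reference, w_q, w_k, w_v
    )
    print(frame_out.shape, ref_out.shape, attn.shape)
```

In this sketch the off-diagonal blocks of `attn` carry the two directions of interaction: `attn[:, :N, N:]` is how strongly frame tokens attend to the reference, and `attn[:, N:, :N]` is the reverse. How the actual framework arranges, masks, or fine-tunes this attention is not specified in the abstract.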
