VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Abstract

Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain a consistent subject appearance because their feature extraction and injection techniques are suboptimal. In this paper, we reveal that the VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages the VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into the VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also aligns closely with the VDM's pre-trained knowledge. For feature injection, we devise a bidirectional interaction between subject features and generated content through spatial self-attention within the VDM, so that the VDM achieves better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.
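
The injection step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that "bidirectional interaction through spatial self-attention" can be approximated by appending the reference-image tokens to each frame's tokens before a single-head self-attention, so that frame tokens attend to subject tokens and subject tokens attend to frame tokens. The function name `joint_spatial_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def joint_spatial_self_attention(frame_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head self-attention over [frame tokens ; reference tokens].

    frame_tokens: (F, N, D) spatial tokens for each of F frames.
    ref_tokens:   (M, D) tokens of the reference image (assumed to come
                  from the same network, as the abstract suggests).
    Returns the updated frame tokens, the updated reference tokens,
    and the attention map.
    """
    num_frames, n, _ = frame_tokens.shape
    # Share the same reference tokens with every frame (assumption).
    ref = np.broadcast_to(ref_tokens, (num_frames,) + ref_tokens.shape)
    tokens = np.concatenate([frame_tokens, ref], axis=1)  # (F, N+M, D)

    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)  # (F, N+M, N+M)
    out = attn @ v

    # Split back: the first N tokens belong to the frame, the rest to the reference.
    return out[:, :n], out[:, n:], attn


# Toy usage with random data (all sizes are illustrative).
rng = np.random.default_rng(0)
F, N, M, D = 4, 16, 16, 8
frames = rng.normal(size=(F, N, D))
reference = rng.normal(size=(M, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
frame_out, ref_out, attn = joint_spatial_self_attention(
    frames, reference, w_q, w_k, w_v
)
```

In this sketch the block `attn[:, :N, N:]` holds the weights with which frame tokens read subject features, and `attn[:, N:, :N]` holds the reverse direction. That two-way flow is what the sketch means by bidirectional, in contrast to cross-attention from a separate encoder, where information flows only from the subject to the generated content.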
