VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
December 27, 2024
Authors: Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li
cs.AI
Abstract
Zero-shot customized video generation has gained significant attention due to
its substantial application potential. Existing methods rely on additional
models to extract and inject reference subject features, assuming that the
Video Diffusion Model (VDM) alone is insufficient for zero-shot customized
video generation. However, these methods often struggle to maintain consistent
subject appearance due to suboptimal feature extraction and injection
techniques. In this paper, we reveal that VDM inherently possesses the force to
extract and inject subject features. Departing from previous heuristic
approaches, we introduce a novel framework that leverages VDM's inherent force
to enable high-quality zero-shot customized video generation. Specifically, for
feature extraction, we directly input reference images into VDM and use its
intrinsic feature extraction process, which not only provides fine-grained
features but also significantly aligns with VDM's pre-trained knowledge. For
feature injection, we devise an innovative bidirectional interaction between
subject features and generated content through spatial self-attention within
VDM, ensuring that VDM has better subject fidelity while maintaining the
diversity of the generated video. Experiments on both customized human and
object video generation validate the effectiveness of our framework.
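
The injection idea described above can be illustrated with a toy sketch. It is not the authors' implementation: it assumes that "bidirectional interaction through spatial self-attention" means the reference-image tokens and the frame tokens are concatenated and attended over jointly, and it uses a single head with identity query/key/value projections. The function names (`self_attention`, `joint_spatial_attention`) and the toy token values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention over a list of token vectors.

    Assumption: identity Q/K/V projections, so each output token is a
    softmax-weighted average of the input tokens.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def joint_spatial_attention(frame_tokens, ref_tokens):
    """Attend over frame and reference tokens together, then split.

    Frame tokens can read from reference tokens and vice versa, which is
    one way to read the "bidirectional interaction" in the abstract.
    """
    joint = self_attention(frame_tokens + ref_tokens)
    n = len(frame_tokens)
    return joint[:n], joint[n:]


# Toy data: 3 spatial tokens from one generated frame, 2 from the reference image.
frame = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
ref = [[2.0, 0.0], [0.0, 2.0]]

frame_out, ref_out = joint_spatial_attention(frame, ref)
frame_only = self_attention(frame)
```

In this sketch `frame_out` differs from `frame_only` because each frame token also attends to the reference tokens. In an actual VDM the tokens would be latent features from the same network, with learned projections and multiple heads.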