VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Abstract

Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain a consistent subject appearance because their feature extraction and injection techniques are suboptimal. In this paper, we reveal that the VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages the VDM's inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into the VDM and use its intrinsic feature extraction process, which not only provides fine-grained features but also aligns closely with the VDM's pre-trained knowledge. For feature injection, we devise an innovative bidirectional interaction between subject features and generated content through spatial self-attention within the VDM, ensuring that the VDM achieves better subject fidelity while maintaining the diversity of the generated video. Experiments on both customized human and object video generation validate the effectiveness of our framework.
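The feature-injection idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes that "bidirectional interaction through spatial self-attention" can be approximated by concatenating reference-image tokens with each frame's tokens and running ordinary self-attention over the joint sequence. The function names (`self_attention`, `joint_spatial_attention`, `customize_video`) are hypothetical, and the single-head attention with identity query/key/value projections is a deliberate simplification.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    A real VDM layer uses learned projections and several heads; they are
    omitted here (assumption) to keep the sketch short.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)]
        )
    return out


def joint_spatial_attention(frame_tokens, ref_tokens):
    """Attend over one frame's tokens concatenated with reference tokens.

    Both token sets share one sequence, so frame tokens can read subject
    features and reference tokens can read frame content (bidirectional).
    Returns (updated_frame_tokens, updated_ref_tokens).
    """
    joint = frame_tokens + ref_tokens
    updated = self_attention(joint)
    n = len(frame_tokens)
    return updated[:n], updated[n:]


def customize_video(video_tokens, ref_tokens):
    """Apply the joint attention to every frame independently (spatial only)."""
    return [joint_spatial_attention(frame, ref_tokens)[0] for frame in video_tokens]
```

Here `video_tokens` would be a list of frames, each frame a list of token vectors, and `ref_tokens` the token vectors obtained by passing the reference image through the same network. Sharing one attention sequence is what lets information flow in both directions without an extra encoder, which is the property the abstract emphasizes.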
