DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control
October 17, 2024
Authors: Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, Hongming Shan
cs.AI
Abstract
Recent advances in customized video generation have enabled users to create
videos tailored to both specific subjects and motion trajectories. However,
existing methods often require complicated test-time fine-tuning and struggle
with balancing subject learning and motion control, limiting their real-world
applications. In this paper, we present DreamVideo-2, a zero-shot video
customization framework capable of generating videos with a specific subject
and motion trajectory, guided by a single image and a bounding box sequence,
respectively, and without the need for test-time fine-tuning. Specifically, we
introduce reference attention, which leverages the model's inherent
capabilities for subject learning, and devise a mask-guided motion module to
achieve precise motion control by fully utilizing the robust motion signal of
box masks derived from bounding boxes. While these two components achieve their
intended functions, we empirically observe that motion control tends to
dominate over subject learning. To address this, we propose two key designs: 1)
the masked reference attention, which integrates a blended latent mask modeling
scheme into reference attention to enhance subject representations at the
desired positions, and 2) a reweighted diffusion loss, which differentiates the
contributions of regions inside and outside the bounding boxes to ensure a
balance between subject and motion control. Extensive experimental results on a
newly curated dataset demonstrate that DreamVideo-2 outperforms
state-of-the-art methods in both subject customization and motion control. The
dataset, code, and models will be made publicly available.
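The reweighted diffusion loss described in the abstract weights the denoising objective differently inside and outside the bounding boxes. A minimal sketch of one plausible formulation is below; the weight values, the box-to-mask rasterization, and the function names are illustrative assumptions, not the paper's exact method:

```python
import numpy as np

def boxes_to_mask(boxes, h, w):
    """Rasterize (x0, y0, x1, y1) bounding boxes into a binary box mask.

    This mirrors the idea of deriving box masks from a bounding box
    sequence; a real implementation would do this per frame.
    """
    mask = np.zeros((h, w), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1.0
    return mask

def reweighted_diffusion_loss(eps_pred, eps_target, mask, w_in=2.0, w_out=1.0):
    """Per-pixel weighted MSE between predicted and target noise.

    Pixels inside the boxes (mask == 1) contribute with weight w_in,
    pixels outside with w_out, so subject regions and background can
    be balanced against each other. w_in/w_out are assumed values.
    """
    weights = w_in * mask + w_out * (1.0 - mask)
    squared_error = (eps_pred - eps_target) ** 2
    return float((weights * squared_error).sum() / weights.sum())
```

Setting `w_in > w_out` boosts the gradient signal on the subject region, which matches the abstract's goal of keeping subject learning from being overwhelmed by motion control.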