SketchVideo: Sketch-based Video Generation and Editing
March 30, 2025
Authors: Feng-Lin Liu, Hongbo Fu, Xintao Wang, Weicai Ye, Pengfei Wan, Di Zhang, Lin Gao
cs.AI
Abstract
Video generation and editing conditioned on text prompts or images have
undergone significant advancements. However, challenges remain in accurately
controlling global layout and geometry details solely by texts, and supporting
motion control and local modification through images. In this paper, we aim to
achieve sketch-based spatial and motion control for video generation and
support fine-grained editing of real or synthetic videos. Based on the DiT
video generation model, we propose a memory-efficient control structure with
sketch control blocks that predict residual features of skipped DiT blocks.
Sketches are drawn on one or two keyframes (at arbitrary time points) for easy
interaction. To propagate such temporally sparse sketch conditions across all
frames, we propose an inter-frame attention mechanism to analyze the
relationship between the keyframes and each video frame. For sketch-based video
editing, we design an additional video insertion module that maintains
consistency between the newly edited content and the original video's spatial
feature and dynamic motion. During inference, we use latent fusion for the
accurate preservation of unedited regions. Extensive experiments demonstrate
that our SketchVideo achieves superior performance in controllable video
generation and editing.
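The latent-fusion step mentioned in the abstract, which preserves unedited regions during inference, can be sketched as a masked blend of the original and edited video latents. This is a minimal illustration under assumed array shapes; the function name, shapes, and blending scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def latent_fusion(original_latent, edited_latent, edit_mask):
    """Blend edited and original video latents so that unedited
    regions are kept exactly (a common latent-fusion scheme).

    original_latent, edited_latent: arrays of shape (T, C, H, W).
    edit_mask: array of shape (T, 1, H, W), 1 inside the edited
    region and 0 outside; it broadcasts over the channel axis.
    """
    return edit_mask * edited_latent + (1.0 - edit_mask) * original_latent

# Toy usage: a 2-frame latent where only the left half is edited.
T, C, H, W = 2, 4, 8, 8
rng = np.random.default_rng(0)
orig = rng.standard_normal((T, C, H, W))
edit = rng.standard_normal((T, C, H, W))
mask = np.zeros((T, 1, H, W))
mask[..., : W // 2] = 1.0  # mark the left half as edited

fused = latent_fusion(orig, edit, mask)
# The unedited (right) half matches the original latent exactly.
assert np.allclose(fused[..., W // 2 :], orig[..., W // 2 :])
```

In practice such fusion is applied in the diffusion latent space at each denoising step, so the generated content inside the mask stays consistent with the untouched latents outside it.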