VACE: All-in-One Video Creation and Editing
March 10, 2025
Authors: Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu
cs.AI
Abstract
Diffusion Transformers have demonstrated powerful capability and scalability in
generating high-quality images and videos. Further pursuing the unification of
generation and editing tasks has yielded significant progress in the domain of
image content creation. However, due to the intrinsic demands for consistency
across both temporal and spatial dynamics, achieving a unified approach for
video synthesis remains challenging. We introduce VACE, which enables users to
perform Video tasks within an All-in-one framework for Creation and Editing.
These tasks include reference-to-video generation, video-to-video editing, and
masked video-to-video editing. Specifically, we effectively integrate the
requirements of various tasks by organizing video task inputs, such as editing,
reference, and masking, into a unified interface referred to as the Video
Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we
inject different task concepts into the model using formalized representations
of temporal and spatial dimensions, allowing it to handle arbitrary video
synthesis tasks flexibly. Extensive experiments demonstrate that the unified
model of VACE achieves performance on par with task-specific models across
various subtasks. Simultaneously, it enables diverse applications through
versatile task combinations. Project page:
https://ali-vilab.github.io/VACE-Page/
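
To make the idea of a unified input interface more concrete, the following is a minimal sketch of how the three task families named in the abstract might be packed into one structure. It is not the authors' implementation: the class name `VideoConditionUnit`, the helper functions, the toy list-based frames, and the mask convention (1.0 means "generate here", 0.0 means "keep the input") are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

Frame = List[List[float]]  # toy H x W frame; a real system would use tensors


def blank(height: int, width: int, value: float) -> Frame:
    """Return a constant-valued H x W frame."""
    return [[value] * width for _ in range(height)]


@dataclass
class VideoConditionUnit:
    """Hypothetical container: a prompt plus aligned frame and mask sequences."""
    prompt: str
    frames: List[Frame]
    masks: List[Frame]  # assumed convention: 1.0 = generate, 0.0 = keep input

    def __post_init__(self) -> None:
        if len(self.frames) != len(self.masks):
            raise ValueError("frames and masks must have the same length")


def reference_to_video(prompt: str, references: List[Frame],
                       num_frames: int) -> VideoConditionUnit:
    """Reference frames are kept (mask 0); the rest is generated (mask 1)."""
    h, w = len(references[0]), len(references[0][0])
    frames = list(references) + [blank(h, w, 0.0) for _ in range(num_frames)]
    masks = ([blank(h, w, 0.0) for _ in references]
             + [blank(h, w, 1.0) for _ in range(num_frames)])
    return VideoConditionUnit(prompt, frames, masks)


def video_to_video(prompt: str,
                   control_frames: List[Frame]) -> VideoConditionUnit:
    """Every frame carries a control signal and is regenerated in full."""
    h, w = len(control_frames[0]), len(control_frames[0][0])
    masks = [blank(h, w, 1.0) for _ in control_frames]
    return VideoConditionUnit(prompt, list(control_frames), masks)


def masked_video_to_video(prompt: str, video: List[Frame],
                          masks: List[Frame]) -> VideoConditionUnit:
    """Only the regions selected by the caller's masks are regenerated."""
    return VideoConditionUnit(prompt, list(video), list(masks))
```

Under these assumptions, every task reduces to the same triple of prompt, frames, and masks, so a single downstream model could consume any of them, and combined tasks (for example, a reference image plus a masked video) would amount to concatenating the corresponding frame and mask sequences.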