DynVFX: Augmenting Real Videos with Dynamic Content
February 5, 2025
Authors: Danah Yatim, Rafail Fridman, Omer Bar-Tal, Tali Dekel
cs.AI
Abstract
We present a method for augmenting real-world videos with newly generated
dynamic content. Given an input video and a simple user-provided text
instruction describing the desired content, our method synthesizes dynamic
objects or complex scene effects that naturally interact with the existing
scene over time. The position, appearance, and motion of the new content are
seamlessly integrated into the original footage while accounting for camera
motion, occlusions, and interactions with other dynamic objects in the scene,
resulting in a cohesive and realistic output video. We achieve this via a
zero-shot, training-free framework that harnesses a pre-trained text-to-video
diffusion transformer to synthesize the new content and a pre-trained Vision
Language Model to envision the augmented scene in detail. Specifically, we
introduce a novel inference-based method that manipulates features within the
attention mechanism, enabling accurate localization and seamless integration of
the new content while preserving the integrity of the original scene. Our
method is fully automated, requiring only a simple user instruction. We
demonstrate its effectiveness on a wide range of edits applied to real-world
videos, encompassing diverse objects and scenarios involving both camera and
object motion.
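To make the phrase "manipulates features within the attention mechanism" concrete, below is a minimal, hypothetical sketch of one common way such manipulation can be done at inference time: the edited video's queries attend to keys/values from both the generation stream and features cached from the original video, so the original scene anchors the output while new content is synthesized. All names (e.g., `k_orig`, `keep_mask`) and the masking strategy are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch (not the authors' exact implementation): "extended"
# self-attention in which tokens of the video being edited attend both to
# their own keys/values and to keys/values cached from the original video.
import torch

def extended_attention(q_edit, k_edit, v_edit, k_orig, v_orig, keep_mask=None):
    """
    q_edit:          (B, H, N, D) queries from the edited/generated stream
    k_edit, v_edit:  (B, H, N, D) keys/values from the same stream
    k_orig, v_orig:  (B, H, M, D) keys/values cached from the original video
    keep_mask:       optional (B, 1, N, M) additive bias (e.g., large negative
                     values) restricting which original-video tokens are visible,
                     for instance outside the region where new content appears
    """
    # Concatenate original-video features into the attention context.
    k = torch.cat([k_edit, k_orig], dim=2)            # (B, H, N+M, D)
    v = torch.cat([v_edit, v_orig], dim=2)

    scale = q_edit.shape[-1] ** -0.5
    attn = (q_edit @ k.transpose(-2, -1)) * scale     # (B, H, N, N+M)

    if keep_mask is not None:
        # Bias only the scores over the original-video tokens.
        bias = torch.zeros_like(attn)
        bias[..., k_edit.shape[2]:] = keep_mask
        attn = attn + bias

    attn = attn.softmax(dim=-1)
    return attn @ v                                   # (B, H, N, D)
```

In such a scheme, the cached original-video features preserve the existing scene, while the query stream remains free to synthesize the newly prompted content; how the mask and injection steps are scheduled across denoising is specific to the paper and not shown here.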