Video Creation by Demonstration
December 12, 2024
Authors: Yihong Sun, Hao Zhou, Liangzhe Yuan, Jennifer J. Sun, Yandong Li, Xuhui Jia, Hartwig Adam, Bharath Hariharan, Long Zhao, Ting Liu
cs.AI
Abstract
We explore a novel video creation experience, namely Video Creation by Demonstration. Given a demonstration video and a context image from a different scene, we generate a physically plausible video that continues naturally from the context image and carries out the action concepts from the demonstration. To enable this capability, we present delta-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction. Unlike most existing video generation controls that are based on explicit signals, we adopt the form of implicit latent control for the maximal flexibility and expressiveness required by general videos. By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process with minimal appearance leakage. Empirically, delta-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations, and demonstrates potential towards interactive world simulation. Sampled video generation results are available at https://delta-diffusion.github.io/.
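To make the conditioning pipeline described in the abstract more concrete, the sketch below shows one way an action latent could be extracted from a demonstration video through an appearance bottleneck on top of a (frozen) video encoder, and then used together with a context image to condition a future-frame predictor. All module names (`AppearanceBottleneck`, `ConditionalDenoiser`), shapes, and the specific conditioning mechanism are illustrative assumptions for exposition only, not the authors' delta-Diffusion implementation.

```python
# Illustrative sketch only: module names, dimensions, and the way the action
# latent and context image are injected are assumptions, not the paper's method.
import torch
import torch.nn as nn


class AppearanceBottleneck(nn.Module):
    """Hypothetical bottleneck head: projects per-frame features from a video
    foundation model to a low-dimensional code intended to retain action
    information while limiting appearance leakage."""

    def __init__(self, feat_dim: int = 1024, action_dim: int = 32):
        super().__init__()
        self.proj = nn.Linear(feat_dim, action_dim)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, frames, feat_dim) from a frozen video encoder.
        return self.proj(video_feats)  # (batch, frames, action_dim) "action latents"


class ConditionalDenoiser(nn.Module):
    """Stand-in for a video diffusion backbone performing conditional
    future-frame prediction, conditioned on a context image and action latents."""

    def __init__(self, action_dim: int = 32, img_channels: int = 3):
        super().__init__()
        self.action_mlp = nn.Sequential(nn.Linear(action_dim, 128), nn.SiLU())
        self.denoise = nn.Conv3d(img_channels, img_channels, kernel_size=3, padding=1)

    def forward(self, noisy_video, context_image, action_latents):
        # noisy_video: (B, C, T, H, W); context_image: (B, C, H, W)
        cond = self.action_mlp(action_latents.mean(dim=1))      # pooled action code
        x = noisy_video + context_image.unsqueeze(2)            # inject context (toy)
        x = x + cond.mean(dim=-1).view(-1, 1, 1, 1, 1)          # inject action signal (toy)
        return self.denoise(x)                                  # predicted frames / noise


# Toy forward pass with random tensors standing in for real data.
feats = torch.randn(2, 16, 1024)         # demonstration-video features
action = AppearanceBottleneck()(feats)   # low-dimensional action latents
noisy = torch.randn(2, 3, 16, 64, 64)    # noisy target video
context = torch.randn(2, 3, 64, 64)      # context image from a different scene
pred = ConditionalDenoiser()(noisy, context, action)
print(pred.shape)                         # torch.Size([2, 3, 16, 64, 64])
```

In this toy setup, self-supervision comes from the fact that both the action latents and the prediction target are derived from the same unlabeled video during training, while at inference the demonstration video and the context image come from different scenes.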