VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
February 4, 2025
Authors: Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
cs.AI
Abstract
Despite tremendous recent progress, generative video models still struggle to
capture real-world motion, dynamics, and physics. We show that this limitation
arises from the conventional pixel reconstruction objective, which biases
models toward appearance fidelity at the expense of motion coherence. To
address this, we introduce VideoJAM, a novel framework that instills an
effective motion prior in video generators by encouraging the model to learn a
joint appearance-motion representation. VideoJAM is composed of two
complementary units. During training, we extend the objective to predict both
the generated pixels and their corresponding motion from a single learned
representation. During inference, we introduce Inner-Guidance, a mechanism that
steers the generation toward coherent motion by leveraging the model's own
evolving motion prediction as a dynamic guidance signal. Notably, our framework
can be applied to any video model with minimal adaptations, requiring no
modifications to the training data or scaling of the model. VideoJAM achieves
state-of-the-art performance in motion coherence, surpassing highly competitive
proprietary models while also enhancing the perceived visual quality of the
generations. These findings emphasize that appearance and motion can be
complementary and, when effectively integrated, enhance both the visual quality
and the coherence of video generation. Project website:
https://hila-chefer.github.io/videojam-paper.github.io/
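
The two units described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the linear heads, the unweighted sum of the two losses, and the guidance combination (written in the style of classifier-free guidance) are all assumptions, and every function and parameter name below is hypothetical.

```python
import numpy as np


def joint_loss(hidden, w_app, w_motion, target_app, target_motion):
    """Toy joint objective: two linear heads read the SAME hidden
    representation, one predicting appearance and one predicting motion.
    Assumption: the two mean-squared errors are simply added."""
    pred_app = hidden @ w_app          # appearance head
    pred_motion = hidden @ w_motion    # motion head
    loss_app = np.mean((pred_app - target_app) ** 2)
    loss_motion = np.mean((pred_motion - target_motion) ** 2)
    return loss_app + loss_motion


def guided_prediction(full, without_text, without_motion, s_text, s_motion):
    """Toy guidance step. `full` is the prediction with all conditioning;
    the other two are predictions with one signal dropped. Assumption:
    each difference is scaled and added back, as in classifier-free
    guidance, with the model's own motion prediction as one signal."""
    return (full
            + s_text * (full - without_text)
            + s_motion * (full - without_motion))


if __name__ == "__main__":
    hidden = np.ones((2, 3))
    loss = joint_loss(hidden, np.zeros((3, 4)), np.zeros((3, 2)),
                      np.ones((2, 4)), 2 * np.ones((2, 2)))
    print("joint loss:", loss)

    full = np.array([1.0, 2.0])
    out = guided_prediction(full, np.zeros(2), np.ones(2), 1.0, 2.0)
    print("guided prediction:", out)
```

With both scales set to zero, `guided_prediction` returns the fully conditioned prediction unchanged, which makes the effect of each guidance term easy to isolate.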