VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Abstract

Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior in video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
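The two units described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so everything below is an assumption made for illustration. The names `joint_loss`, `guided_prediction`, `w_app`, `w_mot`, `w_text`, and `w_motion` are hypothetical, the linear heads stand in for whatever layers a real video model would use, and the guidance rule is simply borrowed from the familiar classifier-free-guidance pattern, extended with a second term for the motion signal.

```python
import numpy as np


def joint_loss(features, w_app, w_mot, target_pixels, target_motion):
    """Toy joint objective: two predictions read off ONE shared representation.

    features      : (n, d) shared representation (hypothetical)
    w_app, w_mot  : (d, k) linear heads for appearance and motion (hypothetical)
    """
    pred_pixels = features @ w_app  # appearance prediction
    pred_motion = features @ w_mot  # motion prediction from the same features
    loss_app = np.mean((pred_pixels - target_pixels) ** 2)
    loss_mot = np.mean((pred_motion - target_motion) ** 2)
    # Assumption: the two terms are summed with equal weight.
    return loss_app + loss_mot


def guided_prediction(pred_full, pred_no_text, pred_no_motion, w_text, w_motion):
    """Assumed guidance rule in the style of classifier-free guidance.

    pred_full      : prediction conditioned on text and the model's own motion estimate
    pred_no_text   : same, with the text condition dropped
    pred_no_motion : same, with the motion estimate dropped
    """
    return (
        pred_full
        + w_text * (pred_full - pred_no_text)
        + w_motion * (pred_full - pred_no_motion)
    )


# Small deterministic inputs for a quick sanity check.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3))
w_app = rng.normal(size=(3, 2))
w_mot = rng.normal(size=(3, 2))

# Targets that the heads reproduce exactly, so the joint loss is zero.
perfect_loss = joint_loss(features, w_app, w_mot, features @ w_app, features @ w_mot)

# With both guidance weights at zero, the rule returns the full prediction unchanged.
pred = rng.normal(size=(2, 2))
unguided = guided_prediction(pred, pred * 0.5, pred * 0.25, w_text=0.0, w_motion=0.0)
```

The point of the sketch is structural: both loss terms depend on the same `features`, so minimizing the motion term shapes the representation that the appearance head also uses, and at sampling time the motion estimate acts as one more condition that the prediction can be pushed toward.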
