SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

October 30, 2024
Authors: Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Lingjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Yingnian Wu, Lijuan Wang
cs.AI

Abstract

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm significantly enhances performance on long-horizon planning tasks as well. Project Website: https://slowfast-vgen.github.io
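
To make the two-speed scheme concrete, below is a minimal PyTorch sketch of the inner fast-learning loop nested inside an outer multi-episode loop. Everything here is an illustrative assumption rather than the authors' released code: the names TemporalLoRA and fast_learning_step are hypothetical, the MSE objective stands in for the paper's diffusion loss, and the slow learner (a masked conditional video diffusion model) is elided entirely. The sketch only shows the core idea from the abstract: episodic memory is stored by updating low-rank temporal adapter parameters on the local inputs and outputs of each generated chunk.

```python
# Sketch of the slow-fast learning loop (assumptions noted above; not the
# authors' implementation).
import torch
import torch.nn as nn


class TemporalLoRA(nn.Module):
    """Hypothetical low-rank adapter along the temporal feature dimension.

    Episodic memory lives in the adapter weights: y = x + up(down(x)),
    where down/up are low-rank, so inference-time updates stay cheap.
    """

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # project to rank
        self.up = nn.Linear(rank, dim, bias=False)    # project back to dim
        nn.init.zeros_(self.up.weight)                # start as identity residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.down(x))


def fast_learning_step(lora: TemporalLoRA,
                       optimizer: torch.optim.Optimizer,
                       local_input: torch.Tensor,
                       local_output: torch.Tensor) -> float:
    """Inner (fast) loop step: fit only the LoRA parameters to the
    (input, output) pair of the chunk just generated."""
    optimizer.zero_grad()
    pred = lora(local_input)
    # Stand-in objective; the paper trains a video diffusion model instead.
    loss = nn.functional.mse_loss(pred, local_output)
    loss.backward()
    optimizer.step()
    return loss.item()


dim = 64
lora = TemporalLoRA(dim)
opt = torch.optim.Adam(lora.parameters(), lr=1e-3)

for episode in range(3):          # outer (slow) loop over episodes
    for chunk in range(4):        # inner (fast) loop over video chunks
        x = torch.randn(2, 16, dim)   # dummy latent frames (B, T, C)
        y = torch.randn(2, 16, dim)   # dummy target latents
        fast_learning_step(lora, opt, x, y)
    # SlowFast-VGen's outer loop would consolidate the stored episodic
    # memories back into the slow (pre-trained) weights; omitted here.
```

Zero-initializing the up-projection is the standard LoRA trick: the adapter starts as an identity residual, so fast learning begins exactly from the slow model's behavior and only drifts as episodic memory accumulates.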
