
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation

October 30, 2024
Authors: Yining Hong, Beide Liu, Maxine Wu, Yuanhao Zhai, Kai-Wei Chang, Lingjie Li, Kevin Lin, Chung-Ching Lin, Jianfeng Wang, Zhengyuan Yang, Yingnian Wu, Lijuan Wang
cs.AI

Abstract

Human beings are endowed with a complementary learning system, which bridges the slow learning of general world dynamics with fast storage of episodic memory from a new experience. Previous video generation models, however, primarily focus on slow learning by pre-training on vast amounts of data, overlooking the fast learning phase crucial for episodic memory storage. This oversight leads to inconsistencies across temporally distant frames when generating longer videos, as these frames fall beyond the model's context window. To this end, we introduce SlowFast-VGen, a novel dual-speed learning system for action-driven long video generation. Our approach incorporates a masked conditional video diffusion model for the slow learning of world dynamics, alongside an inference-time fast learning strategy based on a temporal LoRA module. Specifically, the fast learning process updates its temporal LoRA parameters based on local inputs and outputs, thereby efficiently storing episodic memory in its parameters. We further propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop, enabling the recall of prior multi-episode experiences for context-aware skill learning. To facilitate the slow learning of an approximate world model, we collect a large-scale dataset of 200k videos with language action annotations, covering a wide range of scenarios. Extensive experiments show that SlowFast-VGen outperforms baselines across various metrics for action-driven video generation, achieving an FVD score of 514 compared to 782, and maintaining consistency in longer videos, with an average of 0.37 scene cuts versus 0.89. The slow-fast learning loop algorithm significantly enhances performance on long-horizon planning tasks as well. Project Website: https://slowfast-vgen.github.io
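
Below is a minimal, self-contained PyTorch sketch of how the slow-fast learning loop described in the abstract could be wired together. The names (`TemporalLoRA`, `fast_learn`, `slow_fast_loop`), the toy linear backbone, and the plain reconstruction loss are illustrative assumptions, not the authors' released code; the actual system attaches temporal LoRA adapters inside a masked conditional video diffusion model and trains with a diffusion objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM = 64  # toy feature dimension standing in for a temporal layer's width


class TemporalLoRA(nn.Module):
    """Low-rank additive adapter: x -> up(down(x)); the up-projection is
    zero-initialized, so the adapter is a no-op before any fast learning."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)

    def forward(self, x):
        return self.up(self.down(x))


def fast_learn(base, lora, chunks, steps=20, lr=1e-2):
    """Inner (fast) loop: fit the temporal LoRA to the local input/output
    chunks of the current episode. The base model's output is detached, so
    episodic memory is stored only in the LoRA parameters."""
    opt = torch.optim.Adam(lora.parameters(), lr=lr)
    for _ in range(steps):
        for x, y in chunks:
            loss = F.mse_loss(base(x).detach() + lora(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()


def slow_fast_loop(base, episodes, outer_steps=3, slow_lr=1e-3):
    """Outer (slow) loop: after each episode's inner fast loop, consolidate
    the pooled multi-episode experience into the base model's slow weights
    while the episodic adapter stays frozen."""
    slow_opt = torch.optim.Adam(base.parameters(), lr=slow_lr)
    for _ in range(outer_steps):
        for chunks in episodes:
            lora = TemporalLoRA(DIM)       # fresh episodic memory per episode
            fast_learn(base, lora, chunks)
            for x, y in chunks:            # slow update; LoRA output detached
                loss = F.mse_loss(base(x) + lora(x).detach(), y)
                slow_opt.zero_grad()
                loss.backward()
                slow_opt.step()


if __name__ == "__main__":
    base = nn.Linear(DIM, DIM)  # stand-in for the video diffusion backbone
    episodes = [[(torch.randn(4, DIM), torch.randn(4, DIM)) for _ in range(2)]
                for _ in range(3)]
    slow_fast_loop(base, episodes)
```

The sketch preserves the key design points stated in the abstract: the LoRA branch starts as the zero map so generation is unchanged until fast learning begins; the inner loop updates only the adapter, storing the episode in its parameters; and the outer loop integrates the inner loop, consolidating experience across episodes into the slow weights.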
