
One-Minute Video Generation with Test-Time Training

April 7, 2025
Authors: Karan Dalal, Daniel Koceja, Gashon Hussein, Jiarui Xu, Yue Zhao, Youjin Song, Shihao Han, Ka Chun Cheung, Jan Kautz, Carlos Guestrin, Tatsunori Hashimoto, Sanmi Koyejo, Yejin Choi, Yu Sun, Xiaolong Wang
cs.AI

Abstract

Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states can themselves be neural networks and are therefore more expressive. Adding TTT layers to a pre-trained Transformer enables it to generate one-minute videos from text storyboards. As a proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, the results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code, and annotations are available at: https://test-time-training.github.io/video-dit
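
The idea that a hidden state "can itself be a neural network" may be easier to see in code. The sketch below is an illustration only, not the authors' implementation: it assumes the simplest possible inner model (a single linear map `W`), a plain reconstruction loss on each token, and one step of gradient descent per token. The function names `matvec` and `ttt_linear` and the learning rate `lr` are hypothetical; the paper's layers add learned projections and a richer inner model, which are omitted here.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ttt_linear(tokens, dim, lr=0.5):
    """Toy TTT-style layer: the hidden state is the weight matrix W.

    Assumptions (not from the paper): linear inner model, loss
    0.5 * ||W x - x||^2 on each token, one SGD step per token.
    """
    W = [[0.0] * dim for _ in range(dim)]  # hidden state = model weights
    outputs = []
    for x in tokens:
        # Residual of the self-supervised reconstruction task on this token.
        err = [p - t for p, t in zip(matvec(W, x), x)]
        # Gradient of the loss w.r.t. W is outer(err, x); take one step.
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * err[i] * x[j]
        # The layer's output uses the freshly updated hidden state.
        outputs.append(matvec(W, x))
    return outputs, W


if __name__ == "__main__":
    outs, W = ttt_linear([[1.0, 0.0], [1.0, 0.0]], dim=2)
    print(outs)  # reconstruction of the repeated token improves step by step
```

Because the state is updated by training on the incoming tokens, its capacity is that of the inner model rather than of a fixed-size vector, which is the sense in which the abstract calls it more expressive.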

