One-Minute Video Generation with Test-Time Training

Abstract

Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states can themselves be neural networks and are therefore more expressive. Adding TTT layers to a pre-trained Transformer enables it to generate one-minute videos from text storyboards. As a proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, the results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code, and annotations are available at: https://test-time-training.github.io/video-dit
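To make the phrase "hidden states that are themselves neural networks" concrete, here is a minimal sketch of a TTT-style layer. It is an illustration under stated assumptions, not the authors' implementation: the inner model is a single linear map (chosen here only for brevity), the projections `proj_k`, `proj_v`, `proj_q`, the learning rate `lr`, and the function name `ttt_layer_sketch` are all hypothetical, and the inner loss is a simple reconstruction objective. The hidden state is the weight matrix `W`, which takes one gradient step per token before producing that token's output.

```python
import numpy as np

def ttt_layer_sketch(xs, proj_k, proj_v, proj_q, lr=0.5):
    """Toy TTT-style layer over a sequence xs of shape (T, d).

    Assumption: the hidden state is the weight matrix W of an inner
    linear model, updated by one gradient step per token on the loss
    0.5 * ||W @ k - v||^2, where k and v are projections of the token.
    """
    T, d = xs.shape
    W = np.zeros((d, d))           # hidden state = weights of the inner model
    outputs = np.empty((T, d))
    losses = np.empty(T)
    for t, x in enumerate(xs):
        k, v, q = proj_k @ x, proj_v @ x, proj_q @ x
        err = W @ k - v            # residual of the inner model on this token
        losses[t] = 0.5 * float(err @ err)   # loss measured before the update
        W = W - lr * np.outer(err, k)        # gradient of the loss w.r.t. W
        outputs[t] = W @ q         # output uses the freshly updated state
    return outputs, losses

# Usage: repeat one unit vector; the inner model learns to reconstruct it.
d = 4
eye = np.eye(d)                    # identity projections (assumption)
x = np.zeros(d)
x[0] = 1.0
xs = np.tile(x, (8, 1))
outputs, losses = ttt_layer_sketch(xs, eye, eye, eye)
```

In this toy setting the inner loss shrinks at every step, because each update moves `W` toward reconstructing the repeated token. A more expressive inner model, such as a small multi-layer network, would replace the linear map while keeping the same update-then-output loop.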
