LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
December 13, 2024
Authors: Hongjie Wang, Chih-Yao Ma, Yen-Cheng Liu, Ji Hou, Tao Xu, Jialiang Wang, Felix Juefei-Xu, Yaqiao Luo, Peizhao Zhang, Tingbo Hou, Peter Vajda, Niraj K. Jha, Xiaoliang Dai
cs.AI
Abstract
Text-to-video generation enhances content creation but is highly
computationally intensive: The computational cost of Diffusion Transformers
(DiTs) scales quadratically in the number of pixels. This makes minute-length
video generation extremely expensive, limiting most existing models to
generating videos only 10-20 seconds long. We propose a Linear-complexity
text-to-video Generation (LinGen) framework whose cost scales linearly in the
number of pixels. For the first time, LinGen enables high-resolution
minute-length video generation on a single GPU without compromising quality. It
replaces the computationally-dominant and quadratic-complexity block,
self-attention, with a linear-complexity block called MATE, which consists of
an MA-branch and a TE-branch. The MA-branch targets short-to-long-range
correlations, combining a bidirectional Mamba2 block with our token
rearrangement method, Rotary Major Scan, and our review tokens developed for
long video generation. The TE-branch is a novel TEmporal Swin Attention block
that focuses on temporal correlations between adjacent tokens and medium-range
tokens. The MATE block addresses the adjacency preservation issue of Mamba and
improves the consistency of generated videos significantly. Experimental
results show that LinGen outperforms DiT (with a 75.6% win rate) in video
quality with up to 15× (11.5×) FLOPs (latency) reduction.
Furthermore, both automatic metrics and human evaluation demonstrate our
LinGen-4B yields comparable video quality to state-of-the-art models (with a
50.5%, 52.1%, 49.1% win rate with respect to Gen-3, LumaLabs, and Kling,
respectively). This paves the way to hour-length movie generation and real-time
interactive video generation. We provide 68s video generation results and more
examples on our project website: https://lineargen.github.io/.
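
To make the scaling argument concrete, the following is a minimal back-of-the-envelope sketch, not the paper's cost model. It compares the approximate FLOPs of the two large matrix products in self-attention (quadratic in the token count) with a hypothetical block that does a fixed amount of work per token (linear). The frame rate, tokens per frame, hidden dimension, and `per_token_factor` are illustrative assumptions, not values from LinGen.

```python
def attention_flops(num_tokens: int, dim: int) -> int:
    """Approximate FLOPs of the QK^T and attention-times-V products in one layer.

    Each product costs about 2 * N * N * d FLOPs (one multiply and one add
    per term), so the pair costs about 4 * N^2 * d.
    """
    return 4 * num_tokens * num_tokens * dim


def linear_block_flops(num_tokens: int, dim: int, per_token_factor: int = 64) -> int:
    """FLOPs of a hypothetical linear-complexity block.

    per_token_factor is an assumed constant amount of work per token and
    channel; it is not taken from the paper.
    """
    return per_token_factor * num_tokens * dim


def tokens_for_video(seconds: int, fps: int = 16, tokens_per_frame: int = 1024) -> int:
    """Token count for a video under assumed frame rate and tokens per frame."""
    return seconds * fps * tokens_per_frame


DIM = 1024  # assumed hidden dimension

for seconds in (10, 20, 68):
    n = tokens_for_video(seconds)
    ratio = attention_flops(n, DIM) / linear_block_flops(n, DIM)
    print(f"{seconds:>3}s  tokens={n:>9,}  quadratic/linear cost ratio={ratio:,.0f}")
```

Under these assumptions the ratio between the two costs grows in proportion to the token count: doubling the video length doubles the gap. This is why a quadratic block comes to dominate as videos approach minute length.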