LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Abstract

Text-to-video generation enhances content creation but is highly computationally intensive: the computational cost of Diffusion Transformers (DiTs) scales quadratically with the number of pixels. This makes minute-length video generation extremely expensive and limits most existing models to videos of only 10-20 seconds. We propose a Linear-complexity text-to-video Generation (LinGen) framework whose cost scales linearly with the number of pixels. For the first time, LinGen enables high-resolution minute-length video generation on a single GPU without compromising quality. It replaces self-attention, the computationally dominant, quadratic-complexity block, with a linear-complexity block called MATE, which consists of an MA-branch and a TE-branch. The MA-branch targets short-to-long-range correlations, combining a bidirectional Mamba2 block with our token rearrangement method, Rotary Major Scan, and our review tokens developed for long video generation. The TE-branch is a novel TEmporal Swin Attention block that focuses on temporal correlations between adjacent tokens and medium-range tokens. The MATE block addresses the adjacency preservation issue of Mamba and significantly improves the consistency of generated videos. Experimental results show that LinGen outperforms DiT in video quality (with a 75.6% win rate) while reducing FLOPs by up to 15× and latency by up to 11.5×. Furthermore, both automatic metrics and human evaluation show that our LinGen-4B yields video quality comparable to state-of-the-art models (with win rates of 50.5%, 52.1%, and 49.1% against Gen-3, LumaLabs, and Kling, respectively). This paves the way to hour-length movie generation and real-time interactive video generation. We provide 68-second video generation results and more examples on our project website: https://lineargen.github.io/.
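The contrast between quadratic and linear scaling can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not LinGen's actual FLOP accounting: the function names, the token rate of 1000 tokens per second of video, the channel width, and the state size are all hypothetical values chosen for illustration.

```python
def tokens_for(seconds: int, tokens_per_second: int = 1000) -> int:
    """Hypothetical token count for a video of the given length."""
    return seconds * tokens_per_second


def attention_cost(num_tokens: int, dim: int) -> int:
    """Toy self-attention cost: every token attends to every token, ~2 * N^2 * d."""
    return 2 * num_tokens * num_tokens * dim


def linear_block_cost(num_tokens: int, dim: int, state: int = 16) -> int:
    """Toy cost of a linear-complexity block: one fixed-size update per token, ~N * d * state."""
    return num_tokens * dim * state


if __name__ == "__main__":
    dim = 64  # assumed channel width, for illustration only
    base = tokens_for(17)
    for seconds in (17, 34, 68):
        n = tokens_for(seconds)
        # Growth relative to the 17-second baseline under each cost model.
        quad_growth = attention_cost(n, dim) / attention_cost(base, dim)
        lin_growth = linear_block_cost(n, dim) / linear_block_cost(base, dim)
        print(f"{seconds:>3}s  quadratic x{quad_growth:g}  linear x{lin_growth:g}")
```

Under this model, doubling the video length quadruples the quadratic cost but only doubles the linear one, which is why the gap between the two widens as videos approach minute length.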
