Autoregressive Video Generation without Vector Quantization
December 18, 2024
作者: Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, Xinlong Wang
cs.AI
Abstract
This paper presents a novel approach that enables autoregressive video
generation with high efficiency. We propose to reformulate the video generation
problem as a non-quantized autoregressive modeling of temporal frame-by-frame
prediction and spatial set-by-set prediction. Unlike raster-scan prediction in
prior autoregressive models or joint distribution modeling of fixed-length
tokens in diffusion models, our approach maintains the causal property of
GPT-style models for flexible in-context capabilities, while leveraging
bidirectional modeling within individual frames for efficiency. With the
proposed approach, we train a novel video autoregressive model without vector
quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior
autoregressive video models in data efficiency, inference speed, visual
fidelity, and video fluency, even with a much smaller model capacity, i.e.,
0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models
in text-to-image generation tasks, with a significantly lower training cost.
Additionally, NOVA generalizes well across extended video durations and enables
diverse zero-shot applications in one unified model. Code and models are
publicly available at https://github.com/baaivision/NOVA.
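The temporal-causal, spatially-bidirectional design described above can be pictured as a block-wise attention mask: tokens attend freely within their own frame but only causally to tokens of earlier frames. Below is a minimal PyTorch sketch of such a mask, assuming each frame is flattened into a fixed number of tokens; the function name and shapes are illustrative assumptions, not code from the paper or its repository.

```python
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Build a boolean attention mask (True = may attend) in which tokens
    attend bidirectionally within their own frame but only causally to
    tokens of earlier frames, mirroring the frame-by-frame causal /
    intra-frame bidirectional scheme described in the abstract.
    Illustrative sketch only, not NOVA's actual implementation."""
    total = num_frames * tokens_per_frame
    # Frame index of every token position in the flattened sequence.
    frame_id = torch.arange(total) // tokens_per_frame
    # Query i may attend key j iff j's frame is not later than i's frame.
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# Example: 3 frames of 4 tokens each -> a 12x12 mask made of 4x4
# all-True blocks on and below the frame-diagonal.
mask = block_causal_mask(num_frames=3, tokens_per_frame=4)
print(mask.int())
```

Within a frame, the spatial set-by-set prediction would then fill the frame's tokens over several bidirectional steps rather than a single raster-scan pass; that scheduling is orthogonal to the mask shown here.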