Autoregressive Video Generation without Vector Quantization

December 18, 2024
Authors: Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, Xinlong Wang
cs.AI

Abstract

This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as a non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost. Additionally, NOVA generalizes well across extended video durations and enables diverse zero-shot applications in one unified model. Code and models are publicly available at https://github.com/baaivision/NOVA.
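The core design described above, causal (GPT-style) prediction across frames combined with bidirectional modeling within each frame, can be illustrated as a block-causal attention mask. The sketch below is a minimal, hypothetical illustration of that pattern, not the official NOVA implementation; the function name, shapes, and token layout are assumptions for demonstration.

```python
# Minimal sketch (assumed, not NOVA's actual code) of a block-causal
# attention mask: tokens attend bidirectionally within their own frame,
# but only causally to tokens of earlier frames.
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask where True marks positions a query may attend to.

    A token in frame t can attend to every token in frames 0..t:
    bidirectional within a frame, causal across frames.
    """
    n = num_frames * tokens_per_frame
    # Frame index of each flattened token position.
    frame_id = torch.arange(n) // tokens_per_frame
    # Query i may attend key j iff frame(j) <= frame(i).
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# Example: 3 frames of 4 tokens each -> a 12x12 block lower-triangular mask.
mask = block_causal_mask(num_frames=3, tokens_per_frame=4)
print(mask.int())
```

Compared with a plain token-level causal mask (strict lower-triangular), this block structure is what lets each frame be decoded with full bidirectional context internally while the frame sequence as a whole remains autoregressive.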
