Autoregressive Video Generation without Vector Quantization

Abstract

This paper presents a novel approach that enables autoregressive video generation with high efficiency. We propose to reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction and spatial set-by-set prediction. Unlike raster-scan prediction in prior autoregressive models or joint distribution modeling of fixed-length tokens in diffusion models, our approach maintains the causal property of GPT-style models for flexible in-context capabilities, while leveraging bidirectional modeling within individual frames for efficiency. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity, i.e., 0.6B parameters. NOVA also outperforms state-of-the-art image diffusion models in text-to-image generation tasks, with a significantly lower training cost. Additionally, NOVA generalizes well across extended video durations and enables diverse zero-shot applications in one unified model. Code and models are publicly available at https://github.com/baaivision/NOVA.
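
The combination of causal modeling across frames and bidirectional modeling within a frame can be pictured as a frame-wise block-causal attention mask. The sketch below is only an illustration of that idea under stated assumptions: the function name frame_causal_mask, its parameters, and the flat token layout (tokens grouped frame by frame) are hypothetical and are not taken from the NOVA code base, and the spatial set-by-set prediction order is not modeled here.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Build a boolean attention mask for a flattened video token sequence.

    mask[q][k] is True when query token q may attend to key token k.
    Assumption (illustrative, not NOVA's implementation): tokens are laid
    out frame by frame, so token index // tokens_per_frame is the frame index.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame
        # A query sees every token in its own frame (bidirectional within
        # the frame) and in all earlier frames (causal across frames).
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Print the mask for 3 frames of 2 tokens each: "x" = allowed, "." = blocked.
    for row in frame_causal_mask(num_frames=3, tokens_per_frame=2):
        print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, each frame's tokens share the same row pattern: they can see one another and all earlier frames, but no later frame. A purely raster-scan causal mask would instead be strictly lower-triangular at the token level.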
