Next Block Prediction: Video Generation via Semi-Autoregressive Modeling

February 11, 2025
Authors: Shuhuai Ren, Shuming Ma, Xu Sun, Furu Wei
cs.AI

Abstract

Next-Token Prediction (NTP) is a de facto approach for autoregressive (AR) video generation, but it suffers from suboptimal unidirectional dependencies and slow inference speed. In this work, we propose a semi-autoregressive (semi-AR) framework, called Next-Block Prediction (NBP), for video generation. By uniformly decomposing video content into equal-sized blocks (e.g., rows or frames), we shift the generation unit from individual tokens to blocks, allowing each token in the current block to simultaneously predict the corresponding token in the next block. Unlike traditional AR modeling, our framework employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. By predicting multiple tokens in parallel, NBP models significantly reduce the number of generation steps, leading to faster and more efficient inference. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4 FVD points. Furthermore, thanks to the reduced number of inference steps, the NBP model generates 8.89 frames (128×128 resolution) per second, achieving an 11x speedup. We also explored model scales ranging from 700M to 3B parameters, observing significant improvements in generation quality, with FVD scores dropping from 103.3 to 55.3 on UCF101 and from 25.5 to 19.5 on K600, demonstrating the scalability of our approach.
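
The block-wise scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the flat token layout, and the example sizes are all assumptions made for illustration. It shows three ideas under those assumptions: an attention mask that is bidirectional inside a block and causal across blocks, the pairing of each token with the token at the same offset in the next block, and the resulting count of sequential decoding steps.

```python
from __future__ import annotations


def block_causal_mask(num_tokens: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j], True when token i may attend to token j.

    Assumed layout: tokens are a flat sequence split into equal blocks.
    A token sees every token in its own block (bidirectional) and every
    token in earlier blocks (causal across blocks).
    """
    if num_tokens % block_size != 0:
        raise ValueError("num_tokens must be divisible by block_size")
    return [
        [j // block_size <= i // block_size for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def next_block_targets(tokens: list[int], block_size: int) -> list[tuple[int, int]]:
    """Pair each token with the token at the same offset in the next block."""
    return [
        (tokens[i], tokens[i + block_size])
        for i in range(len(tokens) - block_size)
    ]


def generation_steps(num_tokens: int, block_size: int) -> int:
    """Sequential decoding steps needed when one block is emitted per step."""
    return -(-num_tokens // block_size)  # ceiling division


# Hypothetical sizes, chosen only to make the behaviour visible.
mask = block_causal_mask(num_tokens=6, block_size=3)
pairs = next_block_targets([10, 11, 12, 13, 14, 15], block_size=3)
steps_ntp = generation_steps(num_tokens=768, block_size=1)
steps_nbp = generation_steps(num_tokens=768, block_size=16)
```

With a block size of 1 the mask reduces to the ordinary causal mask and the targets reduce to next-token prediction, so NTP is the special case of this sketch. Larger blocks cut the number of sequential steps in proportion to the block size; the measured speedup reported in the abstract depends on details the sketch does not model.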
