
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation

April 17, 2025
Authors: Lvmin Zhang, Maneesh Agrawala
cs.AI

Abstract

We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frames so that the transformer context length stays fixed regardless of the video length. As a result, we can process a large number of frames with video diffusion while keeping the computation bottleneck similar to that of image diffusion. This also allows significantly larger training batch sizes for video (comparable to those used in image diffusion training). We also propose an anti-drifting sampling method that generates frames in inverted temporal order with early-established endpoints, avoiding exposure bias (error accumulation over iterations). Finally, we show that existing video diffusion models can be fine-tuned with FramePack, and that their visual quality may improve because next-frame prediction supports more balanced diffusion schedulers with less extreme flow-shift timesteps.
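
The packing idea can be made concrete with a small sketch. This is not the authors' implementation: the function names (`tokens_per_frame`, `pack_history`) and the specific geometric compression schedule are illustrative assumptions, but they show why the total context length stays bounded no matter how many past frames are packed in.

```python
def tokens_per_frame(base_tokens: int, age: int, ratio: int = 2) -> int:
    """Tokens kept for a frame that lies `age` steps in the past.

    The newest frame keeps `base_tokens`; every older frame is compressed
    by a further factor of `ratio`, so very old frames shrink to zero tokens
    and the series base + base/ratio + base/ratio^2 + ... never exceeds
    base * ratio / (ratio - 1).
    """
    return base_tokens // (ratio ** age)


def pack_history(num_past_frames: int, base_tokens: int = 1536) -> list[int]:
    """Token budget assigned to each past frame, newest first."""
    return [tokens_per_frame(base_tokens, age) for age in range(num_past_frames)]


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        # The total stays below 2 * 1536 tokens no matter how long the video is.
        print(f"{n:3d} past frames -> {sum(pack_history(n))} context tokens")
```

The anti-drifting sampler can be sketched in the same spirit: establish the endpoint first, then fill in earlier sections conditioned on the already-known future. Here `generate_section` is a placeholder for any next-section sampler, not an API from the paper.

```python
def sample_anti_drifting(num_sections, generate_section, start_frame):
    """Generate `num_sections` video sections in inverted temporal order."""
    sections = [None] * num_sections
    # Establish the endpoint first, conditioned only on the start frame.
    sections[-1] = generate_section(context=[start_frame])
    # Work backwards: each earlier section is conditioned on the start frame
    # plus the already-generated later sections, so errors cannot accumulate
    # forward through the video.
    for i in range(num_sections - 2, -1, -1):
        sections[i] = generate_section(context=[start_frame] + sections[i + 1:])
    return sections
```

Because every section is generated with an already-fixed endpoint in its context, the forward error accumulation that the abstract calls exposure bias has no chance to build up.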
