LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

October 28, 2024
Authors: Hanyu Wang, Saksham Suri, Yixuan Ren, Hao Chen, Abhinav Shrivastava
cs.AI

Abstract

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token in its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).
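The two mechanisms the abstract describes, holistic query-based tokenization and a training-time AR prior, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption of ours rather than LARP's released implementation: the module names (`HolisticTokenizer`, `TrainTimePrior`), layer counts, and sizes (`num_queries=256`, `dim=512`, `codebook_size=1024`) are hypothetical, and the reconstruction decoder plus the usual VQ codebook/commitment losses are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HolisticTokenizer(nn.Module):
    """Gathers video information into K learned holistic queries via
    cross-attention, then vector-quantizes the K query outputs into
    K discrete tokens (VQ-VAE-style nearest-neighbor lookup)."""

    def __init__(self, num_queries=256, dim=512, codebook_size=1024):
        super().__init__()
        # Learned holistic queries: the token count is set by num_queries,
        # independent of the video's patch grid.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerDecoder(layer, num_layers=4)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_feats):
        # patch_feats: (B, N_patches, dim) features of the input video patches.
        B = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        z = self.encoder(q, patch_feats)  # queries cross-attend to all patches
        # Nearest codebook entry per query output.
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(B, -1, -1))
        ids = dists.argmin(dim=-1)        # (B, K) discrete token indices
        z_q = self.codebook(ids)
        # Straight-through estimator: gradients bypass the argmin.
        z_q = z + (z_q - z).detach()
        return z_q, ids


class TrainTimePrior(nn.Module):
    """Lightweight AR transformer used only while training the tokenizer:
    its next-token loss nudges the discrete latent space toward sequences
    that are easy to model autoregressively."""

    def __init__(self, codebook_size=1024, dim=512):
        super().__init__()
        self.bos = nn.Parameter(torch.randn(1, 1, dim))  # start-of-sequence
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, z_q, ids):
        # Shift right so position t only sees tokens < t. Feeding the
        # straight-through quantized embeddings lets this loss backpropagate
        # into the tokenizer's encoder (codebook losses are omitted here).
        B, T, _ = z_q.shape
        x = torch.cat([self.bos.expand(B, -1, -1), z_q[:, :-1]], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.blocks(x, mask=mask)
        logits = self.head(h)             # (B, T, codebook_size)
        return F.cross_entropy(logits.transpose(1, 2), ids)


# Joint objective (sketch): reconstruction loss (omitted) + prior loss.
tokenizer, prior = HolisticTokenizer(), TrainTimePrior()
patch_feats = torch.randn(2, 1024, 512)   # stand-in for encoded video patches
z_q, ids = tokenizer(patch_feats)
prior_loss = prior(z_q, ids)
prior_loss.backward()                     # reaches the tokenizer through z_q
```

The design choice this sketch mirrors is that the prior consumes the quantized embeddings rather than detached token indices, so its next-token cross-entropy flows back into the tokenizer during training; this is what pushes the latent space toward a configuration that a downstream AR generator can model more easily.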
