

Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

November 22, 2024
Authors: Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
cs.AI

Abstract

Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x,y,t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128×128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
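The core idea in the abstract, decoding only the patches at randomly sampled (x,y,t) coordinates from factorized triplane features, can be illustrated with a minimal sketch. All shapes, names, and the linear decoding head below are illustrative assumptions, not the paper's actual architecture or training code.

```python
# Minimal sketch of coordinate-based patch reconstruction from factorized
# triplane features, in the spirit of CoordTok. Shapes and the decoder are
# illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

# Factorized triplane representation of a video clip: three feature planes
# over the (x, y), (x, t), and (y, t) axes. C is the feature dimension.
H = W = 16   # spatial resolution of the planes (illustrative)
T = 32       # temporal resolution (illustrative)
C = 8        # feature channels
plane_xy = rng.standard_normal((H, W, C))
plane_xt = rng.standard_normal((H, T, C))
plane_yt = rng.standard_normal((W, T, C))

def sample_plane(plane, u, v):
    """Bilinearly sample a feature plane at continuous coords u, v in [0, 1]."""
    A, B, _ = plane.shape
    a, b = u * (A - 1), v * (B - 1)
    a0, b0 = int(np.floor(a)), int(np.floor(b))
    a1, b1 = min(a0 + 1, A - 1), min(b0 + 1, B - 1)
    wa, wb = a - a0, b - b0
    return ((1 - wa) * (1 - wb) * plane[a0, b0]
            + (1 - wa) * wb * plane[a0, b1]
            + wa * (1 - wb) * plane[a1, b0]
            + wa * wb * plane[a1, b1])

def triplane_feature(x, y, t):
    """Aggregate the three plane features at an (x, y, t) coordinate."""
    return (sample_plane(plane_xy, x, y)
            + sample_plane(plane_xt, x, t)
            + sample_plane(plane_yt, y, t))

# A stand-in "decoder": a fixed linear map from features to a P x P x 3 patch.
P = 4  # patch size (illustrative)
W_dec = rng.standard_normal((C, P * P * 3)) * 0.1

def decode_patch(x, y, t):
    feat = triplane_feature(x, y, t)
    return (feat @ W_dec).reshape(P, P, 3)

# Training would sample random (x, y, t) coordinates and regress only the
# corresponding patches, so the full 128-frame clip never has to be decoded
# at once -- this is what keeps the training cost low for long videos.
coords = rng.uniform(size=(4, 3))
patches = np.stack([decode_patch(*c) for c in coords])
print(patches.shape)  # (4, 4, 4, 3)
```

Because the reconstruction loss touches only the sampled patches, memory scales with the number of sampled coordinates rather than the full clip length.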

