Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

Abstract

Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, since this would let the tokenizer better exploit the temporal coherence of videos. However, training existing tokenizers on long videos often incurs a huge training cost because they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled (x, y, t) coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens needed to encode long video clips. For instance, CoordTok can encode a 128-frame video with 128×128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
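As a rough illustration of the idea of reconstructing only patches at randomly sampled (x, y, t) coordinates, rather than all frames at once, the following minimal sketch draws a few patch-aligned coordinates from a toy video and collects the corresponding patches as reconstruction targets. It is not CoordTok's implementation: the function name `sample_patch_targets`, the nested-list video layout, the patch size, the number of samples, and the patch-aligned sampling grid are all assumptions made for illustration, and the triplane encoder and the decoder are omitted.

```python
import random


def sample_patch_targets(video, patch, num_samples, seed=0):
    """Pick random patch-aligned (x, y, t) coordinates and return their patches.

    `video` is a nested list indexed as video[t][y][x] (an assumed toy layout).
    Returns a list of ((x, y, t), patch_pixels) pairs.
    """
    rng = random.Random(seed)
    frames = len(video)
    height = len(video[0])
    width = len(video[0][0])
    targets = []
    for _ in range(num_samples):
        t = rng.randrange(frames)
        # Sample the top-left corner on a grid with stride `patch`.
        y = rng.randrange(0, height - patch + 1, patch)
        x = rng.randrange(0, width - patch + 1, patch)
        pixels = [row[x:x + patch] for row in video[t][y:y + patch]]
        targets.append(((x, y, t), pixels))
    return targets


# Toy video: 8 frames of 16x16, where each "pixel" records its own location.
video = [[[(t, y, x) for x in range(16)] for y in range(16)] for t in range(8)]
targets = sample_patch_targets(video, patch=4, num_samples=5)

# Number of non-overlapping 4x4 patches in the whole toy video.
total_patches = 8 * (16 // 4) ** 2
print(len(targets), "of", total_patches, "patches used as reconstruction targets")
```

In this toy setting only 5 of the 128 patches are touched per step, which is the intuition behind why the per-step reconstruction cost need not grow with the full length of the clip.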
