VidTok: A Versatile and Open-Source Video Tokenizer

Abstract

Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy of pixel-level representations. Consequently, as video-centric research gains prominence, there is a growing demand for high-performance, open-source video tokenizers. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenization. VidTok incorporates several key advancements over existing approaches: 1) an improved model architecture, including its convolutional layers and up/downsampling modules; 2) the integration of Finite Scalar Quantization (FSQ) into discrete video tokenization, which addresses the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ); and 3) improved training strategies, including a two-stage training process and the use of reduced frame rates. By integrating these advancements, VidTok achieves substantial improvements over existing methods, demonstrating superior performance under standardized evaluation settings across multiple metrics, including PSNR, SSIM, LPIPS, and FVD.
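To make the FSQ idea mentioned in the abstract more concrete, the following is a minimal NumPy sketch of finite scalar quantization as it is generally described: each latent channel is squashed into a bounded range and rounded to one of a small, fixed number of levels, so the codebook is implicit and has no learned entries that could collapse. This is not VidTok's implementation. The function name `fsq_quantize`, the level configuration `[8, 5, 5, 5]`, and the `eps` margin are assumptions chosen for illustration, and the sketch omits both the straight-through gradient estimator needed for training and the small input shift some implementations apply to channels with an even number of levels.

```python
import numpy as np


def fsq_quantize(z, levels, eps=1e-3):
    """Illustrative finite scalar quantization (hypothetical helper).

    z      : float array of shape (..., d), one latent vector per row.
    levels : sequence of d ints, the number of quantization levels per channel.
    Returns (codes, indices): integer codes per channel with shape (..., d),
    and a flat token index in [0, prod(levels)) with shape (...).
    """
    z = np.asarray(z, dtype=np.float64)
    levels = np.asarray(levels, dtype=np.int64)

    # Bound each channel so that rounding yields exactly `levels` distinct integers.
    half_l = (levels - 1) * (1.0 - eps) / 2.0
    offset = np.where(levels % 2 == 1, 0.0, 0.5)  # even level counts are asymmetric
    bounded = np.tanh(z) * half_l - offset

    # Round to the nearest integer level (training would use a straight-through estimator).
    codes = np.round(bounded).astype(np.int64)

    # Shift codes to non-negative digits and combine them as a mixed-radix number.
    digits = codes + levels // 2
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    indices = (digits * basis).sum(axis=-1)
    return codes, indices


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(6, 4))  # six latent vectors with four channels each
    codes, indices = fsq_quantize(latents, levels=[8, 5, 5, 5])
    print(codes)
    print(indices)  # every index lies in [0, 8 * 5 * 5 * 5)
```

With the assumed levels `[8, 5, 5, 5]`, the implicit codebook has 8 × 5 × 5 × 5 = 1000 entries. Because every entry is a fixed point on a grid rather than a learned vector, the quantizer has no codebook parameters to train.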
