VidTok: A Versatile and Open-Source Video Tokenizer

Abstract

Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the redundancy inherent in pixel-level representations. Consequently, as video-centric research gains prominence, demand is growing for high-performance, open-source video tokenizers. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenization. VidTok incorporates several key advancements over existing approaches: 1) an improved model architecture, including its convolutional layers and up/downsampling modules; 2) Finite Scalar Quantization (FSQ) for discrete video tokenization, which addresses the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ); 3) improved training strategies, including a two-stage training process and the use of reduced frame rates. Together, these advancements yield substantial improvements over existing methods, with superior performance across multiple metrics, including PSNR, SSIM, LPIPS, and FVD, under standardized evaluation settings.
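To make the FSQ idea mentioned above concrete, here is a minimal NumPy sketch of the general technique: each latent channel is bounded and rounded to a small fixed set of integer values, so the codebook is implicit and nothing has to be learned or can collapse. This is not VidTok's implementation. The function names `fsq_quantize` and `fsq_index`, the choice of three channels with five levels each, and the restriction to odd level counts are all assumptions made for illustration, and the straight-through gradient used during training is omitted.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Round each channel of z to one of levels[i] integer values.

    Hypothetical helper: handles odd level counts only, so the codes
    are symmetric around zero (5 levels -> codes in {-2, ..., 2}).
    """
    levels = np.asarray(levels)
    if np.any(levels % 2 == 0):
        raise ValueError("this sketch handles odd level counts only")
    half = (levels - 1) / 2
    bounded = np.tanh(z) * half          # squash each channel into (-half, half)
    return np.round(bounded).astype(np.int64)

def fsq_index(codes, levels):
    """Map per-channel codes to a single token id (mixed-radix encoding)."""
    levels = np.asarray(levels)
    half = (levels - 1) // 2
    digits = codes + half                # shift codes to {0, ..., L-1}
    basis = np.concatenate(([1], np.cumprod(levels[:-1])))
    return (digits * basis).sum(axis=-1)

# Assumed configuration: 3 latent channels, 5 levels each.
levels = [5, 5, 5]
z = np.array([[0.0, 0.0, 0.0],
              [10.0, 10.0, 10.0],
              [-10.0, -10.0, -10.0]])
codes = fsq_quantize(z, levels)
token_ids = fsq_index(codes, levels)
```

With these assumed levels the implicit codebook has 5 × 5 × 5 = 125 entries, and every entry is reachable by construction, which is the property that sidesteps the codebook collapse associated with conventional VQ.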
