VidTok: A Versatile and Open-Source Video Tokenizer

Abstract

Encoding video content into compact latent tokens has become a fundamental step in video generation and understanding, driven by the need to address the inherent redundancy of pixel-level representations. Consequently, as video-centric research gains prominence, there is a growing demand for high-performance, open-source video tokenizers. We introduce VidTok, a versatile video tokenizer that delivers state-of-the-art performance in both continuous and discrete tokenization. VidTok incorporates several key advancements over existing approaches: 1) model architecture, including convolutional layers and up/downsampling modules; 2) the integration of Finite Scalar Quantization (FSQ) into discrete video tokenization, which addresses the training instability and codebook collapse commonly associated with conventional Vector Quantization (VQ); 3) improved training strategies, including a two-stage training process and the use of reduced frame rates. By integrating these advancements, VidTok achieves substantial improvements over existing methods, demonstrating superior performance across multiple metrics, including PSNR, SSIM, LPIPS, and FVD, under standardized evaluation settings.
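To make the FSQ idea concrete, the following is a minimal sketch of how a single latent vector could be quantized without a learned codebook. It is not VidTok's implementation: the function names `fsq_quantize` and `fsq_index`, the level configuration `[8, 5, 5]`, and the `eps` margin are assumptions chosen for illustration, and the straight-through gradient used during training is omitted. Each channel is bounded with `tanh` and rounded to one of a fixed number of integer levels, so the implicit codebook size is simply the product of the per-channel level counts.

```python
import math


def fsq_quantize(z, levels, eps=1e-3):
    """Round each channel of z to one of `levels[i]` integer codes.

    z      : list of floats, one latent vector (hypothetical input)
    levels : list of ints, number of quantization levels per channel
    """
    codes = []
    for value, num_levels in zip(z, levels):
        # Half-width of the bounded range, shrunk slightly so rounding
        # never produces an out-of-range level.
        half_l = (num_levels - 1) * (1 - eps) / 2
        # Even level counts need a half-step offset to stay symmetric.
        offset = 0.0 if num_levels % 2 == 1 else 0.5
        shift = math.atanh(offset / half_l)
        bounded = math.tanh(value + shift) * half_l - offset
        codes.append(round(bounded))
    return codes


def fsq_index(codes, levels):
    """Map per-channel codes to a single token id (mixed-radix encoding)."""
    index, basis = 0, 1
    for code, num_levels in zip(codes, levels):
        index += (code + num_levels // 2) * basis  # shift code to 0..L-1
        basis *= num_levels
    return index


levels = [8, 5, 5]                 # assumed config: 8 * 5 * 5 = 200 tokens
codes = fsq_quantize([0.3, -1.2, 4.0], levels)
token_id = fsq_index(codes, levels)
```

Because every combination of levels is a valid code by construction, there are no codebook entries to learn, which is the intuition behind FSQ sidestepping the codebook collapse associated with VQ.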
