VidTwin: Video VAE with Decoupled Structure and Dynamics
December 23, 2024
Authors: Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian
cs.AI
Abstract
Recent advancements in video autoencoders (Video AEs) have significantly
improved the quality and efficiency of video generation. In this paper, we
propose a novel and compact video autoencoder, VidTwin, that decouples video
into two distinct latent spaces: Structure latent vectors, which capture
overall content and global movement, and Dynamics latent vectors, which
represent fine-grained details and rapid movements. Specifically, our approach
leverages an Encoder-Decoder backbone, augmented with two submodules for
extracting these latent spaces, respectively. The first submodule employs a
Q-Former to extract low-frequency motion trends, followed by downsampling
blocks to remove redundant content details. The second averages the latent
vectors along the spatial dimension to capture rapid motion. Extensive
experiments show that VidTwin achieves a high compression rate of 0.20% with
high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and
performs efficiently and effectively in downstream generative tasks. Moreover,
our model demonstrates explainability and scalability, paving the way for
future research in video latent representation and generation. Our code has
been released at https://github.com/microsoft/VidTok/tree/main/vidtwin.
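
To make the two-branch design concrete, the following is a minimal PyTorch sketch of the decoupled latent extraction described in the abstract. The module names (StructureBranch, DynamicsBranch), all dimensions, and the single cross-attention layer over learnable queries standing in for the paper's Q-Former are illustrative assumptions, not the released VidTok/VidTwin implementation.

    # Hypothetical sketch of VidTwin-style decoupled latents; names and
    # shapes are assumptions for illustration, not the official code.
    import torch
    import torch.nn as nn

    class StructureBranch(nn.Module):
        """Structure latents: learnable queries cross-attend to backbone
        features (a stand-in for the Q-Former), then a strided convolution
        downsamples away redundant content detail."""

        def __init__(self, dim: int, num_queries: int = 16):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, dim))
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
            # Downsampling block: halves the query sequence length.
            self.down = nn.Conv1d(dim, dim, kernel_size=2, stride=2)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (B, T*H*W, C) flattened spatiotemporal backbone features
            q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
            z, _ = self.cross_attn(q, feats, feats)           # (B, num_queries, C)
            z = self.down(z.transpose(1, 2)).transpose(1, 2)  # (B, num_queries//2, C)
            return z

    class DynamicsBranch(nn.Module):
        """Dynamics latents: average features over the spatial dimensions,
        keeping one vector per frame that tracks rapid motion."""

        def __init__(self, dim: int, latent_dim: int = 64):
            super().__init__()
            self.proj = nn.Linear(dim, latent_dim)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (B, T, H, W, C); mean over H and W -> (B, T, C)
            return self.proj(feats.mean(dim=(2, 3)))

    if __name__ == "__main__":
        B, T, H, W, C = 2, 8, 4, 4, 128
        feats = torch.randn(B, T, H, W, C)
        structure = StructureBranch(dim=C)(feats.reshape(B, T * H * W, C))
        dynamics = DynamicsBranch(dim=C)(feats)
        print(structure.shape, dynamics.shape)  # (2, 8, 128) (2, 8, 64)

One way to read the design, consistent with the abstract: the spatial mean in the Dynamics branch collapses per-frame appearance, so what survives across time is mostly fast motion, while the query bottleneck and downsampling in the Structure branch keep only the low-frequency content and global movement.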