VidTwin: Video VAE with Decoupled Structure and Dynamics
December 23, 2024
Authors: Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian
cs.AI
Abstract
Recent advancements in video autoencoders (Video AEs) have significantly
improved the quality and efficiency of video generation. In this paper, we
propose a novel and compact video autoencoder, VidTwin, that decouples video
into two distinct latent spaces: Structure latent vectors, which capture
overall content and global movement, and Dynamics latent vectors, which
represent fine-grained details and rapid movements. Specifically, our approach
leverages an Encoder-Decoder backbone augmented with two submodules, one for
each latent space. The first submodule employs a
Q-Former to extract low-frequency motion trends, followed by downsampling
blocks to remove redundant content details. The second submodule averages the latent
vectors along the spatial dimension to capture rapid motion. Extensive
experiments show that VidTwin achieves a high compression rate of 0.20% with
high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and
performs efficiently and effectively in downstream generative tasks. Moreover,
our model demonstrates explainability and scalability, paving the way for
future research in video latent representation and generation. Our code has
been released at https://github.com/microsoft/VidTok/tree/main/vidtwin.
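
The decoupling described above is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch of the two extraction submodules: a Q-Former-style set of learned queries with cross-attention followed by a downsampling block for the Structure latent, and spatial averaging with a linear projection for the Dynamics latent. All module names, dimensions, and layer choices are illustrative assumptions, not the released implementation (see the repository above for the authoritative code).

```python
# Hypothetical sketch of VidTwin's two latent branches (PyTorch).
# Dimensions, query count, and layer choices are assumptions for
# illustration only; they do not reproduce the released model.
import torch
import torch.nn as nn


class StructureBranch(nn.Module):
    """Structure latent: learned queries cross-attend to encoder tokens
    (a Q-Former-style bottleneck that keeps low-frequency content and
    global motion), then a strided conv downsamples to strip redundant
    detail."""

    def __init__(self, dim=256, num_queries=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.down = nn.Conv1d(dim, dim, kernel_size=2, stride=2)  # downsampling block

    def forward(self, tokens):                      # tokens: (B, T*H*W, dim)
        q = self.queries.expand(tokens.size(0), -1, -1)
        z, _ = self.cross_attn(q, tokens, tokens)   # (B, num_queries, dim)
        z = self.down(z.transpose(1, 2)).transpose(1, 2)
        return z                                    # compact Structure latent


class DynamicsBranch(nn.Module):
    """Dynamics latent: average encoder features over the spatial
    dimensions, keeping a low-dimensional per-frame vector that tracks
    rapid motion."""

    def __init__(self, dim=256, out_dim=32):
        super().__init__()
        self.proj = nn.Linear(dim, out_dim)

    def forward(self, feats):                       # feats: (B, T, H, W, dim)
        z = feats.mean(dim=(2, 3))                  # spatial average -> (B, T, dim)
        return self.proj(z)                         # (B, T, out_dim)


# Toy usage: 8 frames of 16x16 encoder features with width 256.
feats = torch.randn(2, 8, 16, 16, 256)
tokens = feats.flatten(1, 3)                        # (B, T*H*W, dim)
z_struct = StructureBranch()(tokens)
z_dyn = DynamicsBranch()(feats)
print(z_struct.shape, z_dyn.shape)                  # (2, 8, 256) (2, 8, 32)
```

The intuition behind this split is that a small set of cross-attention queries can only retain slowly varying, global information, while the spatial average collapses appearance but preserves a per-frame signal, so the two bottlenecks naturally separate structure from dynamics.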