Large Motion Video Autoencoding with Cross-modal Video VAE
December 23, 2024
Authors: Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen
cs.AI
Abstract
Learning a robust video Variational Autoencoder (VAE) is essential for
reducing video redundancy and facilitating efficient video generation. Directly
applying image VAEs to individual frames in isolation can result in temporal
inconsistencies and suboptimal compression rates due to a lack of temporal
compression. Existing Video VAEs have begun to address temporal compression;
however, they often suffer from inadequate reconstruction performance. In this
paper, we present a novel and powerful video autoencoder capable of
high-fidelity video encoding. First, we observe that entangling spatial and
temporal compression by merely extending the image VAE to a 3D VAE can
introduce motion blur and detail distortion artifacts. Thus, we propose
temporal-aware spatial compression to better encode and decode the spatial
information. Additionally, we integrate a lightweight motion compression model
for further temporal compression. Second, we propose to leverage the textual
information inherent in text-to-video datasets and incorporate text guidance
into our model. This significantly enhances reconstruction quality,
particularly in terms of detail preservation and temporal stability. Third, we
further improve the versatility of our model through joint training on both
images and videos, which not only enhances reconstruction quality but also
enables the model to perform both image and video autoencoding. Extensive
evaluations against strong recent baselines demonstrate the superior
performance of our method. The project website can be found
at https://yzxing87.github.io/vae/.
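To make the two-stage design described above more concrete, the following is a minimal sketch, assuming a PyTorch-style implementation: per-frame spatial downsampling is made "temporal-aware" with a lightweight temporal convolution, and a separate small module then compresses along time. All module names, channel widths, and the 4x spatial / 4x temporal factors are illustrative assumptions, not the authors' implementation, and the text-guidance and joint image-video training components are omitted.

```python
# Illustrative sketch only: module names, channel sizes, and compression factors
# are assumptions chosen to mirror the two-stage idea in the abstract
# (temporal-aware spatial compression, then a lightweight motion/temporal
# compression model). This is not the paper's released code.
import torch
import torch.nn as nn


class TemporalAwareSpatialEncoder(nn.Module):
    """Compresses each frame spatially while letting features mix across time."""

    def __init__(self, in_ch=3, hidden=64, latent=4):
        super().__init__()
        # Per-frame spatial downsampling (4x per side here, an assumed factor).
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
        )
        # Lightweight temporal mixing so the spatial codes are "temporal-aware".
        self.temporal = nn.Conv3d(hidden, hidden, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.to_latent = nn.Conv2d(hidden, latent, 1)

    def forward(self, video):  # video: (B, C, T, H, W)
        b, c, t, h, w = video.shape
        frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feat = self.spatial(frames)                                  # (B*T, hidden, H/4, W/4)
        feat = feat.reshape(b, t, -1, h // 4, w // 4).permute(0, 2, 1, 3, 4)
        feat = feat + self.temporal(feat)                            # temporal-aware refinement
        feat = feat.permute(0, 2, 1, 3, 4).reshape(b * t, -1, h // 4, w // 4)
        latents = self.to_latent(feat)
        return latents.reshape(b, t, -1, h // 4, w // 4).permute(0, 2, 1, 3, 4)


class LightweightMotionCompressor(nn.Module):
    """Further compresses the latent sequence along time (assumed 4x)."""

    def __init__(self, latent=4):
        super().__init__()
        self.down = nn.Conv3d(latent, latent, kernel_size=(5, 1, 1),
                              stride=(4, 1, 1), padding=(2, 0, 0))

    def forward(self, latents):  # latents: (B, latent, T, H', W')
        return self.down(latents)


if __name__ == "__main__":
    video = torch.randn(1, 3, 16, 64, 64)                    # short RGB clip
    spatial_latents = TemporalAwareSpatialEncoder()(video)   # (1, 4, 16, 16, 16)
    compact = LightweightMotionCompressor()(spatial_latents)  # (1, 4, 4, 16, 16)
    print(spatial_latents.shape, compact.shape)
```

Keeping temporal compression in a separate, lightweight module (rather than a single entangled 3D VAE) is the design choice the abstract motivates: it avoids the motion blur and detail distortion the authors observe when spatial and temporal compression are fused.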