Large Motion Video Autoencoding with Cross-modal Video VAE
December 23, 2024
Authors: Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen
cs.AI
Abstract
Learning a robust video Variational Autoencoder (VAE) is essential for
reducing video redundancy and facilitating efficient video generation. Directly
applying image VAEs to individual frames in isolation can result in temporal
inconsistencies and suboptimal compression rates due to a lack of temporal
compression. Existing Video VAEs have begun to address temporal compression;
however, they often suffer from inadequate reconstruction performance. In this
paper, we present a novel and powerful video autoencoder capable of
high-fidelity video encoding. First, we observe that entangling spatial and
temporal compression by merely extending the image VAE to a 3D VAE can
introduce motion blur and detail distortion artifacts. Thus, we propose
temporal-aware spatial compression to better encode and decode the spatial
information. Additionally, we integrate a lightweight motion compression model
for further temporal compression. Second, we propose to leverage the textual
information inherent in text-to-video datasets and incorporate text guidance
into our model. This significantly enhances reconstruction quality,
particularly in terms of detail preservation and temporal stability. Third, we
further improve the versatility of our model through joint training on both
images and videos, which not only enhances reconstruction quality but also
enables the model to perform both image and video autoencoding. Extensive
evaluations against strong recent baselines demonstrate the superior
performance of our method. The project website can be found
at https://yzxing87.github.io/vae/.
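To make the two-stage design described above more concrete, the following is a minimal sketch, assuming a PyTorch-style implementation: per-frame spatial downsampling is made "temporal-aware" with a lightweight temporal convolution, and a separate small module then compresses along time. All module names, channel widths, and the 4x spatial / 4x temporal factors are illustrative assumptions, not the authors' implementation, and the text-guidance and joint image-video training components are omitted.

```python
# Illustrative sketch only: module names, channel sizes, and compression factors
# are assumptions chosen to mirror the two-stage idea in the abstract
# (temporal-aware spatial compression, then a lightweight motion/temporal
# compression model). This is not the paper's released code.
import torch
import torch.nn as nn


class TemporalAwareSpatialEncoder(nn.Module):
    """Compresses each frame spatially while letting features mix across time."""

    def __init__(self, in_ch=3, hidden=64, latent=4):
        super().__init__()
        # Per-frame spatial downsampling (4x per side here, an assumed factor).
        self.spatial = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, stride=2, padding=1),
            nn.SiLU(),
        )
        # Lightweight temporal mixing so the spatial codes are "temporal-aware".
        self.temporal = nn.Conv3d(hidden, hidden, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.to_latent = nn.Conv2d(hidden, latent, 1)

    def forward(self, video):  # video: (B, C, T, H, W)
        b, c, t, h, w = video.shape
        frames = video.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        feat = self.spatial(frames)                                  # (B*T, hidden, H/4, W/4)
        feat = feat.reshape(b, t, -1, h // 4, w // 4).permute(0, 2, 1, 3, 4)
        feat = feat + self.temporal(feat)                            # temporal-aware refinement
        feat = feat.permute(0, 2, 1, 3, 4).reshape(b * t, -1, h // 4, w // 4)
        latents = self.to_latent(feat)
        return latents.reshape(b, t, -1, h // 4, w // 4).permute(0, 2, 1, 3, 4)


class LightweightMotionCompressor(nn.Module):
    """Further compresses the latent sequence along time (assumed 4x)."""

    def __init__(self, latent=4):
        super().__init__()
        self.down = nn.Conv3d(latent, latent, kernel_size=(5, 1, 1),
                              stride=(4, 1, 1), padding=(2, 0, 0))

    def forward(self, latents):  # latents: (B, latent, T, H', W')
        return self.down(latents)


if __name__ == "__main__":
    video = torch.randn(1, 3, 16, 64, 64)                    # short RGB clip
    spatial_latents = TemporalAwareSpatialEncoder()(video)   # (1, 4, 16, 16, 16)
    compact = LightweightMotionCompressor()(spatial_latents)  # (1, 4, 4, 16, 16)
    print(spatial_latents.shape, compact.shape)
```

Keeping temporal compression in a separate, lightweight module (rather than a single entangled 3D VAE) is the design choice the abstract motivates: it avoids the motion blur and detail distortion the authors observe when spatial and temporal compression are fused.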