Large Motion Video Autoencoding with Cross-modal Video VAE

December 23, 2024
Authors: Yazhou Xing, Yang Fei, Yingqing He, Jingye Chen, Jiaxin Xie, Xiaowei Chi, Qifeng Chen
cs.AI

Abstract

Learning a robust video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Directly applying image VAEs to individual frames in isolation can result in temporal inconsistencies and suboptimal compression rates due to a lack of temporal compression. Existing Video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. In this paper, we present a novel and powerful video autoencoder capable of high-fidelity video encoding. First, we observe that entangling spatial and temporal compression by merely extending the image VAE to a 3D VAE can introduce motion blur and detail distortion artifacts. Thus, we propose temporal-aware spatial compression to better encode and decode the spatial information. Additionally, we integrate a lightweight motion compression model for further temporal compression. Second, we propose to leverage the textual information inherent in text-to-video datasets and incorporate text guidance into our model. This significantly enhances reconstruction quality, particularly in terms of detail preservation and temporal stability. Third, we further improve the versatility of our model through joint training on both images and videos, which not only enhances reconstruction quality but also enables the model to perform both image and video autoencoding. Extensive evaluations against strong recent baselines demonstrate the superior performance of our method. The project website can be found at https://yzxing87.github.io/vae/.
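
The abstract's point about suboptimal compression rates without temporal compression can be illustrated with simple shape arithmetic. The sketch below is not the authors' model: the helper names (`latent_shape`, `compression_ratio`) and the downsampling factors (8x spatial, 4x temporal, 4 latent channels) are assumptions chosen only for illustration and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentShape:
    """Shape of a latent video tensor (hypothetical layout: T x H x W x C)."""
    frames: int
    height: int
    width: int
    channels: int

    def numel(self) -> int:
        return self.frames * self.height * self.width * self.channels


def latent_shape(frames: int, height: int, width: int,
                 spatial_factor: int, temporal_factor: int,
                 latent_channels: int) -> LatentShape:
    """Latent shape after downsampling space and time by the given factors.

    All factors are illustrative assumptions, not values from the paper.
    """
    # Ceiling division so leftover frames still occupy one latent step.
    latent_frames = -(-frames // temporal_factor)
    return LatentShape(latent_frames,
                       height // spatial_factor,
                       width // spatial_factor,
                       latent_channels)


def compression_ratio(frames: int, height: int, width: int,
                      latent: LatentShape, rgb_channels: int = 3) -> float:
    """Ratio of input element count to latent element count."""
    return (frames * height * width * rgb_channels) / latent.numel()


# A 16-frame 256x256 RGB clip under two assumed settings.
per_frame = latent_shape(16, 256, 256, spatial_factor=8,
                         temporal_factor=1, latent_channels=4)
with_temporal = latent_shape(16, 256, 256, spatial_factor=8,
                             temporal_factor=4, latent_channels=4)

ratio_image_vae = compression_ratio(16, 256, 256, per_frame)
ratio_video_vae = compression_ratio(16, 256, 256, with_temporal)
print(ratio_image_vae, ratio_video_vae)
```

Under these assumed factors, encoding each frame independently leaves the frame axis untouched, so the overall ratio is lower by exactly the temporal factor than when the time axis is also compressed.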
