LTX-Video: Realtime Video Latent Diffusion

December 30, 2024
Authors: Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, Poriya Panet, Sapir Weissbuch, Victor Kulikov, Yaki Bitterman, Zeev Melumian, Ofir Bibi
cs.AI

Abstract

We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
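The compression figures quoted in the abstract can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes 3-channel RGB input and infers the latent channel count from the stated 1:192 ratio and 32 x 32 x 8 per-token downscaling; it also works through the token count for the 5-second, 24 fps, 768x512 example, assuming the frame count divides evenly by the temporal factor.

```python
# Illustrative arithmetic for the compression figures in the abstract.
# Assumption: 3-channel RGB input; the latent channel count is inferred
# from the quoted numbers, not taken from the paper itself.

patch_t, patch_h, patch_w = 8, 32, 32   # spatiotemporal downscaling per latent token
rgb_channels = 3                        # assumed standard RGB input
compression_ratio = 192                 # "1:192" as stated in the abstract

pixels_per_token = patch_t * patch_h * patch_w * rgb_channels
print(pixels_per_token)                 # 24576 pixel values folded into one token

latent_dim = pixels_per_token // compression_ratio
print(latent_dim)                       # 128 channels per latent token (inferred)

# Sequence length for the 5 s, 24 fps, 768x512 example, assuming the frame
# count is evenly divisible by the temporal downscaling factor:
frames, height, width = 5 * 24, 512, 768
tokens = (frames // patch_t) * (height // patch_h) * (width // patch_w)
print(tokens)                           # 5760 tokens over which full
                                        # spatiotemporal self-attention runs
```

The small sequence length is what makes full spatiotemporal self-attention affordable in the transformer, at the cost of fine detail that the decoder's combined latent-to-pixel conversion and final denoising step is meant to recover.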
