LTX-Video: Realtime Video Latent Diffusion

Abstract

We introduce LTX-Video, a transformer-based latent diffusion model that adopts a holistic approach to video generation by seamlessly integrating the responsibilities of the Video-VAE and the denoising transformer. Unlike existing methods, which treat these components as independent, LTX-Video aims to optimize their interaction for improved efficiency and quality. At its core is a carefully designed Video-VAE that achieves a high compression ratio of 1:192, with spatiotemporal downscaling of 32 x 32 x 8 pixels per token, enabled by relocating the patchifying operation from the transformer's input to the VAE's input. Operating in this highly compressed latent space enables the transformer to efficiently perform full spatiotemporal self-attention, which is essential for generating high-resolution videos with temporal consistency. However, the high compression inherently limits the representation of fine details. To address this, our VAE decoder is tasked with both latent-to-pixel conversion and the final denoising step, producing the clean result directly in pixel space. This approach preserves the ability to generate fine details without incurring the runtime cost of a separate upsampling module. Our model supports diverse use cases, including text-to-video and image-to-video generation, with both capabilities trained simultaneously. It achieves faster-than-real-time generation, producing 5 seconds of 24 fps video at 768x512 resolution in just 2 seconds on an Nvidia H100 GPU, outperforming all existing models of similar scale. The source code and pre-trained models are publicly available, setting a new benchmark for accessible and scalable video generation.
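The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only and rests on assumptions the abstract does not state: that the input is 3-channel RGB, that the 1:192 ratio compares raw input values to latent values per token, and that the clip dimensions divide evenly by the 32 x 32 x 8 patch. The function names are hypothetical and are not taken from the released code. Under these assumptions, the stated numbers imply 128 latent values per token, 5,760 tokens for the 5-second 768x512 clip, and generation at 2.5 times real-time speed.

```python
# Figures quoted in the abstract.
PATCH_W, PATCH_H, PATCH_T = 32, 32, 8  # pixels (and frames) covered by one token
COMPRESSION_RATIO = 192                # the stated 1:192
RGB_CHANNELS = 3                       # assumption: 3-channel RGB input


def latent_values_per_token() -> int:
    """Latent values per token implied by the stated compression ratio."""
    raw_values = PATCH_W * PATCH_H * PATCH_T * RGB_CHANNELS
    return raw_values // COMPRESSION_RATIO


def token_count(width: int, height: int, seconds: int, fps: int) -> int:
    """Tokens the transformer attends over, assuming exact divisibility."""
    frames = seconds * fps
    return (width // PATCH_W) * (height // PATCH_H) * (frames // PATCH_T)


def realtime_factor(video_seconds: float, generation_seconds: float) -> float:
    """How many seconds of video are produced per second of compute."""
    return video_seconds / generation_seconds


if __name__ == "__main__":
    print(latent_values_per_token())       # 128
    print(token_count(768, 512, 5, 24))    # 5760
    print(realtime_factor(5, 2))           # 2.5
```

The token count is the point of the high compression: full spatiotemporal self-attention scales quadratically with the number of tokens, so a few thousand tokens per clip keeps it tractable.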
