Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

Abstract

Recent advances have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in photorealistic image generation with Next-DiT. However, its potential for video generation remains largely untapped, because modeling the spatiotemporal complexity inherent in video data poses significant challenges. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control over the dynamic degree of generated videos. Combined with a progressive training scheme that steps up resolution and FPS, and a multi-source training scheme that mixes natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness with high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Code is released at https://www.github.com/Alpha-VLLM/Lumina-Video.
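The efficiency side of learning multiple patchifications can be illustrated with simple arithmetic: a larger patch size turns the same video latent into fewer tokens for the transformer to process. The sketch below is only an illustration under stated assumptions. The latent shape, the patch sizes, and the helper name `num_tokens` are hypothetical and are not taken from the paper or the released code.

```python
from math import ceil, prod


def num_tokens(latent_shape, patch_size):
    """Count tokens when a (frames, height, width) latent is split into
    non-overlapping patches. Partial patches are assumed to be padded,
    hence the ceiling division. This is an illustrative helper, not the
    authors' implementation."""
    return prod(ceil(dim / p) for dim, p in zip(latent_shape, patch_size))


# Hypothetical latent: 16 frames of 32x32 spatial positions.
latent = (16, 32, 32)

# Hypothetical patch sizes (time, height, width), from fine to coarse.
patch_sizes = [(1, 2, 2), (2, 2, 2), (2, 4, 4)]

for patch in patch_sizes:
    print(patch, num_tokens(latent, patch))
```

Under these assumed numbers, the coarsest patchification yields 8 times fewer tokens than the finest one, which is the kind of trade-off between compute and detail that a model supporting several patch sizes can exploit.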
