Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

February 10, 2025
Authors: Dongyang Liu, Shicheng Li, Yutong Liu, Zhen Li, Kai Wang, Xinyue Li, Qi Qin, Yufei Liu, Yi Xin, Zhongyu Li, Bin Fu, Chenyang Si, Yuewen Cao, Conghui He, Ziwei Liu, Yu Qiao, Qibin Hou, Hongsheng Li, Peng Gao
cs.AI

Abstract

Recent advancements have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in photorealistic image generation with Next-DiT. However, its potential for video generation remains largely untapped, and modeling the spatiotemporal complexity inherent to video data poses significant challenges. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control over the dynamic degree of generated videos. Combined with a progressive training scheme that uses increasingly higher resolution and FPS, and a multi-source training scheme that mixes natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness at high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Code is released at https://www.github.com/Alpha-VLLM/Lumina-Video.
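
The abstract does not spell out how the choice of patchification affects cost, so the following is only a rough sketch of the general trade-off, not the authors' implementation. It counts how many tokens a transformer would see when a video latent is split into non-overlapping patches of different sizes. The latent shape, the patch sizes, and the helper name `token_count` are all assumptions made for illustration; none of them come from the paper.

```python
from math import prod


def token_count(latent_shape, patch_size):
    """Count tokens after splitting a (frames, height, width) latent
    into non-overlapping patches of the given size."""
    if any(dim % p != 0 for dim, p in zip(latent_shape, patch_size)):
        raise ValueError("each latent dimension must be divisible by its patch size")
    return prod(dim // p for dim, p in zip(latent_shape, patch_size))


# Hypothetical latent shape and patch sizes, chosen only for illustration.
latent_shape = (16, 32, 32)
patch_sizes = [(1, 2, 2), (2, 2, 2), (2, 4, 4)]

counts = {ps: token_count(latent_shape, ps) for ps in patch_sizes}
for ps, n in counts.items():
    print(f"patch size {ps}: {n} tokens")
```

Under these assumed sizes, the coarsest patchification yields 512 tokens against 4096 for the finest. Since self-attention cost grows roughly quadratically with sequence length, larger patches are considerably cheaper, at the price of coarser spatial and temporal detail per token.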
