MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Abstract

The rapid advancement of diffusion models has greatly improved video synthesis, especially controllable video generation, which is essential for applications such as autonomous driving. However, existing methods are limited in scalability and in how they integrate control conditions, so they fail to meet the need for high-resolution, long videos in autonomous driving applications. In this paper, we introduce MagicDriveDiT, a novel approach based on the DiT architecture that tackles these challenges. Our method enhances scalability through flow matching and employs a progressive training strategy to handle complex scenarios. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Comprehensive experiments show its superior performance in generating realistic street-scene videos with higher resolution and more frames. MagicDriveDiT significantly improves video generation quality and spatial-temporal control, expanding its potential applications across various autonomous driving tasks.
