MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control

Abstract

The rapid advancement of diffusion models has greatly improved video synthesis, especially controllable video generation, which is essential for applications such as autonomous driving. However, existing methods are limited in scalability and in how they integrate control conditions, so they fail to meet the need for high-resolution, long videos in autonomous driving applications. In this paper, we introduce MagicDriveDiT, a novel approach based on the DiT architecture that tackles these challenges. Our method enhances scalability through flow matching and employs a progressive training strategy to manage complex scenarios. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Comprehensive experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames. MagicDriveDiT significantly improves video generation quality and spatial-temporal control, expanding its potential applications across various tasks in autonomous driving.
