World-consistent Video Diffusion with Explicit 3D Modeling

Abstract

Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
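To make the notion of an XYZ image concrete, the following is a minimal sketch of one way such an image could be built from a depth map and a camera pose. The abstract does not say how XYZ images are constructed, so everything here is an assumption for illustration: the pinhole camera model, the use of z-depth, and the names `depth_to_xyz_image`, `K`, and `cam_to_world` are all hypothetical and not taken from the paper.

```python
import numpy as np


def depth_to_xyz_image(depth, K, cam_to_world):
    """Unproject a z-depth map into per-pixel world coordinates.

    Hypothetical helper, not the authors' code. Assumes a pinhole camera.
    depth:        (H, W) z-depth along the camera's optical axis
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world rigid transform
    returns:      (H, W, 3) array holding the world XYZ of every pixel
    """
    h, w = depth.shape
    # u indexes columns, v indexes rows; both arrays have shape (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    # Back-project each pixel to a camera-frame ray with z = 1.
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by its depth to get camera-frame points.
    points_cam = rays * depth[..., None]
    # Move the points into the shared world frame.
    rotation = cam_to_world[:3, :3]
    translation = cam_to_world[:3, 3]
    return points_cam @ rotation.T + translation


# Toy example: a 3x3 image of a plane at depth 2, camera shifted 10 units in x.
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
pose[0, 3] = 10.0
xyz = depth_to_xyz_image(np.full((3, 3), 2.0), K, pose)
print(xyz.shape)
```

Because every frame's XYZ image is expressed in the same world frame, pixels from different views that show the same surface point carry the same coordinates, which is the sense in which such images encode global 3D coordinates.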
