MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Abstract

We introduce MarDini, a new family of video diffusion models that integrates the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. MAR handles temporal planning, while DM focuses on spatial generation, in an asymmetric network design: i) a MAR-based planning model, which contains most of the parameters, generates planning signals for each masked frame from low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion denoising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame position: a single model can handle video interpolation (e.g., masking the middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half of the frames). This efficient design allocates most of the computational resources to the low-resolution planning model, which makes computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state of the art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.
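To make the masking idea concrete, the three tasks named in the abstract can be viewed as different per-frame masks fed to one model. The following is only an illustrative sketch, not MarDini's actual code: the function name `frame_mask`, the task labels, and the choice to mask the second half of the clip for expansion are all assumptions made for this example.

```python
from typing import List


def frame_mask(num_frames: int, task: str) -> List[bool]:
    """Return a per-frame mask (hypothetical helper, not from the paper).

    True  = frame is masked and must be generated.
    False = frame is given as conditioning.
    """
    if num_frames < 3:
        raise ValueError("num_frames must be at least 3")

    if task == "interpolation":
        # Keep the first and last frames; mask everything in between.
        return [0 < i < num_frames - 1 for i in range(num_frames)]
    if task == "image_to_video":
        # Keep only the first frame; mask from the second frame onward.
        return [i >= 1 for i in range(num_frames)]
    if task == "expansion":
        # Assumption: keep the first half and mask the second half.
        return [i >= num_frames // 2 for i in range(num_frames)]

    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    for name in ("interpolation", "image_to_video", "expansion"):
        print(name, frame_mask(6, name))
```

Under this view, switching tasks changes only which frames are marked as masked, which is consistent with the abstract's claim that a single model covers all three settings.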
