MarDini: Masked Autoregressive Diffusion for Video Generation at Scale
October 26, 2024
Authors: Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C. Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, Jui-Chieh Wu, Sen He, Tao Xiang, Jürgen Schmidhuber, Juan-Manuel Pérez-Rúa
cs.AI
Abstract
We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini's MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within a few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.