ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
December 10, 2024
作者: Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
cs.AI
Abstract
The recent surge of interest in comprehensive multimodal models has
necessitated the unification of diverse modalities. However, this unification suffers from disparate methodologies. Continuous visual generation requires a full-sequence, diffusion-based approach, despite its divergence from autoregressive modeling in the text domain. We posit that autoregressive
modeling, i.e., predicting the future based on past deterministic experience,
remains crucial in developing both a visual generation model and a potential
unified multimodal model. In this paper, we explore an interpolation between autoregressive modeling and full-sequence diffusion to model visual
information. At its core, we present ACDiT, an Autoregressive blockwise
Conditional Diffusion Transformer, where the block size of diffusion, i.e., the
size of autoregressive units, can be flexibly adjusted to interpolate between
token-wise autoregression and full-sequence diffusion. ACDiT is easy to
implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during
training. During inference, the process iterates between diffusion denoising and autoregressive decoding, making full use of the KV-Cache. We verify the
effectiveness of ACDiT on image and video generation tasks. We also demonstrate
that, benefiting from autoregressive modeling, ACDiT can be seamlessly applied to visual understanding tasks despite being trained with the diffusion objective.
An analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. These strengths make it a promising backbone for future unified models.
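
To make the Skip-Causal Attention Mask (SCAM) mentioned in the abstract concrete, here is a minimal sketch of how such a blockwise mask could be constructed. It assumes the training input concatenates a clean copy and a noised copy of the token sequence, and the attention rules it encodes (clean queries are block-causal over clean keys; noised queries attend to strictly earlier clean blocks plus their own noised block) are an interpretation of the abstract, not the paper's reference implementation. The function name build_scam_mask and its parameters are likewise illustrative.

```python
# Minimal sketch of a Skip-Causal Attention Mask (SCAM) for blockwise
# autoregressive diffusion training (assumed layout: [clean tokens | noised
# tokens], each of length L = num_blocks * block_size). Not the paper's code.
import torch


def build_scam_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    """Return a (2L, 2L) boolean mask where True marks attendable positions."""
    L = num_blocks * block_size
    block_id = torch.arange(L) // block_size      # block index of each token

    mask = torch.zeros(2 * L, 2 * L, dtype=torch.bool)
    # Clean queries -> clean keys: block-causal (attend to own and earlier blocks).
    mask[:L, :L] = block_id[:, None] >= block_id[None, :]
    # Noised queries -> clean keys: strictly earlier blocks only, i.e. "skip"
    # the clean copy of the block currently being denoised.
    mask[L:, :L] = block_id[:, None] > block_id[None, :]
    # Noised queries -> noised keys: full attention within the same block.
    mask[L:, L:] = block_id[:, None] == block_id[None, :]
    return mask


# Example: 3 autoregressive units of 4 tokens each gives a 24x24 mask.
scam = build_scam_mask(num_blocks=3, block_size=4)
```

Under these assumed rules, the clean blocks' keys and values never depend on noise, which is what allows them to be cached and reused across autoregressive steps at inference time.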
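
The abstract also describes inference as alternating between diffusion denoising of the current block and autoregressive decoding over previously generated blocks via a KV-Cache. The sketch below only illustrates that control flow; model, new_cache, denoise_step, append_to_cache, num_steps, and block_shape are hypothetical placeholders, not ACDiT's actual API.

```python
# Sketch of blockwise generation that alternates diffusion denoising with
# autoregressive decoding over a KV-Cache. All model methods and argument
# names below are hypothetical placeholders for illustration only.
import torch


@torch.no_grad()
def generate(model, num_blocks: int, block_shape: tuple, num_steps: int = 50,
             device: str = "cpu") -> torch.Tensor:
    kv_cache = model.new_cache()                 # cache of clean-block keys/values
    blocks = []
    for _ in range(num_blocks):
        # Start the current autoregressive unit from pure noise.
        x = torch.randn(block_shape, device=device)
        # Diffusion denoising of this block, conditioned on all cached clean blocks.
        for t in reversed(range(num_steps)):
            x = model.denoise_step(x, t, kv_cache=kv_cache)
        # Autoregressive step: commit the clean block and cache its keys/values
        # so later blocks can condition on it without recomputation.
        model.append_to_cache(kv_cache, x)
        blocks.append(x)
    return torch.cat(blocks, dim=1)              # assumed sequence dimension
```

The benefit of the cache in this loop is that each committed clean block is encoded once, while only the block currently being denoised is recomputed at every diffusion step.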