ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
December 10, 2024
作者: Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
cs.AI
Abstract
The recent surge of interest in comprehensive multimodal models has
necessitated the unification of diverse modalities. However, this unification suffers from disparate methodologies. Continuous visual generation requires a full-sequence, diffusion-based approach, despite its divergence from autoregressive modeling in the text domain. We posit that autoregressive
modeling, i.e., predicting the future based on past deterministic experience,
remains crucial in developing both a visual generation model and a potential
unified multimodal model. In this paper, we explore an interpolation between autoregressive modeling and full-sequence diffusion to model visual
information. At its core, we present ACDiT, an Autoregressive blockwise
Conditional Diffusion Transformer, where the block size of diffusion, i.e., the
size of autoregressive units, can be flexibly adjusted to interpolate between
token-wise autoregression and full-sequence diffusion. ACDiT is easy to
implement, as simple as creating a Skip-Causal Attention Mask (SCAM) during
training. During inference, the process iterates between diffusion denoising and autoregressive decoding, making full use of the KV-Cache. We verify the
effectiveness of ACDiT on image and video generation tasks. We also demonstrate
that, benefiting from autoregressive modeling, ACDiT can be seamlessly applied to visual understanding tasks despite being trained with the diffusion objective.
An analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. These strengths make it a promising backbone for future unified models.
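
To make the Skip-Causal Attention Mask (SCAM) mentioned in the abstract concrete, here is a minimal sketch of how such a blockwise mask could be constructed. It assumes the training input concatenates a clean copy and a noised copy of the token sequence, and the attention rules it encodes (clean queries are block-causal over clean keys; noised queries attend to strictly earlier clean blocks plus their own noised block) are an interpretation of the abstract, not the paper's reference implementation. The function name build_scam_mask and its parameters are likewise illustrative.

```python
# Minimal sketch of a Skip-Causal Attention Mask (SCAM) for blockwise
# autoregressive diffusion training (assumed layout: [clean tokens | noised
# tokens], each of length L = num_blocks * block_size). Not the paper's code.
import torch


def build_scam_mask(num_blocks: int, block_size: int) -> torch.Tensor:
    """Return a (2L, 2L) boolean mask where True marks attendable positions."""
    L = num_blocks * block_size
    block_id = torch.arange(L) // block_size      # block index of each token

    mask = torch.zeros(2 * L, 2 * L, dtype=torch.bool)
    # Clean queries -> clean keys: block-causal (attend to own and earlier blocks).
    mask[:L, :L] = block_id[:, None] >= block_id[None, :]
    # Noised queries -> clean keys: strictly earlier blocks only, i.e. "skip"
    # the clean copy of the block currently being denoised.
    mask[L:, :L] = block_id[:, None] > block_id[None, :]
    # Noised queries -> noised keys: full attention within the same block.
    mask[L:, L:] = block_id[:, None] == block_id[None, :]
    return mask


# Example: 3 autoregressive units of 4 tokens each gives a 24x24 mask.
scam = build_scam_mask(num_blocks=3, block_size=4)
```

Under these assumed rules, the clean blocks' keys and values never depend on noise, which is what allows them to be cached and reused across autoregressive steps at inference time.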
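
The abstract also describes inference as alternating between diffusion denoising of the current block and autoregressive decoding over previously generated blocks via a KV-Cache. The sketch below only illustrates that control flow; model, new_cache, denoise_step, append_to_cache, num_steps, and block_shape are hypothetical placeholders, not ACDiT's actual API.

```python
# Sketch of blockwise generation that alternates diffusion denoising with
# autoregressive decoding over a KV-Cache. All model methods and argument
# names below are hypothetical placeholders for illustration only.
import torch


@torch.no_grad()
def generate(model, num_blocks: int, block_shape: tuple, num_steps: int = 50,
             device: str = "cpu") -> torch.Tensor:
    kv_cache = model.new_cache()                 # cache of clean-block keys/values
    blocks = []
    for _ in range(num_blocks):
        # Start the current autoregressive unit from pure noise.
        x = torch.randn(block_shape, device=device)
        # Diffusion denoising of this block, conditioned on all cached clean blocks.
        for t in reversed(range(num_steps)):
            x = model.denoise_step(x, t, kv_cache=kv_cache)
        # Autoregressive step: commit the clean block and cache its keys/values
        # so later blocks can condition on it without recomputation.
        model.append_to_cache(kv_cache, x)
        blocks.append(x)
    return torch.cat(blocks, dim=1)              # assumed sequence dimension
```

The benefit of the cache in this loop is that each committed clean block is encoded once, while only the block currently being denoised is recomputed at every diffusion step.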