ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

December 10, 2024
Authors: Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
cs.AI

Abstract

The recent surge of interest in comprehensive multimodal models has necessitated the unification of diverse modalities. However, this unification suffers from disparate methodologies. Continuous visual generation requires a full-sequence diffusion-based approach, which diverges from the autoregressive modeling used in the text domain. We posit that autoregressive modeling, i.e., predicting the future based on past deterministic experience, remains crucial in developing both a visual generation model and a potential unified multimodal model. In this paper, we explore an interpolation between autoregressive modeling and full-sequence diffusion to model visual information. At its core, we present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer, where the block size of diffusion, i.e., the size of the autoregressive units, can be flexibly adjusted to interpolate between token-wise autoregression and full-sequence diffusion. ACDiT is easy to implement: it only requires creating a Skip-Causal Attention Mask (SCAM) during training. During inference, the process alternates between diffusion denoising and autoregressive decoding, which can make full use of the KV-Cache. We verify the effectiveness of ACDiT on image and video generation tasks. We also demonstrate that, benefiting from autoregressive modeling, ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT for long-horizon visual generation tasks. These strengths make it promising as the backbone of future unified models.
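
The abstract names the Skip-Causal Attention Mask but does not define it. The sketch below shows one plausible reading, not the authors' implementation. It assumes a training sequence laid out as all clean blocks followed by all noised blocks, where a clean block attends to clean blocks up to and including itself, and a noised block attends to strictly earlier clean blocks plus itself. The function name `skip_causal_mask`, the layout, and these attention rules are all assumptions made for illustration.

```python
def skip_causal_mask(num_blocks, block_size):
    """Boolean mask[q][k]: True if query position q may attend to key position k.

    Assumed layout (hypothetical, not taken from the abstract):
      positions [0, n)   -> clean tokens, grouped into num_blocks blocks
      positions [n, 2n)  -> noised copies of the same blocks
    where n = num_blocks * block_size.
    """
    n = num_blocks * block_size
    total = 2 * n

    def block_of(pos):
        # Block index within its half (clean or noised) of the sequence.
        return (pos % n) // block_size

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            q_noised, k_noised = q >= n, k >= n
            qb, kb = block_of(q), block_of(k)
            if not q_noised:
                # Clean query: block-causal over clean keys, never sees noise.
                mask[q][k] = (not k_noised) and kb <= qb
            elif k_noised:
                # Noised query, noised key: only within its own block.
                mask[q][k] = kb == qb
            else:
                # Noised query, clean key: strictly earlier blocks only,
                # skipping the clean copy of its own block.
                mask[q][k] = kb < qb
    return mask


if __name__ == "__main__":
    # Two blocks of one token each: rows are c0, c1, n0, n1.
    for row in skip_causal_mask(num_blocks=2, block_size=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions the two extremes mentioned in the abstract fall out of the block size: with `block_size=1` every noised token is conditioned on all earlier clean tokens (token-wise autoregression), and with `num_blocks=1` the single noised block attends only to itself (full-sequence diffusion).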
