Causal Diffusion Transformers for Generative Modeling
December 16, 2024
Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guan, Haoqi Fan
cs.AI
Abstract
We introduce Causal Diffusion as the autoregressive (AR) counterpart of
Diffusion models. It is a next-token(s) forecasting framework that is friendly
to both discrete and continuous modalities and compatible with existing
next-token prediction models like LLaMA and GPT. While recent works attempt to
combine diffusion with AR models, we show that introducing sequential
factorization to a diffusion model can substantially improve its performance
and enable a smooth transition between AR and diffusion generation modes.
Hence, we propose CausalFusion - a decoder-only transformer that
dual-factorizes data across sequential tokens and diffusion noise levels,
leading to state-of-the-art results on the ImageNet generation benchmark while
also enjoying the AR advantage of generating an arbitrary number of tokens for
in-context reasoning. We further demonstrate CausalFusion's multimodal
capabilities through a joint image generation and captioning model, and
showcase CausalFusion's ability for zero-shot in-context image manipulations.
We hope that this work could provide the community with a fresh perspective on
training multimodal models over discrete and continuous data.
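To make the dual factorization concrete, here is a minimal training-step sketch in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the model signature, the equal-sized AR blocks, the single sampled AR step per batch, and the linear toy noise schedule are all hypothetical simplifications. The idea it captures is the one named in the abstract: data is factorized along the token sequence (AR blocks) and, within the current block, along diffusion noise levels, with the decoder-only transformer conditioning on clean past tokens while denoising the present ones.

import torch
import torch.nn.functional as F

def causal_fusion_loss(model, x, num_ar_steps=4, num_noise_levels=1000):
    """One dual-factorized training step (illustrative sketch).

    x: (B, L, D) continuous token sequence.
    model(tokens, t) -> noise prediction per token (assumed interface).
    """
    B, L, D = x.shape
    block = L // num_ar_steps

    # Sequential factorization: pick an AR step; tokens before it are
    # clean context, tokens at it are the targets to denoise.
    s = torch.randint(0, num_ar_steps, (1,)).item()
    ctx = x[:, : s * block]
    cur = x[:, s * block : (s + 1) * block]

    # Diffusion factorization: noise only the current block.
    t = torch.randint(0, num_noise_levels, (B,))
    alpha = (1.0 - t.float() / num_noise_levels).view(B, 1, 1)  # toy schedule
    eps = torch.randn_like(cur)
    cur_noisy = alpha.sqrt() * cur + (1.0 - alpha).sqrt() * eps

    # Decoder-only transformer sees clean context + noisy current tokens
    # and predicts the noise added to the current block.
    eps_pred = model(torch.cat([ctx, cur_noisy], dim=1), t)[:, -block:]
    return F.mse_loss(eps_pred, eps)

Under this framing, the "smooth transition" the abstract describes corresponds to varying num_ar_steps: with one AR step the objective reduces to a standard whole-image diffusion loss, while with one token per step it approaches pure next-token AR generation.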