Causal Diffusion Transformers for Generative Modeling
December 16, 2024
Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guan, Haoqi Fan
cs.AI
Abstract
We introduce Causal Diffusion as the autoregressive (AR) counterpart of
Diffusion models. It is a next-token(s) forecasting framework that is friendly
to both discrete and continuous modalities and compatible with existing
next-token prediction models like LLaMA and GPT. While recent works attempt to
combine diffusion with AR models, we show that introducing sequential
factorization to a diffusion model can substantially improve its performance
and enable a smooth transition between AR and diffusion generation modes.
Hence, we propose CausalFusion - a decoder-only transformer that
dual-factorizes data across sequential tokens and diffusion noise levels,
leading to state-of-the-art results on the ImageNet generation benchmark while
also enjoying the AR advantage of generating an arbitrary number of tokens for
in-context reasoning. We further demonstrate CausalFusion's multimodal
capabilities through a joint image generation and captioning model, and
showcase CausalFusion's ability for zero-shot in-context image manipulations.
We hope that this work could provide the community with a fresh perspective on
training multimodal models over discrete and continuous data.
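To make the dual factorization concrete, here is a minimal training-step sketch in PyTorch. It is an illustration under stated assumptions, not the paper's implementation: the model signature, the equal-sized AR blocks, the single sampled AR step per batch, and the linear toy noise schedule are all hypothetical simplifications. The idea it captures is the one named in the abstract: data is factorized along the token sequence (AR blocks) and, within the current block, along diffusion noise levels, with the decoder-only transformer conditioning on clean past tokens while denoising the present ones.

import torch
import torch.nn.functional as F

def causal_fusion_loss(model, x, num_ar_steps=4, num_noise_levels=1000):
    """One dual-factorized training step (illustrative sketch).

    x: (B, L, D) continuous token sequence.
    model(tokens, t) -> noise prediction per token (assumed interface).
    """
    B, L, D = x.shape
    block = L // num_ar_steps

    # Sequential factorization: pick an AR step; tokens before it are
    # clean context, tokens at it are the targets to denoise.
    s = torch.randint(0, num_ar_steps, (1,)).item()
    ctx = x[:, : s * block]
    cur = x[:, s * block : (s + 1) * block]

    # Diffusion factorization: noise only the current block.
    t = torch.randint(0, num_noise_levels, (B,))
    alpha = (1.0 - t.float() / num_noise_levels).view(B, 1, 1)  # toy schedule
    eps = torch.randn_like(cur)
    cur_noisy = alpha.sqrt() * cur + (1.0 - alpha).sqrt() * eps

    # Decoder-only transformer sees clean context + noisy current tokens
    # and predicts the noise added to the current block.
    eps_pred = model(torch.cat([ctx, cur_noisy], dim=1), t)[:, -block:]
    return F.mse_loss(eps_pred, eps)

Under this framing, the "smooth transition" the abstract describes corresponds to varying num_ar_steps: with one AR step the objective reduces to a standard whole-image diffusion loss, while with one token per step it approaches pure next-token AR generation.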