Causal Diffusion Transformers for Generative Modeling
December 16, 2024
Authors: Chaorui Deng, Deyao Zhu, Kunchang Li, Shi Guang, Haoqi Fan
cs.AI
Abstract
We introduce Causal Diffusion as the autoregressive (AR) counterpart of
Diffusion models. It is a next-token(s) forecasting framework that is friendly
to both discrete and continuous modalities and compatible with existing
next-token prediction models like LLaMA and GPT. While recent works attempt to
combine diffusion with AR models, we show that introducing sequential
factorization to a diffusion model can substantially improve its performance
and enable a smooth transition between AR and diffusion generation modes.
Hence, we propose CausalFusion - a decoder-only transformer that
dual-factorizes data across sequential tokens and diffusion noise levels,
leading to state-of-the-art results on the ImageNet generation benchmark while
also enjoying the AR advantage of generating an arbitrary number of tokens for
in-context reasoning. We further demonstrate CausalFusion's multimodal
capabilities through a joint image generation and captioning model, and
showcase CausalFusion's ability for zero-shot in-context image manipulations.
We hope that this work could provide the community with a fresh perspective on
training multimodal models over discrete and continuous data.
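The abstract's central idea, dual-factorizing data across sequential tokens and diffusion noise levels, can be made concrete with a short training-step sketch. The code below is a minimal illustration under assumptions of mine, not the authors' implementation: the `model` signature, the linear noise schedule, and the epsilon-prediction loss are hypothetical stand-ins for whatever CausalFusion actually uses.

```python
# Minimal sketch of a dual-factorized training objective in the spirit
# of CausalFusion. All names (model, schedule) are illustrative
# assumptions, not the paper's released code.
import torch
import torch.nn.functional as F

def dual_factorized_loss(model, x0, T=1000):
    """x0: (B, L, D) continuous tokens, e.g. patchified image latents."""
    B, L, D = x0.shape

    # Sequential (AR) factorization: split the sequence into a clean
    # prefix (context from earlier AR steps) and the current AR step's
    # target tokens. s = 0 recovers plain diffusion over all tokens.
    s = int(torch.randint(0, L, (1,)))
    ctx, tgt = x0[:, :s], x0[:, s:]

    # Diffusion factorization: corrupt only the target tokens at a
    # random noise level t (a simple linear alpha-bar schedule is
    # assumed here; any standard schedule would do).
    t = torch.randint(1, T + 1, (B,))
    a_bar = (1.0 - t.float() / T).clamp(min=1e-4).view(B, 1, 1)
    eps = torch.randn_like(tgt)
    tgt_t = a_bar.sqrt() * tgt + (1.0 - a_bar).sqrt() * eps

    # A decoder-only transformer attends causally over the clean prefix
    # and regresses the noise injected into the suffix tokens.
    eps_hat = model(ctx, tgt_t, t)
    return F.mse_loss(eps_hat, eps)
```

Read this way, the smooth transition the abstract mentions falls out naturally: advancing the split point s token by token at sampling time behaves like AR generation, while s = 0 noises the whole sequence and reduces to standard diffusion.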