Causal Diffusion Transformers for Generative Modeling

Abstract

We introduce Causal Diffusion as the autoregressive (AR) counterpart of diffusion models. It is a next-token(s) forecasting framework that handles both discrete and continuous modalities and is compatible with existing next-token prediction models such as LLaMA and GPT. While recent works attempt to combine diffusion with AR models, we show that introducing sequential factorization into a diffusion model substantially improves its performance and enables a smooth transition between AR and diffusion generation modes. We therefore propose CausalFusion, a decoder-only transformer that dual-factorizes data across sequential tokens and diffusion noise levels. It achieves state-of-the-art results on the ImageNet generation benchmark while retaining the AR advantage of generating an arbitrary number of tokens for in-context reasoning. We further demonstrate CausalFusion's multimodal capabilities through a joint image generation and captioning model, and showcase its ability to perform zero-shot in-context image manipulation. We hope this work offers the community a fresh perspective on training multimodal models over discrete and continuous data.
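The idea of factorizing over both sequential tokens and diffusion noise levels can be illustrated with a small scheduling sketch. This is only one reading of the abstract, not the authors' implementation: the function names `split_into_ar_steps` and `generation_schedule`, the use of contiguous equal-sized chunks, and the integer noise levels are all assumptions made for illustration. The sketch shows how a single AR step reduces to ordinary diffusion over all tokens, while one token per AR step reduces to next-token generation.

```python
from typing import List, Tuple


def split_into_ar_steps(num_tokens: int, num_steps: int) -> List[List[int]]:
    """Partition token indices into contiguous chunks, one per AR step.

    Hypothetical helper: contiguous, near-equal chunks are an assumption.
    """
    if not 1 <= num_steps <= num_tokens:
        raise ValueError("num_steps must be between 1 and num_tokens")
    base, extra = divmod(num_tokens, num_steps)
    chunks, start = [], 0
    for step in range(num_steps):
        # Spread any remainder over the first `extra` chunks.
        size = base + (1 if step < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def generation_schedule(
    num_tokens: int, num_steps: int, num_noise_levels: int
) -> List[Tuple[int, int, List[int], List[int]]]:
    """List (ar_step, noise_level, tokens_being_denoised, clean_context).

    The outer loop is the sequential (AR) axis; the inner loop is the
    noise-level (diffusion) axis, running from most to least noisy.
    """
    schedule = []
    context: List[int] = []
    for ar_step, chunk in enumerate(split_into_ar_steps(num_tokens, num_steps)):
        for noise_level in range(num_noise_levels, 0, -1):
            schedule.append((ar_step, noise_level, chunk, list(context)))
        # Once denoised, the chunk becomes clean context for later AR steps.
        context.extend(chunk)
    return schedule


if __name__ == "__main__":
    # One AR step: every token is denoised jointly (pure diffusion mode).
    print(split_into_ar_steps(8, 1))
    # One token per AR step: next-token generation (pure AR mode).
    print(split_into_ar_steps(4, 4))
    for row in generation_schedule(8, 2, 3):
        print(row)
```

Under these assumptions, varying `num_steps` between 1 and `num_tokens` moves between the two generation modes the abstract describes.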
