MonoFormer: One Transformer for Both Diffusion and Autoregression
September 24, 2024
Authors: Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang
cs.AI
Abstract
Most existing multimodal methods either use separate backbones for
autoregression-based discrete text generation and diffusion-based continuous
visual generation, or use a single backbone by discretizing the visual data so
that autoregression applies to both text and visual generation. In this paper,
we propose to study a simple idea: share one transformer for both
autoregression and diffusion. The feasibility comes from two main aspects:
(i) transformers have been successfully applied to diffusion for visual
generation, and (ii) transformer training for autoregression and diffusion is
very similar; the difference lies merely in that diffusion uses a bidirectional
attention mask while autoregression uses a causal attention mask. Experimental
results show that our approach achieves image generation performance comparable
to current state-of-the-art methods while maintaining text generation
capability.
The project is publicly available at https://monoformer.github.io/.
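
To make the distinction between the two attention masks concrete, the following is a minimal sketch in plain Python. It is an illustration only, not the authors' implementation: the function names, the boolean-matrix convention (mask[i][j] is True when query position i may attend to key position j), and in particular the mixed_mask combination rule are assumptions introduced here and are not stated in the abstract.

```python
def causal_mask(n):
    """Autoregression: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Diffusion: every position may attend to every other position."""
    return [[True] * n for _ in range(n)]


def mixed_mask(kinds):
    """Hypothetical combined mask for a sequence of "text"/"image" tokens.

    Assumption (not from the abstract): the sequence is causal overall,
    but tokens inside the same contiguous image span see each other.
    """
    n = len(kinds)

    # Give each contiguous run of image tokens its own span id.
    span = [None] * n
    current = -1
    for i, kind in enumerate(kinds):
        if kind == "image":
            if i == 0 or kinds[i - 1] != "image":
                current += 1
            span[i] = current

    # Start from a causal mask, then open up attention within each image span.
    mask = causal_mask(n)
    for i in range(n):
        for j in range(n):
            if span[i] is not None and span[i] == span[j]:
                mask[i][j] = True
    return mask


if __name__ == "__main__":
    for row in mixed_mask(["text", "text", "image", "image", "text"]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, the only thing that changes between the two training modes is the mask handed to the attention layers; the transformer weights themselves can be shared.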