

MonoFormer: One Transformer for Both Diffusion and Autoregression

September 24, 2024
Authors: Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang
cs.AI

Abstract

Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data so that autoregression can be used for both text and visual generation. In this paper, we propose to study a simple idea: share one transformer for both autoregression and diffusion. The feasibility comes from two main aspects: (i) the transformer has been successfully applied to diffusion-based visual generation, and (ii) transformer training for autoregression and diffusion is very similar, differing only in that diffusion uses a bidirectional attention mask while autoregression uses a causal attention mask. Experimental results show that our approach achieves image generation performance comparable to current state-of-the-art methods while maintaining text generation capability. The project is publicly available at https://monoformer.github.io/.
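To make the single-backbone idea concrete, below is a minimal sketch (not the authors' implementation; layer sizes, variable names, and the toy inputs are illustrative assumptions) showing how one shared PyTorch transformer can be run with a causal mask for autoregressive text generation and with full bidirectional attention for diffusion-style denoising, which is the only difference the abstract highlights.

```python
# Minimal sketch: one shared transformer, two attention-mask regimes.
# All sizes and names are illustrative, not taken from the MonoFormer code.
import torch
import torch.nn as nn

d_model, n_heads, n_layers, seq_len = 256, 8, 4, 16

# A single shared backbone reused for both objectives.
shared_transformer = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
    num_layers=n_layers,
)

# Dummy input embeddings (text tokens or noisy visual latents), batch of 2.
x = torch.randn(2, seq_len, d_model)

# Autoregression: causal mask, position i attends only to positions <= i.
causal_mask = torch.triu(
    torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
)
ar_out = shared_transformer(x, mask=causal_mask)

# Diffusion: bidirectional attention, i.e. no mask at all.
diff_out = shared_transformer(x)

print(ar_out.shape, diff_out.shape)  # both torch.Size([2, 16, 256])
```

In this sketch the weights are identical in both calls; only the mask passed at run time changes, which is what makes sharing one transformer for autoregression and diffusion plausible.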

