Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

January 13, 2025
Authors: Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen
cs.AI

Abstract

Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
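The abstract's key architectural idea is that TA-TiTok injects textual information at the de-tokenization (decoding) stage rather than only at generation time. A minimal sketch of that idea, with all sizes and names hypothetical (the paper's actual decoder is a transformer and these dimensions are illustrative, not taken from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not from the paper).
num_image_tokens = 128   # compact 1-D latent token sequence
num_text_tokens = 77     # e.g. a CLIP-style caption embedding length
dim = 64                 # shared embedding width

image_tokens = rng.standard_normal((num_image_tokens, dim))
text_embeds = rng.standard_normal((num_text_tokens, dim))

def text_aware_detokenize(image_tokens, text_embeds):
    """Sketch of text-aware de-tokenization: the decoder attends over the
    concatenation of 1-D image tokens and text embeddings, then keeps only
    the image-token positions for pixel reconstruction."""
    seq = np.concatenate([image_tokens, text_embeds], axis=0)
    # Stand-in for a transformer decoder layer: one self-attention-like mix.
    attn = seq @ seq.T / np.sqrt(seq.shape[1])
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    mixed = attn @ seq
    return mixed[: image_tokens.shape[0]]  # image positions only

decoded = text_aware_detokenize(image_tokens, text_embeds)
print(decoded.shape)  # → (128, 64)
```

The point of the sketch is the information flow: because the caption embeddings sit in the decoder's attention context, the reconstruction of each image token can condition on the text, which is what the abstract credits for faster convergence and better performance.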
