Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
January 13, 2025
Authors: Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen
cs.AI
Abstract
Image tokenizers form the foundation of modern text-to-image generative
models but are notoriously difficult to train. Furthermore, most existing
text-to-image models rely on large-scale, high-quality private datasets, making
them challenging to replicate. In this work, we introduce Text-Aware
Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful
image tokenizer that can utilize either discrete or continuous 1-dimensional
tokens. TA-TiTok uniquely integrates textual information during the tokenizer
decoding stage (i.e., de-tokenization), accelerating convergence and enhancing
performance. TA-TiTok also benefits from a simplified, yet effective, one-stage
training process, eliminating the need for the complex two-stage distillation
used in previous 1-dimensional tokenizers. This design allows for seamless
scalability to large datasets. Building on this, we introduce a family of
text-to-image Masked Generative Models (MaskGen), trained exclusively on open
data while achieving comparable performance to models trained on private data.
We aim to release both the efficient, strong TA-TiTok tokenizers and the
open-data, open-weight MaskGen models to promote broader access and democratize
the field of text-to-image masked generative models.
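The masked generative decoding that models like MaskGen build on can be sketched as an iterative parallel-sampling loop over a 1-D token sequence: all tokens start masked, the transformer predicts every position at once, and the most confident predictions are kept while the rest are re-masked on a decaying schedule. The sketch below is a minimal, self-contained illustration of that loop with a toy stand-in predictor and a cosine schedule; all function names, shapes, and parameters are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def cosine_schedule(t):
    # Fraction of tokens still masked at progress t in [0, 1].
    return np.cos(0.5 * np.pi * t)

def masked_generation(predict_fn, num_tokens=32, vocab_size=1024,
                      steps=8, mask_id=-1, seed=0):
    """Iterative parallel decoding over a 1-D token sequence.

    `predict_fn(tokens)` stands in for the trained transformer and
    returns (num_tokens, vocab_size) logits (hypothetical signature).
    """
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, mask_id, dtype=np.int64)
    for step in range(steps):
        logits = predict_fn(tokens)
        # Softmax over the vocabulary for each position.
        probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
        probs /= probs.sum(axis=-1, keepdims=True)
        sampled = np.array([rng.choice(vocab_size, p=p) for p in probs])
        conf = probs[np.arange(num_tokens), sampled]
        # Positions already committed in earlier steps stay committed.
        conf[tokens != mask_id] = np.inf
        # Number of tokens that remain masked after this step.
        n_mask = int(np.floor(cosine_schedule((step + 1) / steps) * num_tokens))
        keep = np.argsort(-conf)[: num_tokens - n_mask]
        new_tokens = np.full(num_tokens, mask_id, dtype=np.int64)
        new_tokens[keep] = np.where(tokens[keep] != mask_id,
                                    tokens[keep], sampled[keep])
        tokens = new_tokens
    return tokens

# Toy predictor: random logits in place of the real text-conditioned model.
toy = lambda toks: np.random.default_rng(0).normal(size=(32, 1024))
out = masked_generation(toy)
```

Because the schedule reaches zero masked tokens at the final step, the loop always returns a fully decoded sequence; in the real model, the predictor would also be conditioned on the text prompt, and the resulting 1-D tokens would be passed through the text-aware de-tokenizer to produce the image.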