Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Abstract

Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.
