

UniTok: A Unified Tokenizer for Visual Generation and Understanding

February 27, 2025
Authors: Chuofan Ma, Yi Jiang, Junfeng Wu, Jihan Yang, Xin Yu, Zehuan Yuan, Bingyue Peng, Xiaojuan Qi
cs.AI

Abstract

The representation disparity between visual generation and understanding poses a critical obstacle to integrating these capabilities into a single framework. To bridge this gap, we introduce UniTok, a discrete visual tokenizer that encodes fine-grained details for generation while also capturing high-level semantics for understanding. Although recent studies have shown that these objectives can induce loss conflicts during training, we reveal that the underlying bottleneck stems from the limited representational capacity of discrete tokens. We address this by introducing multi-codebook quantization, which divides vector quantization across several independent sub-codebooks to expand the latent feature space while avoiding the training instability caused by overlarge codebooks. Our method significantly raises the upper limit of unified discrete tokenizers, allowing them to match or even surpass domain-specific continuous tokenizers. For instance, UniTok achieves a remarkable rFID of 0.38 (versus 0.87 for SD-VAE) and a zero-shot accuracy of 78.6% (versus 76.2% for CLIP) on ImageNet. Our code is available at https://github.com/FoundationVision/UniTok.
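To make the core idea concrete, below is a minimal sketch of multi-codebook quantization, not the authors' implementation (see their repository for that): the latent vector is split into chunks, and each chunk is quantized against its own small sub-codebook, so the effective vocabulary grows multiplicatively while each nearest-neighbor lookup stays small. Names such as `MultiCodebookQuantizer`, `num_codebooks`, and `codebook_size` are hypothetical, and the sketch assumes an encoder that produces a flat per-token latent vector.

```python
import torch
import torch.nn as nn

class MultiCodebookQuantizer(nn.Module):
    """Illustrative multi-codebook vector quantization (not the official UniTok code).

    The latent vector is split into num_codebooks chunks; each chunk is
    quantized with an independent sub-codebook. The joint vocabulary is
    codebook_size ** num_codebooks, expanding the latent feature space
    without a single overlarge (and hard-to-train) codebook.
    """

    def __init__(self, latent_dim: int = 256, num_codebooks: int = 8,
                 codebook_size: int = 4096):
        super().__init__()
        assert latent_dim % num_codebooks == 0
        self.num_codebooks = num_codebooks
        chunk_dim = latent_dim // num_codebooks
        # One independent embedding table per sub-codebook.
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, chunk_dim) for _ in range(num_codebooks)
        )

    def forward(self, z: torch.Tensor):
        # z: (batch, latent_dim) continuous encoder output.
        chunks = z.chunk(self.num_codebooks, dim=-1)
        quantized, indices = [], []
        for chunk, book in zip(chunks, self.codebooks):
            # Nearest codeword by Euclidean distance within this sub-codebook.
            dists = torch.cdist(chunk, book.weight)  # (batch, codebook_size)
            idx = dists.argmin(dim=-1)               # (batch,)
            q = book(idx)                            # (batch, chunk_dim)
            # Straight-through estimator: gradients bypass the argmin.
            quantized.append(chunk + (q - chunk).detach())
            indices.append(idx)
        # Concatenated quantized latent and per-sub-codebook token indices.
        return torch.cat(quantized, dim=-1), torch.stack(indices, dim=-1)
```

With the default sizes above, each lookup searches only 4096 codewords, yet the combined code space is 4096^8, which is the sense in which sub-codebooks expand representational capacity while keeping training stable.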
