TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
December 4, 2024
Authors: Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu
cs.AI
Abstract
We present TokenFlow, a novel unified image tokenizer that bridges the
long-standing gap between multimodal understanding and generation. Prior
research attempts to employ a single reconstruction-targeted Vector Quantization
(VQ) encoder for unifying these two tasks. We observe that understanding and
generation require fundamentally different granularities of visual information.
This leads to a critical trade-off, particularly compromising performance in
multimodal understanding tasks. TokenFlow addresses this challenge through an
innovative dual-codebook architecture that decouples semantic and pixel-level
feature learning while maintaining their alignment via a shared mapping
mechanism. This design enables direct access to both high-level semantic
representations crucial for understanding tasks and fine-grained visual
features essential for generation through shared indices. Our extensive
experiments demonstrate TokenFlow's superiority across multiple dimensions.
Leveraging TokenFlow, we demonstrate for the first time that discrete visual
input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2%
average improvement. For image reconstruction, we achieve a strong FID score of
0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art
performance in autoregressive image generation with a GenEval score of 0.55 at
256×256 resolution, achieving results comparable to SDXL.
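
To make the shared-index idea concrete, here is a minimal sketch of a dual-codebook lookup in plain Python. The abstract does not say how the shared index is chosen, so the weighted joint distance used here, the function names (`shared_index`, `quantize`), the weight `w_pix`, and the toy codebooks are all illustrative assumptions, not the authors' implementation.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def shared_index(sem_feat, pix_feat, sem_codebook, pix_codebook, w_pix=1.0):
    """Pick ONE index for both codebooks.

    Assumption: the index minimizes a weighted sum of the semantic
    distance and the pixel-level distance. The abstract only states
    that the two codebooks share indices.
    """
    assert len(sem_codebook) == len(pix_codebook), "codebooks must be paired"
    costs = [
        sq_dist(sem_feat, s) + w_pix * sq_dist(pix_feat, p)
        for s, p in zip(sem_codebook, pix_codebook)
    ]
    return min(range(len(costs)), key=costs.__getitem__)


def quantize(sem_feats, pix_feats, sem_codebook, pix_codebook, w_pix=1.0):
    """Quantize per-patch features; one index yields both quantized views."""
    indices = [
        shared_index(s, p, sem_codebook, pix_codebook, w_pix)
        for s, p in zip(sem_feats, pix_feats)
    ]
    sem_q = [sem_codebook[k] for k in indices]  # high-level view (understanding)
    pix_q = [pix_codebook[k] for k in indices]  # fine-grained view (generation)
    return indices, sem_q, pix_q


# Toy data (hypothetical): 3 paired codebook entries, 2 image patches.
sem_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
pix_codebook = [[0.0], [1.0], [5.0]]
sem_feats = [[0.1, 0.9], [0.9, 1.0]]
pix_feats = [[4.0], [0.8]]

indices, sem_q, pix_q = quantize(sem_feats, pix_feats, sem_codebook, pix_codebook)
print(indices, sem_q, pix_q)
```

Because both codebooks are addressed by the same index, a downstream model only needs to predict one discrete token per patch, and either the semantic or the pixel-level embedding can be retrieved from it.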