TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation

December 4, 2024
Authors: Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, Xinglong Wu
cs.AI

Abstract

We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder to unify these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off that particularly compromises performance on multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. Through shared indices, this design enables direct access both to the high-level semantic representations crucial for understanding tasks and to the fine-grained visual features essential for generation. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving results comparable to SDXL.
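The shared-index idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the shared index is selected, so the rule used here (minimizing a weighted sum of the semantic and pixel-level distances) is an assumption, and the function name `shared_index_quantize`, the weight `w_pix`, and all shapes are hypothetical.

```python
import numpy as np

def shared_index_quantize(sem_feats, pix_feats, sem_codebook, pix_codebook, w_pix=1.0):
    """Pick ONE index per position that addresses both codebooks.

    sem_feats:    (N, Ds) semantic features
    pix_feats:    (N, Dp) pixel-level features
    sem_codebook: (K, Ds) semantic codebook
    pix_codebook: (K, Dp) pixel-level codebook
    Assumption: the index minimizes d_sem + w_pix * d_pix.
    """
    # Squared Euclidean distance from every feature to every codebook entry.
    d_sem = ((sem_feats[:, None, :] - sem_codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    d_pix = ((pix_feats[:, None, :] - pix_codebook[None, :, :]) ** 2).sum(-1)  # (N, K)

    # A single argmin over the combined distance keeps the two codebooks aligned.
    idx = np.argmin(d_sem + w_pix * d_pix, axis=1)  # (N,)

    # The same index retrieves a semantic code and a pixel-level code.
    return idx, sem_codebook[idx], pix_codebook[idx]

# Toy data: features are copied from known codebook rows, so the
# recovered indices should match `target`.
rng = np.random.default_rng(0)
K, Ds, Dp = 8, 4, 3
sem_codebook = rng.normal(size=(K, Ds))
pix_codebook = rng.normal(size=(K, Dp))
target = np.array([2, 5, 0, 7, 2])

idx, sem_q, pix_q = shared_index_quantize(
    sem_codebook[target], pix_codebook[target], sem_codebook, pix_codebook
)
```

In this sketch, an understanding model would consume `sem_q` while a generation decoder would consume `pix_q`, with both looked up from the same `idx` sequence.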

December 5, 2024