MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Abstract

Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle to balance generation quality against representation learning and efficiency in a shared latent space. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from the latent space with a token merge module placed after the self-attention blocks in the encoder, for subsequent Look-up Free Quantization (LFQ) and global alignment, and recovers their fine-grained details through cross-attention in the decoder for reconstruction. For second-stage generation, we introduce MergeAR, which performs KV cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ, as an autoregressive (AR) generative model, achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
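
For readers unfamiliar with Look-up Free Quantization, the following is a minimal sketch of the generic sign-based formulation: each latent dimension is snapped to -1 or +1, and the token index is read directly from the resulting bit pattern, so no codebook search is needed. The abstract does not specify MergeVQ's quantizer settings, so this should not be read as the authors' implementation; the function names `lfq_quantize` and `lfq_decode` and the 4-dimensional example are assumptions made purely for illustration.

```python
from typing import List, Tuple


def lfq_quantize(latent: List[float]) -> Tuple[List[int], int]:
    """Quantize one latent vector by sign, without any codebook lookup.

    Hypothetical helper for illustration only.
    Returns the binary code (+1 / -1 per dimension) and the implied token index.
    """
    # A dimension contributes a 1-bit only when it is strictly positive,
    # so zero and negative values both map to -1.
    bits = [1 if z > 0 else 0 for z in latent]
    code = [1 if b else -1 for b in bits]
    # The token index is the integer whose binary digits are the bits
    # (dimension i contributes 2**i).
    index = sum(b << i for i, b in enumerate(bits))
    return code, index


def lfq_decode(index: int, dim: int) -> List[int]:
    """Recover the +1 / -1 code from a token index (inverse of the mapping above)."""
    return [1 if (index >> i) & 1 else -1 for i in range(dim)]


if __name__ == "__main__":
    code, index = lfq_quantize([0.3, -1.2, 0.0, 2.5])
    print(code, index)
    print(lfq_decode(index, dim=4))
```

With a latent of dimension d, this scheme yields an implicit vocabulary of 2^d tokens without storing or searching an embedding table, which is the property the name "look-up free" refers to.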
