MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
April 1, 2025
Authors: Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei
cs.AI
Abstract
Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great
success in both self-supervised pre-training and image generation. However,
most existing methods struggle to balance generation quality against
representation learning and efficiency in a shared latent space. To push the
limits of this paradigm, we propose MergeVQ, which incorporates token merging
techniques into VQ-based generative models to bridge the gap between image
generation and visual representation learning in a unified architecture. During
pre-training, MergeVQ uses a token merge module, placed after the self-attention
blocks in the encoder, to decouple top-k semantics from the latent space for
subsequent Look-up Free Quantization (LFQ) and global alignment; the decoder then
recovers the fine-grained details through cross-attention for reconstruction.
As for the second-stage generation, we introduce MergeAR, which performs KV
Cache compression for efficient raster-order prediction. Extensive experiments
on ImageNet verify that MergeVQ as an AR generative model achieves competitive
performance in both visual representation learning and image generation tasks
while maintaining favorable token efficiency and inference speed. The code and
model will be available at https://apexgen-x.github.io/MergeVQ.
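
The abstract names Look-up Free Quantization (LFQ) without spelling out its mechanics. As a rough illustration only, the sketch below follows the commonly described sign-based formulation of LFQ, in which each latent dimension is binarized and the token index is read directly from the resulting bit pattern, so no codebook lookup is needed. The function names `lfq_quantize` and `lfq_decode` and the 4-dimensional example are hypothetical and are not taken from the MergeVQ code.

```python
from typing import List, Tuple


def lfq_quantize(z: List[float]) -> Tuple[List[int], int]:
    """Quantize one latent vector with a sign-based, lookup-free rule.

    Assumption: each dimension maps to +1 if positive, otherwise -1,
    and the token index is the binary number whose i-th bit is set
    when dimension i is +1. A d-dimensional latent therefore implies
    a vocabulary of 2**d tokens without storing any codebook.
    """
    codes = [1 if v > 0 else -1 for v in z]
    index = sum(1 << i for i, c in enumerate(codes) if c == 1)
    return codes, index


def lfq_decode(index: int, dim: int) -> List[int]:
    """Recover the +1/-1 code vector from a token index."""
    return [1 if (index >> i) & 1 else -1 for i in range(dim)]


if __name__ == "__main__":
    latent = [0.7, -1.2, 0.1, -0.3]  # hypothetical 4-dim latent
    codes, index = lfq_quantize(latent)
    print(codes, index)              # [1, -1, 1, -1] 5
    print(lfq_decode(index, len(latent)))
```

Because the index is computed from signs alone, quantization costs one comparison per dimension. How MergeVQ combines this step with token merging and global alignment is not specified in the abstract.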