QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
February 7, 2025
Authors: Yue Zhao, Fuzhao Xue, Scott Reed, Linxi Fan, Yuke Zhu, Jan Kautz, Zhiding Yu, Philipp Krähenbühl, De-An Huang
cs.AI
Abstract
We introduce Quantized Language-Image Pretraining (QLIP), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding. QLIP trains a binary-spherical-quantization-based autoencoder with reconstruction and language-image alignment objectives. We are the first to show that the two objectives do not need to be at odds. We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively reconciles the large-batch requirement of image-language pre-training with the memory bottleneck imposed by the reconstruction objective. We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model. Specifically, QLIP serves as a drop-in replacement for the visual encoder of LLaVA and the image tokenizer of LlamaGen, with comparable or even better performance. Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
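The abstract also notes that the reconstruction and alignment losses are balanced dynamically during training. The exact rule is given in the paper; one plausible realization, weighting each term by the reciprocal of its current magnitude so that neither objective dominates the gradient, could look like the following (the function name and smoothing constant are assumptions):

```python
import torch

def balanced_loss(recon_loss: torch.Tensor, align_loss: torch.Tensor,
                  eps: float = 1e-8) -> torch.Tensor:
    """Hypothetical dynamic balancing: scale each objective by the inverse
    of its own detached value, so both terms contribute comparably to the
    total gradient regardless of their raw scales."""
    w_recon = 1.0 / (recon_loss.detach() + eps)
    w_align = 1.0 / (align_loss.detach() + eps)
    return w_recon * recon_loss + w_align * align_loss
```

Because the weights are detached, they act as per-step scaling constants only; a smoothed (e.g., EMA-based) variant of the same idea would reduce step-to-step noise in the weighting.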