

VisionZip: Longer is Better but Not Necessary in Vision Language Models

December 5, 2024
Authors: Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
cs.AI

Abstract

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% performance gains across nearly all settings. Moreover, our method significantly enhances model inference speed, improving the prefilling time by 8x and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip.
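The core idea in the abstract — keeping only a small set of informative visual tokens before they reach the language model — can be sketched as a score-based top-k selection. The snippet below is a minimal illustration, not the paper's actual algorithm: it assumes each token comes with an importance score (for example, attention received from a [CLS]-style token), and the function name and shapes are hypothetical.

```python
import numpy as np

def select_informative_tokens(visual_tokens, attn_scores, k):
    """Keep the k visual tokens with the highest importance scores.

    visual_tokens: (N, D) array of encoder output tokens.
    attn_scores:   (N,) per-token importance score (assumed given,
                   e.g. attention from a [CLS]-style token).
    k:             number of tokens to keep (k << N).
    """
    # Take the k highest-scoring indices, then restore original order
    # so the tokens' positional arrangement is preserved.
    keep = np.sort(np.argsort(attn_scores)[-k:])
    return visual_tokens[keep]

# Toy example: 6 tokens of dimension 2, keep the 3 highest-scoring ones.
tokens = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7])
reduced = select_informative_tokens(tokens, scores, k=3)
print(reduced.shape)  # (3, 2)
```

Because the language model's prefill cost scales with sequence length, shrinking N visual tokens down to k directly reduces prefilling time, which is consistent with the speedups the abstract reports.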

