VisionZip: Longer is Better but Not Necessary in Vision Language Models

December 5, 2024
Authors: Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, Jiaya Jia
cs.AI

Abstract

Recent advances in vision-language models have improved performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. VisionZip can be widely applied to image and video understanding tasks and is well suited to multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% across nearly all settings. Moreover, our method significantly improves inference speed, speeding up prefilling by 8x and enabling the LLaVA-Next 13B model to run inference faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
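
The abstract does not spell out how the informative tokens are chosen. As a rough illustration only, the sketch below keeps the k visual tokens with the highest importance score and preserves their original order. The per-token scores (for example, the attention each token receives in the vision encoder) and the function name `select_informative_tokens` are assumptions made for this example, not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def select_informative_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int,
) -> Tuple[List[Sequence[float]], List[int]]:
    """Keep the k highest-scoring tokens, preserving their original order.

    `tokens` and `scores` are hypothetical inputs: one feature vector and
    one importance score per visual token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 <= k <= len(tokens):
        raise ValueError("k must be between 0 and the number of tokens")

    # Rank token indices by descending score; ties keep the earlier token
    # because Python's sort is stable.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore positional order so the kept tokens stay in sequence.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept], kept


# Toy example: 4 visual tokens with 1-dimensional features (made-up values).
toy_tokens = [[0.1], [0.2], [0.3], [0.4]]
toy_scores = [0.05, 0.9, 0.1, 0.7]
kept_tokens, kept_indices = select_informative_tokens(toy_tokens, toy_scores, k=2)
print(kept_indices)  # [1, 3]
```

Under these assumptions, the language model would receive only the k kept tokens instead of the full visual sequence, which is where the prefilling savings would come from.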
