VisionZip: Longer is Better but Not Necessary in Vision Language Models

Abstract

Recent advancements in vision-language models have enhanced performance by increasing the length of visual tokens, making them much longer than text tokens and significantly raising computational costs. However, we observe that the visual tokens generated by popular vision encoders, such as CLIP and SigLIP, contain significant redundancy. To address this, we introduce VisionZip, a simple yet effective method that selects a set of informative tokens for input to the language model, reducing visual token redundancy and improving efficiency while maintaining model performance. The proposed VisionZip can be widely applied to image and video understanding tasks and is well-suited for multi-turn dialogues in real-world scenarios, where previous methods tend to underperform. Experimental results show that VisionZip outperforms the previous state-of-the-art method by at least 5% in nearly all settings. Moreover, our method significantly enhances model inference speed, making prefilling 8x faster and enabling the LLaVA-Next 13B model to infer faster than the LLaVA-Next 7B model while achieving better results. Furthermore, we analyze the causes of this redundancy and encourage the community to focus on extracting better visual features rather than merely increasing token length. Our code is available at https://github.com/dvlab-research/VisionZip .
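The core idea of passing only a subset of informative visual tokens to the language model can be illustrated with a minimal sketch. The abstract does not state how informativeness is measured, so the per-token `scores` below are a hypothetical stand-in, and the function name `select_top_k_tokens` is ours for illustration; this is not the VisionZip implementation, which is available in the repository linked above.

```python
from typing import List, Sequence


def select_top_k_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int,
) -> List[Sequence[float]]:
    """Keep the k highest-scoring visual tokens, preserving their original order.

    `scores` is a hypothetical per-token importance score (an assumption);
    the abstract does not say how informativeness is measured.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 <= k <= len(tokens):
        raise ValueError("k must be between 0 and the number of tokens")
    # Rank token indices by score, highest first (ties broken by lower index).
    ranked = sorted(range(len(tokens)), key=lambda i: (-scores[i], i))
    # Restore the original sequence order among the kept tokens.
    kept = sorted(ranked[:k])
    return [tokens[i] for i in kept]


# Toy example: 6 visual tokens with 2-dimensional features, several redundant.
visual_tokens = [[0.1, 0.2], [0.9, 0.8], [0.1, 0.2], [0.4, 0.5], [0.1, 0.2], [0.7, 0.6]]
importance = [0.05, 0.40, 0.05, 0.15, 0.05, 0.30]
compact = select_top_k_tokens(visual_tokens, importance, k=3)
print(len(visual_tokens), "->", len(compact))  # prints "6 -> 3"
```

In this toy input, the three identical low-scoring tokens are dropped and the language model would receive half as many visual tokens, which is the source of the efficiency gain the abstract describes.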
