ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Abstract

The efficiency of large vision-language models (LVLMs) is constrained by two bottlenecks: the computational cost of the attention mechanism during the prefill phase and the memory cost of fetching the key-value (KV) cache during the decoding phase. Both are most severe in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, which results in highly sparse attention maps within LVLMs. This sparsity can be exploited either to accelerate attention computation or to compress the KV cache. However, most studies address only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity across layers or tasks. In this paper, we present ZipVL, an efficient inference framework for LVLMs that resolves both the computation and memory bottlenecks through a dynamic ratio allocation strategy for important tokens. The ratio is determined adaptively from the layer-specific distribution of attention scores rather than from fixed hyper-parameters, which improves efficiency on less complex tasks while maintaining high performance on more challenging ones. We then select important tokens based on their normalized attention scores and compute attention only over those tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we apply mixed-precision quantization to the KV cache: the caches of important tokens use high-bit quantization, while those of less important tokens use low-bit quantization. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%, with an accuracy reduction of only 0.2% on the Video-MME benchmark with the LongVA-7B model, effectively enhancing the generation efficiency of LVLMs.
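The two mechanisms described above, adaptive selection of important tokens and mixed-precision KV cache quantization, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes that the token ratio is set by keeping the smallest set of tokens whose normalized attention scores sum to at least a threshold `tau`, and that the cache is compressed with plain uniform min-max quantization. The function names, the `tau` parameter, and the default bit widths (4 and 2) are all hypothetical choices made for illustration.

```python
from typing import List, Set


def select_important_tokens(scores: List[float], tau: float = 0.95) -> Set[int]:
    """Return the smallest set of token indices whose normalized
    attention scores sum to at least `tau` (assumed selection rule)."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    normalized = [s / total for s in scores]
    # Visit tokens from highest to lowest normalized score.
    order = sorted(range(len(scores)), key=lambda i: normalized[i], reverse=True)
    chosen: Set[int] = set()
    mass = 0.0
    for i in order:
        if mass >= tau:
            break
        chosen.add(i)
        mass += normalized[i]
    return chosen


def fake_quantize(values: List[float], bits: int) -> List[float]:
    """Uniform min-max quantization to `bits` bits, followed by
    dequantization, so the rounding error can be inspected directly."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # constant vector: nothing to quantize
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    return [lo + round((v - lo) / scale) * scale for v in values]


def compress_kv(cache: List[List[float]], important: Set[int],
                high_bits: int = 4, low_bits: int = 2) -> List[List[float]]:
    """Quantize each token's cache row: more bits for important tokens,
    fewer bits for the rest (bit widths are illustrative assumptions)."""
    return [
        fake_quantize(row, high_bits if i in important else low_bits)
        for i, row in enumerate(cache)
    ]
```

Under this rule, a sharply peaked score distribution keeps few tokens and a flat one keeps many, so the retained ratio varies with the attention distribution instead of being fixed in advance. For example, `select_important_tokens([8.0, 1.0, 1.0], tau=0.7)` keeps only token 0, whereas `select_important_tokens([1.0, 1.0, 1.0, 1.0], tau=0.7)` keeps three of the four tokens.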
