ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

October 11, 2024
Authors: Yefei He, Feng Chen, Jing Liu, Wenqi Shao, Hong Zhou, Kaipeng Zhang, Bohan Zhuang
cs.AI

Abstract

The efficiency of large vision-language models (LVLMs) is constrained by the computational bottleneck of the attention mechanism during the prefill phase and by the memory bottleneck of fetching the key-value (KV) cache in the decoding phase, particularly in scenarios involving high-resolution images or videos. Visual content often exhibits substantial redundancy, resulting in highly sparse attention maps within LVLMs. This sparsity can be leveraged to accelerate attention computation or to compress the KV cache through various approaches. However, most studies address only one of these bottlenecks and do not adequately support dynamic adjustment of sparsity across distinct layers or tasks. In this paper, we present ZipVL, an efficient inference framework for LVLMs that resolves both the computation and memory bottlenecks through a dynamic allocation strategy for the ratio of important tokens. This ratio is adaptively determined from the layer-specific distribution of attention scores, rather than fixed hyper-parameters, thereby improving efficiency on less complex tasks while maintaining high performance on more challenging ones. We then select important tokens based on their normalized attention scores and perform the attention mechanism solely on those tokens to accelerate the prefill phase. To mitigate the memory bottleneck in the decoding phase, we apply mixed-precision quantization to the KV cache, where high-bit quantization is used for the caches of important tokens and low-bit quantization for those of lesser importance. Our experiments demonstrate that ZipVL can accelerate the prefill phase by 2.6× and reduce GPU memory usage by 50.0%, with a minimal accuracy reduction of only 0.2% on the Video-MME benchmark with the LongVA-7B model, effectively enhancing the generation efficiency of LVLMs.
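
The abstract describes three mechanisms: a layer-adaptive ratio of important tokens derived from the attention-score distribution, prefill attention restricted to those tokens, and mixed-precision KV-cache quantization keyed to token importance. The sketch below is a minimal illustration of these steps based only on the abstract; all function names, the threshold `tau`, and the 8-bit/2-bit split are assumptions, not the authors' implementation.

```python
# Minimal sketch of the three ZipVL steps named in the abstract (PyTorch).
# `tau` and the high/low bit-widths are illustrative assumptions.
import torch


def select_important_tokens(attn_scores: torch.Tensor, tau: float = 0.95):
    """Pick the smallest token set whose attention mass reaches `tau`.

    attn_scores: (num_tokens,) attention each token receives, aggregated over
    heads/queries and normalized to sum to 1. Because the cutoff depends on
    the layer's score distribution, the kept ratio adapts per layer instead
    of being a fixed hyper-parameter.
    """
    sorted_scores, order = torch.sort(attn_scores, descending=True)
    cumulative = torch.cumsum(sorted_scores, dim=0)
    k = int(torch.searchsorted(cumulative, torch.tensor(tau)).item()) + 1
    return order[:k], order[k:]  # important, less-important token indices


def sparse_prefill_attention(q, k, v, important):
    """Attend only over important tokens to cut prefill-phase computation."""
    k_s, v_s = k[important], v[important]
    scores = q @ k_s.transpose(-1, -2) / k_s.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_s


def fake_quantize(x: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Uniform asymmetric quantize-dequantize, exposing the precision loss."""
    qmax = 2 ** n_bits - 1
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-8) / qmax
    return torch.clamp(torch.round((x - lo) / scale), 0, qmax) * scale + lo


def mixed_precision_kv(k, v, important, less_important,
                       high_bits: int = 8, low_bits: int = 2):
    """High-bit KV cache for important tokens, low-bit for the rest."""
    for idx, bits in ((important, high_bits), (less_important, low_bits)):
        if idx.numel() == 0:
            continue
        k[idx] = fake_quantize(k[idx], bits)
        v[idx] = fake_quantize(v[idx], bits)
    return k, v
```

In a real LVLM this selection would be rerun per layer from that layer's prefill attention map, so less complex inputs naturally keep fewer tokens, while the same index split drives both the sparse attention and the cache precision.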
