Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
April 1, 2025
Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
cs.AI
Abstract
Visual token reduction lowers inference costs caused by extensive image
features in large vision-language models (LVLMs). Unlike relevant studies that
prune tokens in self-attention-only LVLMs, our work uniquely addresses
cross-attention-based models, which achieve superior performance. We identify
that the key-value (KV) cache size for image tokens in cross-attention layers
significantly exceeds that of text tokens in self-attention layers, posing a
major compute bottleneck. To mitigate this issue, we exploit the sparsity of
cross-attention maps to selectively prune redundant visual features. Our
Trimmed Llama effectively reduces KV cache demands without requiring additional
training. With visual features reduced by 50%, our model lowers inference
latency and memory usage while achieving benchmark parity.
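The abstract describes attention-guided trimming of the cross-attention KV cache but gives no implementation details, so the following is a minimal, hypothetical PyTorch sketch of one way such trimming could look. The function name `trim_visual_kv`, the tensor shapes, the scoring rule (average cross-attention weight over heads and text queries), and the 50% keep ratio are illustrative assumptions, not the paper's exact method.

```python
import torch

def trim_visual_kv(cross_attn, keys, values, keep_ratio=0.5):
    """Illustrative sketch: keep only the image tokens that receive the most
    cross-attention and drop the rest from the cached keys/values.

    cross_attn   : (num_heads, num_text_queries, num_image_tokens) attention weights
    keys, values : (num_heads, num_image_tokens, head_dim) cached image-token KV
    Returns the trimmed keys/values and the indices of the kept image tokens.
    """
    # Score each image token by the attention mass it receives,
    # averaged over heads and text queries.
    importance = cross_attn.mean(dim=(0, 1))            # (num_image_tokens,)
    k = max(1, int(keep_ratio * importance.numel()))
    kept = importance.topk(k).indices.sort().values     # preserve original token order
    return keys[:, kept, :], values[:, kept, :], kept

# Toy usage with random tensors (all shapes are illustrative):
attn = torch.softmax(torch.randn(8, 16, 1600), dim=-1)  # 8 heads, 16 text queries
K = torch.randn(8, 1600, 128)
V = torch.randn(8, 1600, 128)
K_small, V_small, idx = trim_visual_kv(attn, K, V, keep_ratio=0.5)
print(K_small.shape)  # torch.Size([8, 800, 128])
```

A plausible design, though not spelled out in the abstract, is to compute the importance scores once at an early cross-attention layer and reuse the kept indices in the remaining layers, so that only the trimmed keys and values are cached for the rest of generation.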