Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
April 1, 2025
Authors: Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
cs.AI
Abstract
Visual token reduction lowers inference costs caused by extensive image
features in large vision-language models (LVLMs). Unlike relevant studies that
prune tokens in self-attention-only LVLMs, our work uniquely addresses
cross-attention-based models, which achieve superior performance. We identify
that the key-value (KV) cache size for image tokens in cross-attention layers
significantly exceeds that of text tokens in self-attention layers, posing a
major compute bottleneck. To mitigate this issue, we exploit the sparsity of
cross-attention maps to selectively prune redundant visual features. Our
Trimmed Llama effectively reduces KV cache demands without requiring additional
training. With visual features reduced by 50%, our model lowers inference
latency and memory usage while achieving benchmark parity.
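The abstract describes attention-guided trimming of the cross-attention KV cache but gives no implementation details, so the following is a minimal, hypothetical PyTorch sketch of one way such trimming could look. The function name `trim_visual_kv`, the tensor shapes, the scoring rule (average cross-attention weight over heads and text queries), and the 50% keep ratio are illustrative assumptions, not the paper's exact method.

```python
import torch

def trim_visual_kv(cross_attn, keys, values, keep_ratio=0.5):
    """Illustrative sketch: keep only the image tokens that receive the most
    cross-attention and drop the rest from the cached keys/values.

    cross_attn   : (num_heads, num_text_queries, num_image_tokens) attention weights
    keys, values : (num_heads, num_image_tokens, head_dim) cached image-token KV
    Returns the trimmed keys/values and the indices of the kept image tokens.
    """
    # Score each image token by the attention mass it receives,
    # averaged over heads and text queries.
    importance = cross_attn.mean(dim=(0, 1))            # (num_image_tokens,)
    k = max(1, int(keep_ratio * importance.numel()))
    kept = importance.topk(k).indices.sort().values     # preserve original token order
    return keys[:, kept, :], values[:, kept, :], kept

# Toy usage with random tensors (all shapes are illustrative):
attn = torch.softmax(torch.randn(8, 16, 1600), dim=-1)  # 8 heads, 16 text queries
K = torch.randn(8, 1600, 128)
V = torch.randn(8, 1600, 128)
K_small, V_small, idx = trim_visual_kv(attn, K, V, keep_ratio=0.5)
print(K_small.shape)  # torch.Size([8, 800, 128])
```

A plausible design, though not spelled out in the abstract, is to compute the importance scores once at an early cross-attention layer and reuse the kept indices in the remaining layers, so that only the trimmed keys and values are cached for the rest of generation.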