

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

April 1, 2025
Authors: Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer Contribution (LC), which quantifies the impact of a layer's transformations on visual and text tokens, respectively. The calculation of LC involves measuring the divergence in model output that results from removing the layer's transformations on the specified tokens. Our pilot experiment reveals that many layers of MLLMs contribute minimally during the processing of visual tokens. Motivated by this observation, we propose ShortV, a training-free method that leverages LC to identify ineffective layers and freezes visual token updates in these layers. Experiments show that ShortV can freeze visual tokens in approximately 60% of the MLLM layers, thereby dramatically reducing the computational costs of updating visual tokens. For example, it achieves a 50% reduction in FLOPs on LLaVA-NeXT-13B while maintaining superior performance. The code will be publicly available at https://github.com/icip-cas/ShortV.
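The abstract defines LC as the output divergence caused by removing a layer's transformation on a chosen token set. Below is a minimal sketch of how such a probe could look, assuming KL divergence over the output logits as the divergence measure and a hypothetical `skip_layer`/`skip_mask` hook in the model's forward pass; neither the measure nor the hook names are confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def layer_contribution(model, inputs, layer_idx, token_mask):
    """Estimate LC for one layer: divergence between the full model's
    output distribution and the output obtained when this layer's
    transformation is skipped for the tokens in `token_mask`."""
    with torch.no_grad():
        full_logits = model(inputs)
        # `skip_layer` / `skip_mask` are hypothetical hooks standing in
        # for whatever mechanism bypasses the layer's update on the
        # masked (e.g., visual) tokens.
        ablated_logits = model(inputs, skip_layer=layer_idx, skip_mask=token_mask)
    log_p = F.log_softmax(full_logits, dim=-1)
    log_q = F.log_softmax(ablated_logits, dim=-1)
    # KL(full || ablated): small values mark the layer as ineffective
    # for this token type.
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean").item()
```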
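ShortV then freezes visual-token updates in the layers with the lowest LC. The runnable toy below illustrates the freezing semantics only: it restores pre-layer hidden states for visual tokens rather than excluding them from the layer's computation, so unlike ShortV it saves no FLOPs. All names (`FrozenVisualLayer`, `visual_mask`, the chosen layer indices) are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn

class FrozenVisualLayer(nn.Module):
    """Wrap a transformer layer so visual tokens bypass its update."""
    def __init__(self, layer: nn.Module, freeze_visual: bool):
        super().__init__()
        self.layer = layer
        self.freeze_visual = freeze_visual

    def forward(self, hidden, visual_mask):
        out = self.layer(hidden)
        if self.freeze_visual:
            # Visual positions keep their pre-layer states; text tokens
            # still receive the full transformation.
            out = torch.where(visual_mask.unsqueeze(-1), hidden, out)
        return out

# Toy stack: layers {1, 3} treated as ineffective for visual tokens.
d_model, seq_len = 64, 10
layers = [nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
          for _ in range(4)]
ineffective = {1, 3}  # in ShortV these would be chosen by ranking LC
stack = [FrozenVisualLayer(l, i in ineffective) for i, l in enumerate(layers)]

hidden = torch.randn(1, seq_len, d_model)
visual_mask = torch.zeros(1, seq_len, dtype=torch.bool)
visual_mask[:, :6] = True  # pretend the first 6 tokens are visual
for block in stack:
    hidden = block(hidden, visual_mask)
```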

