Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

December 17, 2024
Authors: Mark Endo, Xiaohan Wang, Serena Yeung-Levy
cs.AI

Abstract

Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities. Namely, we demonstrate a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, this issue is only reflected in performance for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained with the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages to allow the criteria to become more effective at a later layer while still achieving significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER achieves more than a 5× performance improvement on vision-centric localization benchmarks compared to the original acceleration approach.
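
To make the three components of the abstract concrete, below is a minimal, hypothetical sketch of a single token-selection stage in PyTorch. It is not the authors' released implementation: the function name `feather_select`, its arguments, and the layer/ratio values in the usage comments are illustrative assumptions, and the importance score is only assumed to be something like the attention a visual token receives from the text tokens.

```python
import torch

def feather_select(visual_scores: torch.Tensor, keep_ratio: float,
                   uniform_frac: float = 0.5) -> torch.Tensor:
    """Choose which visual tokens to keep at one pruning stage.

    visual_scores: (N,) importance score per visual token, assumed to be in
        row-major order over the image grid (e.g. attention from text tokens).
    keep_ratio:    fraction of the N tokens to keep at this stage.
    uniform_frac:  fraction of the kept tokens chosen by uniform sampling
                   rather than by the score criterion.
    Returns a boolean keep-mask over the N visual tokens.
    """
    n = visual_scores.numel()
    n_keep = max(1, int(keep_ratio * n))
    n_uniform = int(uniform_frac * n_keep)
    n_scored = n_keep - n_uniform

    keep = torch.zeros(n, dtype=torch.bool)

    # (2) Uniform sampling: a regular stride over the row-major grid keeps
    # tokens from every image region, so a criterion biased against the top
    # of the image cannot wipe those rows out entirely.
    if n_uniform > 0:
        stride = max(1, n // n_uniform)
        keep[torch.arange(0, n, stride)[:n_uniform]] = True

    # (1) Criterion-based selection: keep the highest-scoring remaining tokens.
    scores = visual_scores.clone()
    scores[keep] = float("-inf")   # already kept via uniform sampling
    if n_scored > 0:
        keep[scores.topk(n_scored).indices] = True
    return keep


# (3) Two-stage schedule (hypothetical layer indices and ratios): prune
# moderately after an early layer for speedup, then prune again at a later
# layer, where the importance criterion is more reliable.
scores_early = torch.rand(576)               # stand-in for real importance scores
mask_stage1 = feather_select(scores_early, keep_ratio=0.5)

scores_late = torch.rand(int(mask_stage1.sum()))  # recomputed on surviving tokens
mask_stage2 = feather_select(scores_late, keep_ratio=0.4)
```

The split between criterion-based and uniformly sampled tokens (`uniform_frac`) is the "ensemble criteria" idea in spirit: the sketch simply unions the two selections, which is one plausible way to realize it under the stated assumptions.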
