
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

December 17, 2024
Authors: Mark Endo, Xiaohan Wang, Serena Yeung-Levy
cs.AI

Abstract

Recent works on accelerating Vision-Language Models show that strong performance can be maintained across a variety of vision-language tasks despite highly compressing visual information. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities. Namely, we demonstrate a core issue with the acceleration approach where most tokens towards the top of the image are pruned away. Yet, this issue is only reflected in performance for a small subset of tasks such as localization. For the other evaluated tasks, strong performance is maintained with the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages to allow the criteria to become more effective at a later layer while still achieving significant speedup through early-layer pruning. With comparable computational savings, we find that FEATHER has more than a 5× performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
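
The abstract describes FEATHER as an ensemble of criteria applied in two pruning stages. Below is a minimal sketch, not the authors' released implementation, of how an attention-based importance criterion could be combined with uniform grid sampling at a single pruning stage; the function name, the 24×24 token grid, and the keep ratios are illustrative assumptions.

```python
import torch


def select_visual_tokens(attn_scores, keep_ratio, grid_size):
    """Sketch of one pruning stage: keep the union of attention-selected
    tokens and a uniform sample over the image grid.

    attn_scores: (num_visual_tokens,) importance score per visual token,
                 e.g., attention received from text tokens at this layer.
    keep_ratio:  target fraction of visual tokens to retain.
    grid_size:   (rows, cols) layout of the visual tokens in the image.
    """
    num_tokens = attn_scores.shape[0]
    num_keep = max(1, int(num_tokens * keep_ratio))

    # Criterion 1: top-k tokens by the attention-based importance score.
    topk_idx = torch.topk(attn_scores, k=num_keep).indices

    # Criterion 2: uniform sampling over the 2D grid, so every image
    # region (including the top rows) keeps some representation.
    rows, cols = grid_size
    stride = max(1, int(round((num_tokens / num_keep) ** 0.5)))
    grid = torch.arange(num_tokens).reshape(rows, cols)
    uniform_idx = grid[::stride, ::stride].reshape(-1)

    # Ensemble of criteria: retain the union of both selections.
    return torch.unique(torch.cat([topk_idx, uniform_idx]))


# Example: 576 visual tokens on a 24x24 grid, keeping roughly a third here.
scores = torch.rand(576)
kept = select_visual_tokens(scores, keep_ratio=1 / 3, grid_size=(24, 24))
```

In a two-stage setup as described in the abstract, a selection like this would run once at an early layer with a mild keep ratio to gain speedup, and again at a later layer, where the attention-based criterion is more reliable, with a stronger ratio.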

