Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration
December 17, 2024
Authors: Mark Endo, Xiaohan Wang, Serena Yeung-Levy
cs.AI
Abstract
Recent works on accelerating Vision-Language Models show that strong
performance can be maintained across a variety of vision-language tasks despite
highly compressing visual information. In this work, we examine the popular
acceleration approach of early pruning of visual tokens inside the language
model and find that its strong performance across many tasks is not due to an
exceptional ability to compress visual information, but rather the benchmarks'
limited ability to assess fine-grained visual capabilities. Namely, we
demonstrate a core issue with the acceleration approach where most tokens
towards the top of the image are pruned away. Yet, this issue is only reflected
in performance for a small subset of tasks such as localization. For the other
evaluated tasks, strong performance is maintained with the flawed pruning
strategy. Noting the limited visual capabilities of the studied acceleration
technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble
cRiteria), a straightforward approach that (1) resolves the identified issue
with early-layer pruning, (2) incorporates uniform sampling to ensure coverage
across all image regions, and (3) applies pruning in two stages to allow the
criteria to become more effective at a later layer while still achieving
significant speedup through early-layer pruning. With comparable computational
savings, we find that FEATHER achieves a more than 5× performance improvement
on the vision-centric localization benchmarks compared to the original
acceleration approach.
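
The abstract names uniform sampling and two-stage pruning but does not say how they are combined, so the following is only a rough sketch under stated assumptions, not the authors' implementation. It assumes each visual token already has a per-token importance score from some pruning criterion, that the early stage keeps the union of every `stride`-th token and the highest-scoring remaining tokens, and that the later stage re-ranks the survivors by a fresh score. The function names, parameters, and score values are all hypothetical.

```python
def ensemble_keep_indices(scores, num_by_score, stride):
    """Return sorted indices of the visual tokens to keep (illustrative only).

    scores       -- one importance score per visual token (assumed given)
    num_by_score -- number of extra tokens kept for having the highest scores
    stride       -- keep every `stride`-th token so all image regions are covered
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    n = len(scores)
    # Uniform sampling: independent of the scores, so no region is dropped entirely.
    uniform = set(range(0, n, stride))
    # Criterion-based selection over the rest; ties go to the lower index.
    remaining = [i for i in range(n) if i not in uniform]
    remaining.sort(key=lambda i: (-scores[i], i))
    by_score = remaining[:max(num_by_score, 0)]
    return sorted(uniform | set(by_score))


def two_stage_prune(scores_early, scores_late, num_by_score, stride, keep_late):
    """Prune at an early layer, then prune the survivors again at a later layer.

    scores_late is indexed by original token position (an assumption of this sketch).
    """
    survivors = ensemble_keep_indices(scores_early, num_by_score, stride)
    # Second stage: rank only the surviving tokens by their later-layer score.
    ranked = sorted(survivors, key=lambda i: (-scores_late[i], i))
    return sorted(ranked[:max(keep_late, 0)])


# Toy example: 16 tokens in raster order, with early scores that favor the
# second half (the bottom of the image), mimicking the bias the abstract describes.
early = [0.0] * 8 + [1.0] * 8
late = [0.9, 0.0, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0,
        0.2, 0.8, 0.3, 0.1, 0.7, 0.0, 0.0, 0.0]

stage_one = ensemble_keep_indices(early, num_by_score=4, stride=4)
stage_two = two_stage_prune(early, late, num_by_score=4, stride=4, keep_late=3)
print(stage_one)
print(stage_two)
```

In this toy setup, the score-only part of stage one selects nothing from the first half of the sequence, while the uniform samples still retain tokens 0 and 4 from it. That is the coverage property the uniform-sampling component is meant to provide.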