Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

Abstract

Recent work on accelerating Vision-Language Models shows that strong performance can be maintained across a variety of vision-language tasks even when visual information is heavily compressed. In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is due not to an exceptional ability to compress visual information, but to the benchmarks' limited ability to assess fine-grained visual capabilities. Specifically, we demonstrate a core issue with the acceleration approach: most tokens towards the top of the image are pruned away. Yet this issue is reflected in performance only for a small subset of tasks, such as localization. For the other evaluated tasks, strong performance is maintained despite the flawed pruning strategy. Noting the limited visual capabilities of the studied acceleration technique, we propose FEATHER (Fast and Effective Acceleration wiTH Ensemble cRiteria), a straightforward approach that (1) resolves the identified issue with early-layer pruning, (2) incorporates uniform sampling to ensure coverage across all image regions, and (3) applies pruning in two stages, so that the criteria become more effective at a later layer while significant speedup is still achieved through early-layer pruning. With comparable computational savings, FEATHER achieves more than a 5× performance improvement on the vision-centric localization benchmarks compared to the original acceleration approach.
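The idea of combining a score-based pruning criterion with uniform sampling can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `select_visual_tokens`, the `keep_ratio` and `stride` parameters, and the assumption that one importance score per visual token is already available are all hypothetical and chosen for illustration only.

```python
def select_visual_tokens(scores, keep_ratio=0.25, stride=4):
    """Return sorted indices of visual tokens to keep.

    Hypothetical sketch: `scores` holds one importance score per visual
    token (for example, derived from attention); how the scores are
    computed is outside the scope of this example.
    """
    n = len(scores)
    # Uniform sampling: keep every `stride`-th token so that every image
    # region retains some coverage regardless of its score.
    uniform = set(range(0, n, stride))
    # Criterion-based selection: keep the k highest-scoring tokens.
    k = max(1, int(n * keep_ratio))
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    # Ensemble: union of both selections, in original token order.
    return sorted(uniform | set(top))


# Scores that grow with the token index mimic a criterion biased
# towards the bottom of the image (tokens are in raster order).
biased_scores = list(range(16))
kept = select_visual_tokens(biased_scores, keep_ratio=0.25, stride=4)
print(kept)
```

With the biased scores above, the top-k criterion alone would keep only the last four tokens; the uniform component adds tokens 0, 4, and 8, so early (top-of-image) positions are still represented. A two-stage scheme in this spirit would apply such a selection once at an early layer and prune again at a later layer.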
