PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "a picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably lose crucial image information, ultimately diminishing model performance. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and that token redundancy progressively increases in the deeper layers of the model. Based on this finding, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs that boosts their efficiency in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop can reduce the training time of LLaVA-NeXT by 40% and its inference FLOPs by 55% with comparable performance. In addition, PyramidDrop can serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than its counterparts. We hope that the insights and approach introduced by PyramidDrop will inspire future research to further investigate the role of image tokens in LVLMs.
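The stage-wise dropping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 576-token image, the four stages, the keep ratio of 0.5, and the use of a dot product against a single query vector as the "lightweight similarity calculation" are all assumptions made for illustration.

```python
import math


def pyramid_schedule(num_image_tokens, num_stages, keep_ratio):
    """Image-token count entering each stage when a fixed fraction is kept per stage."""
    counts = []
    n = num_image_tokens
    for _ in range(num_stages):
        counts.append(n)
        # At the end of the stage, keep only `keep_ratio` of the image tokens.
        n = max(1, math.floor(n * keep_ratio))
    return counts


def drop_tokens(image_tokens, query, keep_ratio):
    """Keep the image tokens most similar to `query`, preserving their order.

    Assumption: similarity is a plain dot product with one query vector
    (for example, a text-token hidden state). The abstract only says the
    ranking uses a lightweight similarity calculation.
    """
    scores = [sum(q * t for q, t in zip(query, tok)) for tok in image_tokens]
    num_keep = max(1, math.floor(len(image_tokens) * keep_ratio))
    # Pick the top-scoring indices, then restore positional order.
    top = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)[:num_keep]
    keep = sorted(top)
    return [image_tokens[i] for i in keep], keep


def relative_image_token_load(counts):
    """Average image-token count over equal-length stages, relative to no dropping."""
    return sum(counts) / (len(counts) * counts[0])


# Hypothetical configuration: 576 image tokens, 4 stages, keep half per stage.
counts = pyramid_schedule(576, 4, 0.5)
print(counts)                             # [576, 288, 144, 72]
print(relative_image_token_load(counts))  # 0.46875

# Toy example with 2-dimensional tokens and a query along the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 0.5]]
kept, kept_idx = drop_tokens(tokens, [1.0, 0.0], 0.5)
print(kept_idx)                           # [0, 2]
```

The shrinking counts form the "pyramid" across layers. The relative load printed above is only a token-count proxy under these assumed settings, not a FLOPs measurement, and it is not meant to reproduce the figures reported in the abstract.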
