
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

October 22, 2024
Authors: Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, Dahua Lin
cs.AI

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, thereby severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information, ultimately diminishing model performance. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and that token redundancy progressively increases in the deeper layers of the model. Motivated by this, we propose PyramidDrop, a visual redundancy reduction strategy for LVLMs that boosts their efficiency in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop a portion of the image tokens at the end of each stage according to a pre-defined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop accelerates LLaVA-NeXT, reducing training time by 40% and inference FLOPs by 55%, with comparable performance. Moreover, PyramidDrop can also serve as a plug-and-play strategy for training-free inference acceleration, achieving better performance and lower inference cost than counterpart methods. We hope that the insights and approach introduced by PyramidDrop will inspire future research to further investigate the role of image tokens in LVLMs.
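The abstract outlines the core mechanism: split the model into stages and, at each stage boundary, rank image tokens with a lightweight similarity score and keep only a fixed fraction. Below is a minimal PyTorch-style sketch of that idea, assuming the similarity is a single dot product between each image token's hidden state and the last instruction token's hidden state; the function names, the `stages` partition, and the fixed `keep_ratio` are illustrative assumptions, not the authors' released implementation.

```python
import torch

def pyramiddrop_keep_tokens(image_hidden, query_hidden, keep_ratio):
    """Rank image tokens by similarity to a query vector and keep the top fraction.

    image_hidden: (num_image_tokens, dim) hidden states at a stage boundary
    query_hidden: (dim,) ranking query, assumed here to be the last
        instruction token's hidden state
    keep_ratio:   fraction of image tokens retained for the next stage
    """
    # Lightweight similarity: one dot product per image token.
    scores = image_hidden @ query_hidden                    # (num_image_tokens,)
    num_keep = max(1, int(image_hidden.size(0) * keep_ratio))
    # Keep the highest-scoring tokens, preserving their original order.
    keep_idx = scores.topk(num_keep).indices.sort().values
    return image_hidden[keep_idx], keep_idx

def forward_with_pyramid(stages, image_hidden, query_hidden, keep_ratio=0.5):
    """Run the model stage by stage, dropping image tokens between stages.

    `stages` is a list of callables, each applying one contiguous block of
    transformer layers; the shrinking token count across stages forms the
    pyramid of visual tokens described in the paper.
    """
    for stage in stages:
        image_hidden = stage(image_hidden)
        image_hidden, _ = pyramiddrop_keep_tokens(
            image_hidden, query_hidden, keep_ratio
        )
    return image_hidden
```

Under these assumed settings, four stages with `keep_ratio=0.5` leave only 1/16 of the image tokens in the final stage, which is where the quadratic attention cost, and hence the training and inference savings, would concentrate.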
