
Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

November 26, 2024
Authors: Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
cs.AI

Abstract

To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We regret to find that the critical components of existing methods are tightly intertwined, with their interconnections and effects remaining unclear for comparison, transfer, and expansion. Therefore, we propose a unified ''filter-correlate-compress'' paradigm that decomposes the token reduction into three distinct stages within a pipeline, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify the popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of the inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with a minimal impact on performance, simultaneously surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.
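To make the paradigm concrete, the sketch below is a minimal PyTorch illustration of a generic filter-correlate-compress pass over visual tokens. It assumes a precomputed per-token importance score (e.g., attention received from a text or [CLS] token); the function name, the keep_ratio parameter, the cosine-similarity matching, and the averaging-based merge are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def filter_correlate_compress(tokens, scores, keep_ratio=0.5):
    """Illustrative three-stage token reduction (a sketch, not the paper's exact method).

    tokens: (N, D) visual token embeddings
    scores: (N,) per-token importance scores (assumed to be given)
    keep_ratio: fraction of tokens retained after reduction
    """
    n_keep = max(1, int(tokens.size(0) * keep_ratio))

    # 1) Filter: pick the tokens to retain according to the importance scores.
    keep_idx = scores.topk(n_keep).indices
    drop_mask = torch.ones(tokens.size(0), dtype=torch.bool, device=tokens.device)
    drop_mask[keep_idx] = False
    kept, dropped = tokens[keep_idx], tokens[drop_mask]

    # 2) Correlate: match each dropped token to its most similar kept token
    #    via cosine similarity.
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T  # (N_drop, N_keep)
    assign = sim.argmax(dim=-1)                                       # (N_drop,)

    # 3) Compress: merge every dropped token into its assigned kept token
    #    by simple averaging.
    merged = kept.clone()
    counts = torch.ones(n_keep, device=tokens.device)
    merged.index_add_(0, assign, dropped)
    counts.index_add_(0, assign, torch.ones(assign.size(0), device=tokens.device))
    return merged / counts.unsqueeze(-1)


if __name__ == "__main__":
    toks = torch.randn(576, 1024)   # e.g., a LLaVA-style grid of 576 visual tokens
    imp = torch.rand(576)           # placeholder importance scores
    print(filter_correlate_compress(toks, imp, keep_ratio=0.25).shape)  # (144, 1024)
```

In this decomposition, swapping out any one stage (a different filtering score, a different matching rule, or a different merge operator) yields a new method without touching the other two stages, which is the comparability and transferability the unified paradigm is meant to provide.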
