Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
November 26, 2024
Authors: Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
cs.AI
Abstract
To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. Regrettably, we find that the critical components of existing methods are tightly intertwined, and their interconnections and effects remain unclear, hindering comparison, transfer, and extension. We therefore propose a unified "filter-correlate-compress" paradigm that decomposes token reduction into three distinct pipeline stages, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with minimal impact on performance, while surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.
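To make the three-stage decomposition concrete, the sketch below shows one possible shape such a pipeline could take. It is a minimal illustration, not the paper's method: the function name `reduce_tokens`, the `keep_ratio` parameter, the attention-derived score input, the cosine-similarity correlate step, and the averaging compress step are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def reduce_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical filter-correlate-compress pipeline for visual token reduction.

    tokens: (N, D) visual token embeddings.
    scores: (N,) importance scores (e.g., attention-derived); assumed to be given.
    """
    n_tokens = tokens.size(0)
    n_keep = max(1, int(n_tokens * keep_ratio))

    # Stage 1 -- Filter: keep the highest-scoring tokens, mark the rest as discarded.
    keep_idx = scores.topk(n_keep).indices
    drop_mask = torch.ones(n_tokens, dtype=torch.bool, device=tokens.device)
    drop_mask[keep_idx] = False
    kept, dropped = tokens[keep_idx], tokens[drop_mask]

    # Stage 2 -- Correlate: assign each discarded token to its most similar kept token
    # (cosine similarity is an assumption; the paper's stages allow other choices).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T  # (N_drop, n_keep)
    assign = sim.argmax(dim=-1)                                       # target kept-token index

    # Stage 3 -- Compress: merge discarded-token information into the kept tokens
    # (a plain average here; real methods may weight or otherwise refine this step).
    merged = kept.clone()
    counts = torch.ones(n_keep, 1, dtype=tokens.dtype, device=tokens.device)
    merged.index_add_(0, assign, dropped)
    counts.index_add_(0, assign, torch.ones(dropped.size(0), 1, dtype=tokens.dtype, device=tokens.device))
    return merged / counts


# Example: reduce 576 visual tokens to half their number.
if __name__ == "__main__":
    x = torch.randn(576, 1024)
    s = torch.rand(576)
    print(reduce_tokens(x, s, keep_ratio=0.5).shape)  # torch.Size([288, 1024])
```

The value of the decomposition is that each stage can be swapped independently: a different filter score, correlation measure, or compression rule yields a different point on the speed-accuracy trade-off without changing the overall pipeline.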