Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration
November 26, 2024
Authors: Yuhang Han, Xuyang Liu, Pengxiang Ding, Donglin Wang, Honggang Chen, Qingsen Yan, Siteng Huang
cs.AI
Abstract
To accelerate the inference of heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. Regrettably, we find that the critical components of existing methods are tightly intertwined, and their interconnections and effects remain unclear, hindering comparison, transfer, and extension. We therefore propose a unified "filter-correlate-compress" paradigm that decomposes token reduction into three distinct pipeline stages, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm, striking a balance between speed and accuracy throughout different phases of inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs with minimal impact on performance, while surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.
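To make the three-stage decomposition concrete, the sketch below shows one possible shape such a pipeline could take. It is a minimal illustration, not the paper's method: the function name `reduce_tokens`, the `keep_ratio` parameter, the attention-derived score input, the cosine-similarity correlate step, and the averaging compress step are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def reduce_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Hypothetical filter-correlate-compress pipeline for visual token reduction.

    tokens: (N, D) visual token embeddings.
    scores: (N,) importance scores (e.g., attention-derived); assumed to be given.
    """
    n_tokens = tokens.size(0)
    n_keep = max(1, int(n_tokens * keep_ratio))

    # Stage 1 -- Filter: keep the highest-scoring tokens, mark the rest as discarded.
    keep_idx = scores.topk(n_keep).indices
    drop_mask = torch.ones(n_tokens, dtype=torch.bool, device=tokens.device)
    drop_mask[keep_idx] = False
    kept, dropped = tokens[keep_idx], tokens[drop_mask]

    # Stage 2 -- Correlate: assign each discarded token to its most similar kept token
    # (cosine similarity is an assumption; the paper's stages allow other choices).
    sim = F.normalize(dropped, dim=-1) @ F.normalize(kept, dim=-1).T  # (N_drop, n_keep)
    assign = sim.argmax(dim=-1)                                       # target kept-token index

    # Stage 3 -- Compress: merge discarded-token information into the kept tokens
    # (a plain average here; real methods may weight or otherwise refine this step).
    merged = kept.clone()
    counts = torch.ones(n_keep, 1, dtype=tokens.dtype, device=tokens.device)
    merged.index_add_(0, assign, dropped)
    counts.index_add_(0, assign, torch.ones(dropped.size(0), 1, dtype=tokens.dtype, device=tokens.device))
    return merged / counts


# Example: reduce 576 visual tokens to half their number.
if __name__ == "__main__":
    x = torch.randn(576, 1024)
    s = torch.rand(576)
    print(reduce_tokens(x, s, keep_ratio=0.5).shape)  # torch.Size([288, 1024])
```

The value of the decomposition is that each stage can be swapped independently: a different filter score, correlation measure, or compression rule yields a different point on the speed-accuracy trade-off without changing the overall pipeline.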