Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

Abstract

To accelerate inference in heavy Multimodal Large Language Models (MLLMs), this study rethinks the current landscape of training-free token reduction research. We find that the critical components of existing methods are tightly intertwined, which leaves their interconnections and effects unclear when the methods are compared, transferred, or extended. We therefore propose a unified "filter-correlate-compress" paradigm that decomposes token reduction into three distinct pipeline stages, maintaining consistent design objectives and elements while allowing for unique implementations. We additionally demystify popular works and subsume them into our paradigm to showcase its universality. Finally, we offer a suite of methods grounded in the paradigm that balance speed and accuracy across different phases of inference. Experimental results across 10 benchmarks indicate that our methods can achieve up to an 82.4% reduction in FLOPs (floating-point operations) with minimal impact on performance, while surpassing state-of-the-art training-free methods. Our project page is at https://ficoco-accelerate.github.io/.
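
To make the three-stage decomposition concrete, here is a minimal sketch of how a filter-correlate-compress pipeline could be organized. The abstract does not specify what each stage computes, so the roles below are assumptions chosen for illustration: filtering picks the tokens with the lowest importance score, correlating matches each of them to its most similar kept token by cosine similarity, and compressing averages each kept token with the tokens matched to it. All function names, the `scores` input, and the averaging rule are hypothetical and are not taken from the paper.

```python
import math


def filter_stage(scores, num_reduce):
    """Stage 1 (assumed role): choose which tokens to reduce.

    Picks the num_reduce tokens with the lowest (hypothetical) importance score.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    reduce_idx = sorted(order[:num_reduce])
    keep_idx = sorted(order[num_reduce:])
    return reduce_idx, keep_idx


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def correlate_stage(tokens, reduce_idx, keep_idx):
    """Stage 2 (assumed role): map each reduced token to its most similar kept token."""
    return {
        r: max(keep_idx, key=lambda k: cosine(tokens[r], tokens[k]))
        for r in reduce_idx
    }


def compress_stage(tokens, keep_idx, mapping):
    """Stage 3 (assumed role): average each kept token with the tokens mapped to it."""
    groups = {k: [tokens[k]] for k in keep_idx}
    for r, k in mapping.items():
        groups[k].append(tokens[r])
    return [
        [sum(column) / len(groups[k]) for column in zip(*groups[k])]
        for k in keep_idx
    ]


def reduce_tokens(tokens, scores, num_reduce):
    """Run the three stages in order and return the shortened token list."""
    if not 0 <= num_reduce < len(tokens):
        raise ValueError("num_reduce must leave at least one token")
    reduce_idx, keep_idx = filter_stage(scores, num_reduce)
    mapping = correlate_stage(tokens, reduce_idx, keep_idx)
    return compress_stage(tokens, keep_idx, mapping)


if __name__ == "__main__":
    # Toy input: four 2-d tokens with made-up importance scores.
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    scores = [0.9, 0.1, 0.8, 0.2]
    reduced = reduce_tokens(tokens, scores, num_reduce=2)
    print(len(tokens), "tokens in,", len(reduced), "tokens out")
```

Keeping each stage behind its own function means one stage can be swapped (for example, a different scoring rule in `filter_stage`) without touching the other two, which is the kind of comparison and transfer the abstract argues a unified paradigm should allow.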
