AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Abstract

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on a large number of visual tokens from visual encoders, which leads to high computational demands and limits their applicability in resource-constrained environments and long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with minimal performance drop. Our method consists of (a) iterative token merging based on embedding similarity before the LLM, and (b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computational load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. Further, under a similar computational cost, our method outperforms state-of-the-art methods in long video understanding (e.g., +4.6 on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research on designing efficient multi-modal LLMs. Our code will be available at https://github.com/LaVi-Lab/AIM.
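
The two components named in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the abstract does not specify the merging schedule or how multi-modal importance is computed, so the greedy pairwise merging rule, the size-weighted averaging, the caller-supplied importance scores, and the function names `merge_tokens` and `prune_tokens` are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, target):
    """Hypothetical merging step: repeatedly average the most similar pair
    of token embeddings until only `target` tokens remain."""
    toks = [list(t) for t in tokens]
    sizes = [1] * len(toks)  # number of original tokens each entry represents
    while len(toks) > max(target, 1):
        best = None
        for i in range(len(toks)):
            for j in range(i + 1, len(toks)):
                s = cosine(toks[i], toks[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        total = sizes[i] + sizes[j]
        # Size-weighted average keeps earlier merges from being diluted.
        toks[i] = [(x * sizes[i] + y * sizes[j]) / total
                   for x, y in zip(toks[i], toks[j])]
        sizes[i] = total
        del toks[j]
        del sizes[j]
    return toks


def prune_tokens(tokens, scores, keep_ratio):
    """Hypothetical pruning step: keep the highest-scoring fraction of tokens,
    preserving their original order. `scores` stands in for whatever
    importance measure the model provides; it is an assumption here."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


if __name__ == "__main__":
    visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
    merged = merge_tokens(visual, target=2)
    print(merged)
    print(prune_tokens(merged, scores=[0.2, 0.7], keep_ratio=0.5))
```

In this sketch, merging runs once on the visual embeddings before they reach the LLM, while pruning would be called repeatedly at successive LLM layers with a shrinking `keep_ratio`. The exhaustive pair search is quadratic per merge and is chosen for readability, not efficiency.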
