Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long-context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach to the long-context bottleneck that accelerates LLM inference and reduces GPU memory consumption. We demonstrate that LLMs can identify relevant tokens in their early layers, before generating answers to a query. Leveraging this insight, we propose an algorithm that uses the early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4× speedup and a 30% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention and SnapKV, and it demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at https://github.com/SalesforceAIResearch/GemFilter.
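The filtering idea in the abstract can be illustrated with a toy sketch. The abstract does not spell out the selection rule, so this example makes several assumptions: tokens are ranked by the scaled dot-product attention score between a single query vector (standing in for the last input position at an early layer) and each token's key vector, the top k are kept, and they are restored to their original order. The function name `select_tokens` and all vectors below are hypothetical, chosen only for illustration, and are not taken from the GemFilter code.

```python
import math


def select_tokens(query, keys, k):
    """Return the indices of the k keys that score highest against the query.

    Assumption: ranking uses the scaled dot-product attention score.
    Softmax is strictly monotonic, so ranking by raw scores gives the
    same top-k set as ranking by attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, key)) / scale for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    # Restore the original token order so the compressed input stays readable.
    return sorted(top)


# Toy context: a short "haystack" containing the "needle" tokens.
tokens = ["the", "sky", "is", "blue", "needle", "=", "42", "and", "so", "on"]

# Hypothetical 2-d key vectors as they might look at an early filter layer.
keys = [
    [0.1, 0.0], [0.0, 0.3], [0.2, 0.0], [0.2, 0.2], [1.0, 0.9],
    [0.8, 0.7], [1.0, 1.0], [0.0, 0.05], [0.0, 0.0], [0.15, 0.0],
]

# Hypothetical query vector for the last input position.
query = [1.0, 1.0]

selected = select_tokens(query, keys, k=3)
compressed = [tokens[i] for i in selected]
print(compressed)  # ['needle', '=', '42']
```

In this sketch the shortened token list is what would be passed to the full model for generation, and printing it is the kind of inspection of the selected input that the abstract refers to as interpretability.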
