Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long-context inputs, but this comes at the cost of increased computational resources and latency. We introduce a novel approach that addresses the long-context bottleneck by accelerating LLM inference and reducing GPU memory consumption. We show that LLMs can identify the relevant tokens in their early layers, before generating an answer to a query. Leveraging this insight, we propose an algorithm that uses the early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4× speedup and a 30% reduction in GPU memory usage compared to state-of-the-art (SOTA) methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention and SnapKV, and it demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference. Our code is available at https://github.com/SalesforceAIResearch/GemFilter.
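
To make the filtering idea above concrete, here is a minimal toy sketch in plain Python. It is not the GemFilter implementation: the abstract does not spell out the selection rule, so this sketch assumes that tokens are ranked by the attention weight they receive from a single query vector at one "filter" layer, and that the top k are kept in their original order. The function names (`softmax`, `select_tokens`), the 2-dimensional key vectors, and the token list are all hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_tokens(query, keys, k):
    """Return the indices of the k keys that receive the most attention
    from `query`, sorted back into their original sequence order.

    Assumption: scaled dot-product attention from one query vector stands
    in for whatever signal the real filter layer provides.
    """
    d = len(query)
    scores = [sum(q * x for q, x in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    top = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)[:k]
    return sorted(top)  # keep the selected tokens in input order


# Hypothetical toy input: ten tokens with made-up key vectors.
tokens = ["the", "sky", "is", "blue", "needle", "=", "42", "and", "so", "on"]
keys = [
    [0.1, 0.0], [0.0, 0.2], [0.1, 0.1], [0.2, 0.0], [3.0, 2.5],
    [0.5, 0.4], [2.8, 3.1], [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
]
query = [1.0, 1.0]  # stand-in for the query representation

kept = select_tokens(query, keys, k=3)
compressed = [tokens[i] for i in kept]
print(kept, compressed)
```

In this toy setting the compressed sequence is a short, human-readable list of tokens, which is the sense in which inspecting the selected input can aid interpretability. For the actual selection procedure, refer to the paper and the repository linked above.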
