SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

December 16, 2024
Authors: Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang
cs.AI

Abstract

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges to computational demands and inference speed, chiefly due to the quadratic complexity of attention. In this work, we identify a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared with semantically meaningful tokens. This observation suggests that the information in the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over a 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of 4 million tokens or more while maintaining consistent language modeling capabilities.
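
To make the KV-cache-reduction claim concrete, below is a minimal, illustrative sketch of the training-free idea suggested by the abstract: a retention mask over past positions that keeps separator tokens (where segment information is condensed) while dropping other tokens. The initial "sink" tokens and the local recent window, along with the function name `keep_mask`, the separator token IDs, and the `n_initial`/`n_local` defaults, are assumptions added for illustration, not the paper's actual implementation or API.

```python
# Sketch (not SepLLM's real code): decide which past KV entries to retain.
# A position is kept if it is (a) one of the first `n_initial` tokens
# (assumed attention-sink tokens), (b) a separator token, into which the
# abstract says segment information is condensed, or (c) within the last
# `n_local` positions (an assumed recent-context window).
from typing import List, Set

def keep_mask(token_ids: List[int],
              separator_ids: Set[int],
              n_initial: int = 4,
              n_local: int = 64) -> List[bool]:
    """Return True at positions whose KV-cache entries are retained."""
    n = len(token_ids)
    mask = [False] * n
    for i, tok in enumerate(token_ids):
        if i < n_initial:            # initial tokens (assumed sinks)
            mask[i] = True
        elif tok in separator_ids:   # separators carry segment info
            mask[i] = True
        elif i >= n - n_local:       # recent local window
            mask[i] = True
    return mask

if __name__ == "__main__":
    import random
    random.seed(0)
    # Hypothetical vocabulary: IDs 13 and 11 stand in for "." and ",".
    seq = [random.randrange(1000) for _ in range(512)]
    kept = keep_mask(seq, separator_ids={13, 11})
    print(f"kept {sum(kept)} of {len(seq)} KV entries")
```

Under these assumptions, the retained cache grows roughly with the number of separators plus two constants (`n_initial` and `n_local`) rather than with the full sequence length, which is the intuition behind both the reported KV-cache reduction and the ability to stream over millions of tokens.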
