SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

December 16, 2024
Authors: Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang
cs.AI

Abstract

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless special tokens (i.e., separators) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM's effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
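To make the compression idea concrete, below is a minimal sketch of a SepLLM-style KV-cache retention policy: keep a few initial tokens, keep every separator token (which is assumed to summarize the segment that precedes it), keep a recent window, and drop the remaining intra-segment tokens. The specific separator token ids, window size, and number of initial tokens here are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of a SepLLM-style KV-cache retention policy.
# Which tokens count as separators, the recent-window size, and the number of
# initial tokens kept are illustrative assumptions.

SEPARATOR_IDS = {13, 11, 30}  # e.g. ids of "\n", ",", "?" in some tokenizer (assumed)

def kv_keep_mask(token_ids, num_initial=4, recent_window=256):
    """Return a boolean mask: True for positions whose KV entries are retained.

    Retains (1) a few initial tokens, (2) every separator token, which stands
    in for the segment preceding it, and (3) a window of the most recent
    tokens. All other intra-segment tokens are evicted from the KV cache.
    """
    n = len(token_ids)
    keep = [False] * n
    for i, tok in enumerate(token_ids):
        if i < num_initial:              # initial tokens
            keep[i] = True
        elif tok in SEPARATOR_IDS:       # separators carry compressed segment info
            keep[i] = True
        elif i >= n - recent_window:     # most recent tokens
            keep[i] = True
    return keep

# Example usage: retain only the selected KV entries.
# keep = kv_keep_mask(ids)
# retained_kv = [kv[i] for i in range(len(ids)) if keep[i]]
```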
