

From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens

February 26, 2025
作者: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng
cs.AI

Abstract

Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.
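The abstract builds on speculative decoding, in which a cheap drafter proposes several tokens and the target model verifies them, falling back to its own token on the first mismatch. The sketch below is a minimal illustration of that draft-and-verify pattern only, not TokenSwift's implementation; the function names and the toy deterministic "model" are invented for the example, and a real system would verify all drafted tokens in a single batched forward pass.

```python
def target_next_token(prefix):
    """Toy stand-in for the target model: next token = (sum of prefix) % 7."""
    return sum(prefix) % 7

def draft_tokens(prefix, k):
    """Toy drafter: cheaply guess k tokens ahead. A real drafter would be a
    smaller model (and would sometimes disagree with the target)."""
    out, cur = [], list(prefix)
    for _ in range(k):
        t = target_next_token(cur)
        out.append(t)
        cur.append(t)
    return out

def speculative_generate(prefix, n_new, k=4):
    """Generate n_new tokens: draft k at a time, keep the longest prefix the
    target model agrees with, then append the target's own token on mismatch."""
    seq = list(prefix)
    while len(seq) < len(prefix) + n_new:
        accepted = 0
        for g in draft_tokens(seq, k):
            if target_next_token(seq) == g:  # verification step
                seq.append(g)
                accepted += 1
            else:
                break
        if accepted < k:  # drafter diverged: take the target model's token
            seq.append(target_next_token(seq))
    return seq[len(prefix):len(prefix) + n_new]
```

Because verification always falls back to the target model's own token, the output is identical to plain autoregressive decoding with the target model; this is the "lossless" property the paper's title refers to, with speedup coming from accepting several drafted tokens per verification pass.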

