

FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling

February 20, 2025
作者: Weilin Zhao, Tengyu Pan, Xu Han, Yudi Zhang, Ao Sun, Yuxiang Huang, Kaihuo Zhang, Weilun Zhao, Yuxuan Li, Jianyong Wang, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2.
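The core idea in the abstract — restricting the draft model's LM head to a frequency-prioritized token subset — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the sizes are toy values, the frequency ranking is a random stand-in, and all names are hypothetical. It only shows how scoring a top-frequency slice of the LM head weight matrix shrinks the draft-time matrix–vector product while candidate ids still map back into the full vocabulary.

```python
import numpy as np

# Toy sizes (hypothetical). In the paper's setting the vocabulary is ~128k,
# and keeping the top 25% of tokens by corpus frequency cuts LM-head FLOPs
# in the draft step by roughly 75%.
vocab_size, hidden_size, keep = 128, 16, 32

rng = np.random.default_rng(0)
lm_head = rng.standard_normal((vocab_size, hidden_size))  # full LM head W

# Stand-in for a precomputed frequency ranking of token ids.
freq_ranked_ids = rng.permutation(vocab_size)[:keep]
sub_head = lm_head[freq_ranked_ids]  # compressed LM head (keep x hidden)

def draft_logits(hidden_state):
    """Draft-time logits over the frequency-ranked subset only."""
    return sub_head @ hidden_state  # shape (keep,) instead of (vocab_size,)

h = rng.standard_normal(hidden_size)
scores = draft_logits(h)
# Map the best draft candidate back to its id in the full vocabulary;
# the verify step then uses the full model, so outputs stay unchanged.
best_token = int(freq_ranked_ids[int(np.argmax(scores))])
```

Because the subset logits are exact rows of the full product, any candidate the draft proposes is scored identically by the full LM head at verification time, which is what preserves the output distribution.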

