EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test
March 3, 2025
Authors: Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang
cs.AI
Abstract
The sequential nature of modern LLMs makes them expensive and slow, and
speculative sampling has proven to be an effective solution to this problem.
Methods like EAGLE perform autoregression at the feature level, reusing
top-layer features from the target model to achieve better results than vanilla
speculative sampling. A growing trend in the LLM community is scaling up
training data to improve model intelligence without increasing inference costs.
However, we observe that scaling up data provides limited improvements for
EAGLE. We identify that this limitation arises from EAGLE's feature prediction
constraints. In this paper, we introduce EAGLE-3, which abandons feature
prediction in favor of direct token prediction and replaces reliance on
top-layer features with multi-layer feature fusion via a technique named
training-time test. These improvements significantly enhance performance and
enable the draft model to fully benefit from scaling up training data. Our
experiments include both chat models and reasoning models, evaluated on five
tasks. The results show that EAGLE-3 achieves a speedup ratio up to 6.5x, with
about 1.4x improvement over EAGLE-2. The code is available at
https://github.com/SafeAILab/EAGLE.
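For readers unfamiliar with the speculative sampling that EAGLE-style methods build on, the verification step can be sketched as follows. This is a minimal, generic illustration of the standard accept/reject rule (accept a drafted token with probability min(1, p_target/p_draft)), not the authors' implementation; the function and argument names are hypothetical, and logits are represented as plain per-position lists over the vocabulary.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def verify_draft(draft_tokens, draft_logits, target_logits, rng=random.random):
    """Standard speculative-sampling verification (hypothetical interface).

    For each drafted token t_i, accept it with probability
    min(1, p_target(t_i) / p_draft(t_i)); stop at the first rejection.
    Returns the prefix of accepted tokens.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p = softmax(target_logits[i])[tok]  # target model's probability of the token
        q = softmax(draft_logits[i])[tok]   # draft model's probability of the token
        if rng() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break  # first rejection ends the accepted prefix
    return accepted
```

When draft and target agree, every token is accepted and the target model verifies several tokens in one forward pass, which is where the speedup comes from; EAGLE-3's contribution is making the draft model's predictions better (via direct token prediction and multi-layer feature fusion) so that more tokens survive this check.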