Adaptive Computation Pruning for the Forgetting Transformer

April 9, 2025
Authors: Zhixuan Lin, Johan Obando-Ceron, Xu Owen He, Aaron Courville
cs.AI

Abstract

The recently proposed Forgetting Transformer (FoX) incorporates a forget gate into softmax attention and has shown consistently better or on-par performance compared to the standard RoPE-based Transformer. Notably, many attention heads in FoX tend to forget quickly, causing their output at each timestep to rely primarily on the local context. Based on this observation, we propose Adaptive Computation Pruning (ACP) for FoX, a method that dynamically prunes computations involving input-output dependencies that are strongly decayed by the forget gate. This is achieved using a dynamically set pruning threshold that ensures that the pruned attention weights remain negligible. We apply ACP to language model pretraining with FoX and show it consistently reduces the number of FLOPs in softmax attention by around 70% across different model sizes and context lengths, resulting in a roughly 10% to 35% improvement in training throughput. Furthermore, longer context lengths yield greater computational savings. All these speed improvements are achieved without any performance degradation. We also perform several analyses to provide deeper insights into our method, such as examining the pruning patterns and analyzing the distribution of FLOP savings across different attention heads. Our code is available at https://github.com/zhixuan-lin/arctic-fox.
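
To make the pruning criterion concrete, here is a minimal single-head PyTorch sketch of the idea, not the authors' implementation: the function name `fox_attention_with_acp`, the assumed bound `qk_bound` on the query-key logits (e.g., obtained via query-key normalization), and the tolerance `tol` are illustrative choices. The official FlashAttention-based kernel at https://github.com/zhixuan-lin/arctic-fox decides which blocks to skip from the decay matrix alone, so the pruned FLOPs are never computed; this dense sketch only masks those entries to show which attention weights are guaranteed to be negligible.

```python
import torch

def fox_attention_with_acp(q, k, v, log_f, qk_bound=5.0, tol=1e-6):
    """Single-head FoX-style attention with an illustrative ACP pruning rule.

    q, k, v  : (T, d) query/key/value tensors.
    log_f    : (T,) log of the per-timestep forget-gate values in (0, 1].
    qk_bound : assumed bound on |q_i . k_j / sqrt(d)| (e.g., via QK normalization).
    tol      : pruned attention weights are guaranteed to be below ~tol.
    """
    T, d = q.shape
    # Cumulative decay bias D[i, j] = sum_{l=j+1}^{i} log f_l for j <= i.
    cum = torch.cumsum(log_f, dim=0)                  # (T,)
    D = cum[:, None] - cum[None, :]                   # (T, T)
    causal = torch.ones(T, T).tril().bool()

    # Pruning decision uses only the decay matrix: the diagonal entry of every
    # row has D = 0, and the query-key logits are bounded by qk_bound, so any
    # entry with D[i, j] + 2 * qk_bound < log(tol) has attention weight < tol.
    threshold = torch.log(torch.tensor(tol)) - 2.0 * qk_bound
    keep = causal & (D >= threshold)

    scores = (q @ k.T) / d ** 0.5 + D                 # real kernel skips pruned blocks here
    scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Tiny usage example with random inputs.
torch.manual_seed(0)
T, d = 16, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)
log_f = torch.log(torch.rand(T) * 0.5 + 0.5)          # forget gates in (0.5, 1]
out = fox_attention_with_acp(q, k, v, log_f)
print(out.shape)                                      # torch.Size([16, 8])
```

Because the threshold depends on the row's own decay values (every row keeps its zero-decay diagonal), the pruned weights stay negligible regardless of how aggressively a given head forgets, which is why the paper reports no performance degradation despite the FLOP savings.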
