
Forgetting Transformer: Softmax Attention with a Forget Gate

March 3, 2025
Authors: Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville
cs.AI

Abstract

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism the Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at https://github.com/zhixuan-lin/forgetting-transformer.
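As a rough illustration of the mechanism described in the abstract, the sketch below shows a single-head causal attention whose unnormalized scores are down-weighted in a data-dependent way: a per-timestep forget gate is accumulated in log space and added as a bias to the attention logits before the softmax. This is a minimal sketch under stated assumptions; the function name `forgetting_attention`, the scalar per-timestep gate, and the exact parameterization are illustrative choices, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F


def forgetting_attention(q, k, v, forget_logits):
    """Single-head sketch of softmax attention with a data-dependent down-weighting.

    q, k, v:        (T, d) query/key/value tensors for one head.
    forget_logits:  (T,) per-timestep logits; sigmoid gives a forget gate in (0, 1).

    Assumption: the gate enters as a cumulative log-space bias on the attention
    logits, i.e. score(i, j) gets sum_{l=j+1..i} log f_l added before the softmax.
    """
    T, d = q.shape
    log_f = F.logsigmoid(forget_logits)           # log of the forget gate, shape (T,)
    cum = torch.cumsum(log_f, dim=0)              # prefix sums of log gates
    # bias[i, j] = cum[i] - cum[j] = total decay accumulated between positions j and i
    bias = cum.unsqueeze(1) - cum.unsqueeze(0)    # (T, T)

    scores = q @ k.transpose(0, 1) / d ** 0.5 + bias
    causal = torch.ones(T, T, dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v


# Tiny usage example with random inputs.
T, d = 8, 16
q, k, v = (torch.randn(T, d) for _ in range(3))
forget_logits = torch.randn(T)
out = forgetting_attention(q, k, v, forget_logits)
print(out.shape)  # torch.Size([8, 16])
```

Because the down-weighting is expressed purely as an additive bias on the attention logits, no positional embedding is needed in this sketch, which is consistent with the abstract's claim that FoX does not require any positional embeddings.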
