
Gated Delta Networks: Improving Mamba2 with Delta Rule

December 9, 2024
Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
cs.AI

Abstract

Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
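The abstract does not write out the recurrence, but the two mechanisms it combines are well established in prior work: Mamba2-style gating decays the whole memory state with a scalar alpha_t, while DeltaNet's delta rule performs a targeted rank-1 overwrite controlled by beta_t. Below is a minimal sequential reference sketch of one plausible gated-delta formulation; the state shape (d_v, d_k), the read-out o_t = S_t q_t, and the helper names gated_delta_step / gated_delta_scan are illustrative assumptions, not the paper's actual (hardware-optimized, parallel) implementation.

```python
import torch

def gated_delta_step(S, k, v, alpha, beta):
    """One step of a gated delta recurrence (illustrative sketch).

    Assumed form (not verbatim from the paper):
        S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T

    S:     (d_v, d_k) associative memory state
    k:     (d_k,) key, assumed L2-normalized
    v:     (d_v,) value
    alpha: scalar gate in [0, 1]; driving it toward 0 erases memory quickly
    beta:  scalar in [0, 1]; write strength of the targeted delta update
    """
    d_k = k.numel()
    erase = torch.eye(d_k) - beta * torch.outer(k, k)   # I - beta k k^T: erase along k
    return alpha * (S @ erase) + beta * torch.outer(v, k)  # decay + rank-1 write

def gated_delta_scan(ks, vs, alphas, betas, qs):
    """Naive O(T) sequential scan for reference; not the parallel training kernel.

    ks, qs: (T, d_k); vs: (T, d_v); alphas, betas: (T,). Returns outputs (T, d_v).
    """
    T, d_k = ks.shape
    d_v = vs.shape[1]
    S = torch.zeros(d_v, d_k)
    outs = []
    for t in range(T):
        S = gated_delta_step(S, ks[t], vs[t], alphas[t], betas[t])
        outs.append(S @ qs[t])  # read-out: o_t = S_t q_t
    return torch.stack(outs)
```

Under this formulation the two limiting cases match the abstract's claim that the mechanisms are complementary: fixing alpha_t = 1 recovers the plain delta rule (DeltaNet-like targeted updates only), while pushing alpha_t toward 0 gives the rapid memory erasure that gating provides.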
