Gated Delta Networks: Improving Mamba2 with Delta Rule

December 9, 2024
Authors: Songlin Yang, Jan Kautz, Ali Hatamizadeh
cs.AI

Abstract

Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
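
The abstract describes the gated delta rule only at a high level. Below is a minimal NumPy sketch of one recurrent step, assuming the natural combination of the two mechanisms: a scalar decay gate applied to the state, followed by a delta-rule write, i.e. S_t = α_t · S_{t−1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ. The function name, shapes, and toy values are illustrative, not the paper's hardware-optimized implementation.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step of the gated delta rule sketched above (illustrative).

    S     : (d_v, d_k) associative memory state
    k     : (d_k,) unit-norm key
    v     : (d_v,) value
    alpha : scalar gate in [0, 1]; small alpha rapidly erases the memory
    beta  : scalar writing strength in [0, 1] for the delta-rule update
    """
    S = alpha * S                             # gating: adaptive, uniform decay
    v_old = S @ k                             # value the memory currently holds for key k
    return S + beta * np.outer(v - v_old, k)  # delta rule: targeted overwrite of k's slot

# Toy usage: write a value under a key, then precisely replace it.
d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k = np.eye(d_k)[0]                            # one-hot key, unit norm
S = gated_delta_step(S, k, np.array([1.0, 0.0, 0.0]), alpha=1.0, beta=1.0)
S = gated_delta_step(S, k, np.array([0.0, 2.0, 0.0]), alpha=1.0, beta=1.0)
print(S @ k)                                  # [0. 2. 0.]: old value fully replaced
```

With alpha near 0 the gate erases the whole memory in one step, while beta controls how strongly the delta rule overwrites only the value associated with k, which is the complementarity the abstract highlights.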
