Gated Delta Networks: Improving Mamba2 with Delta Rule

Abstract

Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
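The complementarity described in the abstract can be illustrated with a small recurrent sketch. The abstract does not state the update formula, so the version below is an assumption: the memory is a matrix S, a scalar gate alpha in [0, 1] decays the whole memory, and a scalar beta in [0, 1] sets how strongly the value stored under key k is moved toward v, giving S <- alpha * S (I - beta k k^T) + beta v k^T. The names gated_delta_step, read, alpha, and beta are illustrative, and this is a step-by-step toy, not the parallel training algorithm the authors developed.

```python
def gated_delta_step(S, k, v, alpha, beta):
    """One assumed recurrent step: S <- alpha * S (I - beta k k^T) + beta v k^T.

    S is a d_v x d_k matrix (list of rows), k a key of length d_k,
    v a value of length d_v. alpha and beta are scalars in [0, 1].
    """
    # Gating: decay every entry of the memory by alpha.
    decayed = [[alpha * s for s in row] for row in S]
    # Read what the decayed memory currently returns for key k.
    pred = [sum(r * kj for r, kj in zip(row, k)) for row in decayed]
    # Delta rule: correct only the component stored under k, toward v.
    return [
        [r - beta * (p - vi) * kj for r, kj in zip(row, k)]
        for row, p, vi in zip(decayed, pred, v)
    ]


def read(S, q):
    """Retrieve the value associated with query q (matrix-vector product)."""
    return [sum(r * qj for r, qj in zip(row, q)) for row in S]


k1, k2 = [1.0, 0.0], [0.0, 1.0]          # two orthonormal keys
S = [[0.0, 0.0], [0.0, 0.0]]             # empty memory

# Write two associations with no decay (alpha = 1) and a full write (beta = 1).
S = gated_delta_step(S, k1, [1.0, 2.0], alpha=1.0, beta=1.0)
S = gated_delta_step(S, k2, [3.0, 4.0], alpha=1.0, beta=1.0)

# Targeted update: overwrite the value under k1; the entry under k2 is untouched.
S = gated_delta_step(S, k1, [5.0, 6.0], alpha=1.0, beta=1.0)
updated_k1, kept_k2 = read(S, k1), read(S, k2)   # [5.0, 6.0] and [3.0, 4.0]

# Rapid erasure: alpha = 0 clears the old memory before the new write.
S = gated_delta_step(S, k1, [7.0, 8.0], alpha=0.0, beta=1.0)
erased_k2 = read(S, k2)                          # [0.0, 0.0]
```

With alpha fixed at 1 the step reduces to a plain delta-rule update, and with beta-driven correction removed it reduces to a gated additive update, which is one way to see why the two mechanisms are described as complementary.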
