
Taipan: Efficient and Expressive State Space Language Models with Selective Attention

October 24, 2024
作者: Chien Van Nguyen, Huy Huu Nguyen, Thang M. Pham, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Ryan A. Rossi, Trung Bui, Viet Dac Lai, Franck Dernoncourt, Thien Huu Nguyen
cs.AI

Abstract

Efficient long-context language modeling remains a significant challenge in Natural Language Processing (NLP). While Transformers dominate language tasks, they struggle with long sequences due to quadratic computational complexity in training and linearly scaling memory costs during inference. Recent State Space Models (SSMs) such as Mamba offer alternatives with constant memory usage, but they underperform in tasks requiring extensive in-context retrieval. We introduce Taipan, a novel hybrid architecture that combines Mamba-2 with Selective Attention Layers (SALs). These SALs identify tokens requiring long-range interactions, remove less important features, and then augment their representations using the attention module. This approach balances Mamba's efficiency with Transformer-like performance in memory-intensive tasks. By constraining the attention budget, Taipan extends accurate predictions to context lengths of up to 1 million tokens while preserving computational efficiency. Our experiments demonstrate Taipan's superior performance across various scales and tasks, offering a promising solution for efficient long-context language modeling.
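The selective-attention idea sketched in the abstract (score each token, route only a constrained budget of tokens through a full attention module, and augment just those representations while the rest pass through the Mamba backbone unchanged) can be illustrated with a short PyTorch sketch. This is a hypothetical reconstruction, not the authors' implementation: the linear importance scorer, the top-k budget, the way augmented tokens are merged back, and the omission of the surrounding Mamba-2 block are all simplifying assumptions.

```python
# Minimal, hypothetical sketch of a "selective attention layer" (SAL).
# Not the Taipan authors' code: the scorer, budget, and merge strategy
# are illustrative choices, and the Mamba-2 mixing block is omitted.

import torch
import torch.nn as nn


class SelectiveAttentionLayer(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 4, attention_budget: float = 0.15):
        super().__init__()
        self.budget = attention_budget                      # fraction of tokens routed to attention (assumption)
        self.scorer = nn.Linear(d_model, 1)                 # per-token importance score (assumption)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) hidden states, e.g. the output of a Mamba-2 block
        B, T, D = x.shape
        k = max(1, int(self.budget * T))                    # constrained attention budget

        scores = self.scorer(x).squeeze(-1)                 # (B, T) token importance
        topk = scores.topk(k, dim=-1).indices               # tokens flagged as needing long-range interaction
        topk, _ = topk.sort(dim=-1)                         # keep original sequence order

        idx = topk.unsqueeze(-1).expand(-1, -1, D)          # (B, k, D) gather/scatter index
        selected = x.gather(1, idx)                         # representations of the selected tokens

        # Augment only the selected tokens with attention among themselves.
        attn_out, _ = self.attn(selected, selected, selected)
        augmented = self.norm(selected + attn_out)

        # Scatter the augmented tokens back; unselected tokens pass through unchanged.
        out = x.clone()
        out.scatter_(1, idx, augmented)
        return out


if __name__ == "__main__":
    layer = SelectiveAttentionLayer(d_model=64)
    h = torch.randn(2, 128, 64)                             # stand-in for Mamba-2 outputs
    print(layer(h).shape)                                   # torch.Size([2, 128, 64])
```

Because attention is applied only to the budgeted subset of tokens, its cost grows with the budget rather than with the full sequence length, which is how this kind of hybrid keeps near-constant memory while recovering Transformer-like retrieval behavior.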
