MoBA: Mixture of Block Attention for Long-Context LLMs
February 18, 2025
Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
cs.AI
Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.

In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
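The abstract describes MoBA only at a high level. As a rough illustration of the idea it names, routing each query to a small set of key/value blocks in MoE style, the sketch below is a minimal, hypothetical PyTorch implementation rather than the released code (see the linked repository for that). The function name moba_attention_sketch, the mean-pooled block representations used for routing, and the block_size/top_k parameters are illustrative assumptions.

```python
# Minimal, hypothetical sketch of MoE-style block attention (single head, no batching).
# Assumed mechanics: keys/values are split into fixed-size blocks, each block is
# summarized by its mean-pooled key, and every query attends only to its top-k
# highest-scoring past blocks plus its own block (with causal masking inside blocks).
import math

import torch
import torch.nn.functional as F


def moba_attention_sketch(q, k, v, block_size=4, top_k=2):
    """q, k, v: [seq_len, head_dim] tensors for a single attention head."""
    seq_len, d = q.shape
    n_blocks = math.ceil(seq_len / block_size)

    # Pad the sequence so it divides evenly into blocks.
    pad = n_blocks * block_size - seq_len
    k_pad = F.pad(k, (0, 0, 0, pad))
    v_pad = F.pad(v, (0, 0, 0, pad))

    # One routing feature per block: the mean-pooled key of that block.
    k_blocks = k_pad.view(n_blocks, block_size, d).mean(dim=1)          # [n_blocks, d]

    # Gate scores between every query and every block, restricted to blocks
    # that start at or before the query position (block-level causality).
    gate = q @ k_blocks.t() / math.sqrt(d)                              # [seq_len, n_blocks]
    q_pos = torch.arange(seq_len).unsqueeze(1)                          # [seq_len, 1]
    block_start = torch.arange(n_blocks).unsqueeze(0) * block_size      # [1, n_blocks]
    gate = gate.masked_fill(block_start > q_pos, float("-inf"))

    # Select the top-k blocks per query and always include the query's own block.
    topk_idx = torch.topk(gate, k=min(top_k, n_blocks), dim=-1).indices
    block_mask = torch.zeros(seq_len, n_blocks, dtype=torch.bool)
    block_mask.scatter_(1, topk_idx, True)
    block_mask[torch.arange(seq_len), q_pos.squeeze(1) // block_size] = True

    # Expand the block selection to a token-level mask and apply token-level causality.
    token_mask = block_mask.repeat_interleave(block_size, dim=1)        # [seq_len, padded_len]
    token_pos = torch.arange(n_blocks * block_size).unsqueeze(0)
    token_mask &= token_pos <= q_pos

    # Standard softmax attention, restricted to the selected (and causally valid) tokens.
    scores = q @ k_pad.t() / math.sqrt(d)
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_pad


if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(10, 16) for _ in range(3))
    print(moba_attention_sketch(q, k, v).shape)  # torch.Size([10, 16])
```

One way to read the claim about transitioning between full and sparse attention: if top_k is set to the total number of blocks, every block is selected and the computation reduces to ordinary full causal attention, so the sparsity level can be adjusted without changing the parameterization.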