MoBA: Mixture of Block Attention for Long-Context LLMs
February 18, 2025
Authors: Enzhe Lu, Zhejun Jiang, Jingyuan Liu, Yulun Du, Tao Jiang, Chao Hong, Shaowei Liu, Weiran He, Enming Yuan, Yuzhi Wang, Zhiqi Huang, Huan Yuan, Suting Xu, Xinran Xu, Guokun Lai, Yanru Chen, Huabin Zheng, Junjie Yan, Jianlin Su, Yuxin Wu, Neo Y. Zhang, Zhilin Yang, Xinyu Zhou, Mingxing Zhang, Jiezhong Qiu
cs.AI
Abstract
Scaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention, which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.

In this work, we propose a solution that adheres to the "less structure" principle, allowing the model to determine where to attend autonomously, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. MoBA has already been deployed to support Kimi's long-context requests and demonstrates significant advancements in efficient attention computation for LLMs. Our code is available at https://github.com/MoonshotAI/MoBA.
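The abstract describes MoBA only at a high level. As a rough illustration of the idea it names, routing each query to a small set of key/value blocks in MoE style, the sketch below is a minimal, hypothetical PyTorch implementation rather than the released code (see the linked repository for that). The function name moba_attention_sketch, the mean-pooled block representations used for routing, and the block_size/top_k parameters are illustrative assumptions.

```python
# Minimal, hypothetical sketch of MoE-style block attention (single head, no batching).
# Assumed mechanics: keys/values are split into fixed-size blocks, each block is
# summarized by its mean-pooled key, and every query attends only to its top-k
# highest-scoring past blocks plus its own block (with causal masking inside blocks).
import math

import torch
import torch.nn.functional as F


def moba_attention_sketch(q, k, v, block_size=4, top_k=2):
    """q, k, v: [seq_len, head_dim] tensors for a single attention head."""
    seq_len, d = q.shape
    n_blocks = math.ceil(seq_len / block_size)

    # Pad the sequence so it divides evenly into blocks.
    pad = n_blocks * block_size - seq_len
    k_pad = F.pad(k, (0, 0, 0, pad))
    v_pad = F.pad(v, (0, 0, 0, pad))

    # One routing feature per block: the mean-pooled key of that block.
    k_blocks = k_pad.view(n_blocks, block_size, d).mean(dim=1)          # [n_blocks, d]

    # Gate scores between every query and every block, restricted to blocks
    # that start at or before the query position (block-level causality).
    gate = q @ k_blocks.t() / math.sqrt(d)                              # [seq_len, n_blocks]
    q_pos = torch.arange(seq_len).unsqueeze(1)                          # [seq_len, 1]
    block_start = torch.arange(n_blocks).unsqueeze(0) * block_size      # [1, n_blocks]
    gate = gate.masked_fill(block_start > q_pos, float("-inf"))

    # Select the top-k blocks per query and always include the query's own block.
    topk_idx = torch.topk(gate, k=min(top_k, n_blocks), dim=-1).indices
    block_mask = torch.zeros(seq_len, n_blocks, dtype=torch.bool)
    block_mask.scatter_(1, topk_idx, True)
    block_mask[torch.arange(seq_len), q_pos.squeeze(1) // block_size] = True

    # Expand the block selection to a token-level mask and apply token-level causality.
    token_mask = block_mask.repeat_interleave(block_size, dim=1)        # [seq_len, padded_len]
    token_pos = torch.arange(n_blocks * block_size).unsqueeze(0)
    token_mask &= token_pos <= q_pos

    # Standard softmax attention, restricted to the selected (and causally valid) tokens.
    scores = q @ k_pad.t() / math.sqrt(d)
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v_pad


if __name__ == "__main__":
    torch.manual_seed(0)
    q, k, v = (torch.randn(10, 16) for _ in range(3))
    print(moba_attention_sketch(q, k, v).shape)  # torch.Size([10, 16])
```

One way to read the claim about transitioning between full and sparse attention: if top_k is set to the total number of blocks, every block is selected and the computation reduces to ordinary full causal attention, so the sparsity level can be adjusted without changing the parameterization.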