Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts
March 7, 2025
Authors: Shwai He, Weilin Cai, Jiayi Huang, Ang Li
cs.AI
Abstract
The Mixture of Experts (MoE) is an effective architecture for scaling large
language models by leveraging sparse expert activation, optimizing the
trade-off between performance and efficiency. However, under expert
parallelism, MoE suffers from inference inefficiencies due to imbalanced
token-to-expert assignment, where some experts are overloaded while others
remain underutilized. This imbalance leads to poor resource utilization and
increased latency, as the most burdened expert dictates the overall delay, a
phenomenon we define as the Straggler Effect. To mitigate
this, we propose Capacity-Aware Inference, comprising two key techniques: (1)
Capacity-Aware Token Drop, which discards tokens that overflow an
expert's capacity to regulate the maximum latency of MoE, and (2) Capacity-Aware
Token Reroute, which reallocates overflowed tokens to underutilized experts,
balancing the token distribution. These techniques collectively optimize both
high-load and low-load expert utilization, leading to a more efficient MoE
inference pipeline. Extensive experiments demonstrate the effectiveness of our
methods, showing significant improvements in inference efficiency, e.g., a 0.2%
average performance increase and a 1.94× inference speedup on
Mixtral-8×7B-Instruct.
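To make the two techniques concrete, below is a minimal sketch of capacity-aware dispatch for a single top-1-routed MoE layer. This is not the authors' implementation: the function name capacity_aware_dispatch, the capacity_factor default, and the top-1 routing assumption are all illustrative choices; the paper's actual method may differ in how capacity is set and which tokens are treated as overflow.

```python
# Hypothetical sketch of Capacity-Aware Token Drop / Reroute, not the
# paper's code. Assumes top-1 routing and a per-expert capacity derived
# from the average load times a capacity factor.
import torch

def capacity_aware_dispatch(router_logits, num_experts,
                            capacity_factor=1.25, reroute=True):
    """Assign tokens to experts under a hard per-expert capacity.

    router_logits: (num_tokens, num_experts) raw router scores.
    Returns one expert id per token; -1 marks a dropped token.
    """
    num_tokens = router_logits.shape[0]
    # Per-expert capacity: average load scaled by capacity_factor.
    capacity = int(capacity_factor * num_tokens / num_experts)

    probs = router_logits.softmax(dim=-1)
    top_prob, top_expert = probs.max(dim=-1)      # top-1 routing
    assignment = top_expert.clone()

    load = torch.bincount(top_expert, minlength=num_experts)
    for e in torch.nonzero(load > capacity).flatten().tolist():
        tokens = torch.nonzero(assignment == e).flatten()
        # Keep the `capacity` highest-scoring tokens; the rest overflow.
        order = top_prob[tokens].argsort(descending=True)
        overflow = tokens[order[capacity:]]
        if not reroute:
            assignment[overflow] = -1             # Capacity-Aware Token Drop
            continue
        for t in overflow.tolist():               # Capacity-Aware Token Reroute
            counts = torch.bincount(assignment[assignment >= 0],
                                    minlength=num_experts)
            counts[counts >= capacity] = num_tokens  # mask full experts
            target = counts.argmin().item()
            # Send the token to the least-loaded expert with spare
            # capacity; drop it if every expert is already full.
            assignment[t] = target if counts[target] < num_tokens else -1
    return assignment

# Tiny usage example: 16 tokens routed across 8 experts.
logits = torch.randn(16, 8)
assignments = capacity_aware_dispatch(logits, num_experts=8)
```

With reroute=False the sketch performs pure Token Drop; with reroute=True, overflowed tokens migrate to the least-loaded expert that still has spare capacity. Either way, no expert ever processes more than `capacity` tokens, which is what bounds the busiest expert's latency and mitigates the Straggler Effect described in the abstract.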