

HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

April 8, 2025
Authors: Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li
cs.AI

Abstract

The Mixture of Experts (MoE) architecture has demonstrated significant advantages as it enables increasing model capacity without a proportional increase in computation. However, the large MoE model size still introduces substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead but faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, the hybrid CPU-GPU schedule for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, and other factors. To address these challenges, in this paper, we propose HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to state-of-the-art hybrid MoE inference frameworks. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
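
The abstract names the scheduling and caching ideas only at a high level. The minimal Python sketch below illustrates how a score-based expert cache and a cost-based CPU-vs-GPU placement decision might interact; the class and function names (ScoreCache, schedule_expert) and all latency constants are illustrative assumptions, not HybriMoE's actual algorithms or implementation.

```python
# Hypothetical sketch of two ideas named in the abstract: a score-based
# expert cache and a per-expert CPU/GPU placement decision. Names and
# latency constants are placeholders, not the HybriMoE implementation.
from dataclasses import dataclass, field


@dataclass
class ScoreCache:
    """Keep the highest-scoring experts resident on the GPU.

    Scores decay every step and are boosted when an expert is activated,
    so recently and frequently used experts tend to stay cached.
    """
    capacity: int
    decay: float = 0.9
    scores: dict = field(default_factory=dict)   # expert id -> score
    resident: set = field(default_factory=set)   # expert ids on GPU

    def step(self, activated: list[int]) -> None:
        for eid in self.scores:
            self.scores[eid] *= self.decay
        for eid in activated:
            self.scores[eid] = self.scores.get(eid, 0.0) + 1.0

    def admit(self, eid: int) -> None:
        """Insert an expert, evicting the lowest-scoring resident if full."""
        if eid in self.resident:
            return
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda e: self.scores.get(e, 0.0))
            self.resident.discard(victim)
        self.resident.add(eid)


def schedule_expert(eid: int, cache: ScoreCache,
                    cpu_compute_ms: float = 2.0,
                    gpu_compute_ms: float = 0.3,
                    transfer_ms: float = 4.0) -> str:
    """Decide where to run one activated expert for the current batch.

    Cached experts run on the GPU; for misses, running on the CPU is
    compared against paying the host-to-device transfer plus GPU compute.
    """
    if eid in cache.resident:
        return "gpu"
    if cpu_compute_ms <= transfer_ms + gpu_compute_ms:
        return "cpu"
    cache.admit(eid)  # fetch the weights and keep them for future tokens
    return "gpu"


if __name__ == "__main__":
    cache = ScoreCache(capacity=4)
    # Prefill-like step: a large batch makes CPU compute slow, so misses are fetched.
    cache.step([0, 3, 5])
    print({eid: schedule_expert(eid, cache, cpu_compute_ms=20.0) for eid in [0, 3, 5]})
    # Decode-like step: one token, CPU compute is cheap enough to avoid transfers.
    cache.step([0, 7])
    print({eid: schedule_expert(eid, cache, cpu_compute_ms=2.0) for eid in [0, 7]})
    print("resident on GPU:", sorted(cache.resident))
```

In this toy run, the prefill-like step fetches its missed experts to the GPU, while the decode-like step keeps the cached expert on the GPU and computes the miss on the CPU. The decayed-score policy is one simple way to blend recency and frequency when activation patterns are unstable; the paper's score-based caching and impact-driven prefetching are more involved than this sketch.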
