
HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference

April 8, 2025
Authors: Shuzhang Zhong, Yanfan Sun, Ling Liang, Runsheng Wang, Ru Huang, Meng Li
cs.AI

Abstract

The Mixture of Experts (MoE) architecture has demonstrated significant advantages, as it increases model capacity without a proportional increase in computation. However, the large size of MoE models still introduces substantial memory demands, which usually requires expert offloading on resource-constrained platforms and incurs significant overhead. Hybrid CPU-GPU inference has been proposed to leverage CPU computation to reduce expert loading overhead, but it faces major challenges: on one hand, the expert activation patterns of MoE models are highly unstable, rendering the fixed mapping strategies in existing works inefficient; on the other hand, hybrid CPU-GPU scheduling for MoE is inherently complex due to diverse expert sizes and structures, uneven workload distribution, and other factors. To address these challenges, this paper proposes HybriMoE, a hybrid CPU-GPU inference framework that improves resource utilization through a novel CPU-GPU scheduling and cache management system. HybriMoE introduces (i) a dynamic intra-layer scheduling strategy to balance workloads across the CPU and GPU, (ii) an impact-driven inter-layer prefetching algorithm, and (iii) a score-based caching algorithm to mitigate expert activation instability. We implement HybriMoE on top of the kTransformers framework and evaluate it on three widely used MoE-based LLMs. Experimental results demonstrate that HybriMoE achieves an average speedup of 1.33× in the prefill stage and 1.70× in the decode stage compared to state-of-the-art hybrid MoE inference frameworks. Our code is available at: https://github.com/PKU-SEC-Lab/HybriMoE.
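To make the idea of score-based expert caching concrete, the sketch below is a minimal, hypothetical illustration: experts accumulate a decayed activation score, and the lowest-scoring resident expert is evicted on a miss. The class name `ScoreBasedExpertCache`, the decay factor, and the use of gate weights as scores are assumptions for illustration only; this is not HybriMoE's actual caching, scheduling, or prefetching algorithm (see the paper and repository for those).

```python
from collections import defaultdict


class ScoreBasedExpertCache:
    """Hypothetical score-based cache for MoE experts resident on the GPU.

    Each expert accumulates a decayed activation score; on a miss with a full
    cache, the lowest-scoring resident expert is evicted. Illustrative only,
    not HybriMoE's algorithm.
    """

    def __init__(self, capacity: int, decay: float = 0.9):
        self.capacity = capacity        # max experts kept on the GPU
        self.decay = decay              # per-access decay applied to all scores
        self.scores = defaultdict(float)
        self.resident = set()           # expert ids currently cached on GPU

    def access(self, expert_id: int, gate_weight: float) -> bool:
        """Record an expert activation; return True on a cache hit."""
        # Decay old scores so recent activations dominate.
        for eid in list(self.scores):
            self.scores[eid] *= self.decay
        self.scores[expert_id] += gate_weight

        if expert_id in self.resident:
            return True

        # Cache miss: evict the lowest-scoring resident expert if full.
        if len(self.resident) >= self.capacity:
            victim = min(self.resident, key=lambda eid: self.scores[eid])
            self.resident.remove(victim)
        self.resident.add(expert_id)
        return False


if __name__ == "__main__":
    cache = ScoreBasedExpertCache(capacity=2)
    for eid, w in [(0, 0.7), (1, 0.3), (0, 0.8), (2, 0.5), (0, 0.9)]:
        hit = cache.access(eid, w)
        print(f"expert {eid}: {'hit' if hit else 'miss'}")
```

The design choice illustrated here is that frequently and recently activated experts stay on the GPU even under unstable activation patterns, which is the general motivation the abstract gives for score-based caching over fixed expert-to-device mappings.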
