Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design
October 24, 2024
作者: Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang Wang
cs.AI
Abstract
The proliferation of large language models (LLMs) has led to the adoption of
Mixture-of-Experts (MoE) architectures that dynamically leverage specialized
subnetworks for improved efficiency and performance. Despite their benefits,
MoE models face significant challenges during inference, including inefficient
memory management and suboptimal batching, due to misaligned design choices
between the model architecture and the system policies. Furthermore, the
conventional approach of training MoEs from scratch is increasingly prohibitive
in terms of cost. In this paper, we propose Read-ME, a novel framework that
transforms pre-trained dense LLMs into smaller MoE models (in contrast to
"upcycling" generalist MoEs), avoiding the high costs of ground-up training.
Our approach employs activation sparsity to extract experts. To compose
experts, we examine the widely adopted layer-wise router design, show its
redundancy, and introduce a pre-gating router decoupled from the MoE backbone
that facilitates system-friendly pre-computation and lookahead scheduling,
enhancing expert-aware batching and caching. Our co-design therefore
addresses critical gaps on both the algorithmic and system fronts, establishing
a scalable and efficient alternative for LLM inference in resource-constrained
settings. Read-ME outperforms other popular open-source dense models of similar
scale, achieving improvements of up to 10.1% on MMLU and improving mean
end-to-end latency by up to 6.1%. Code is available at:
https://github.com/VITA-Group/READ-ME.
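To make the decoupled pre-gating idea concrete, below is a minimal, hypothetical sketch of a router that predicts expert choices for every MoE layer up front from the prompt representation, so a serving system can prefetch expert weights and batch requests that share experts. The module names, pooling strategy, and top-1 routing here are illustrative assumptions based only on the abstract, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): a pre-gating router decoupled
# from the MoE backbone. It reads a pooled prompt representation once and emits
# expert choices for all MoE layers before decoding starts, enabling
# lookahead scheduling, expert-aware batching, and weight prefetching.
import torch
import torch.nn as nn


class PreGatingRouter(nn.Module):
    """Predicts top-1 expert ids for all layers from a pooled prompt embedding."""

    def __init__(self, hidden_dim: int, num_layers: int, num_experts: int):
        super().__init__()
        # One small linear head per MoE layer; shapes are illustrative only.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_experts) for _ in range(num_layers)]
        )

    def forward(self, pooled_hidden: torch.Tensor) -> torch.Tensor:
        # pooled_hidden: [batch, hidden_dim], e.g. mean-pooled prompt embeddings.
        logits = torch.stack([head(pooled_hidden) for head in self.heads], dim=1)
        # [batch, num_layers] expert ids, known before any backbone layer runs.
        return logits.argmax(dim=-1)


# Toy usage: with the routing plan known in advance, a scheduler could group
# requests that hit the same experts and prefetch those weights into GPU memory.
router = PreGatingRouter(hidden_dim=512, num_layers=8, num_experts=4)
expert_plan = router(torch.randn(2, 512))
print(expert_plan.shape)  # torch.Size([2, 8])
```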