

Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

October 24, 2024
作者: Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang Wang
cs.AI

Abstract

The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the model architecture and the system policies. Furthermore, the conventional approach of training MoEs from scratch is increasingly prohibitive in terms of cost. In this paper, we propose Read-ME, a novel framework that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose the experts, we examine the widely adopted layer-wise router design, show its redundancy, and instead introduce a pre-gating router decoupled from the MoE backbone that facilitates system-friendly pre-computation and lookahead scheduling, enhancing expert-aware batching and caching. Our co-design therefore addresses critical gaps on both the algorithmic and system fronts, establishing a scalable and efficient alternative for LLM inference in resource-constrained settings. Read-ME outperforms other popular open-source dense models of similar scale, achieving improvements of up to 10.1% on MMLU and improving mean end-to-end latency by up to 6.1%. Code is available at: https://github.com/VITA-Group/READ-ME.
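The pre-gating idea is easiest to see in code. Below is a minimal, self-contained PyTorch sketch of a router decoupled from the MoE backbone: it predicts the expert choice for every MoE layer up front from the input representation, so a serving system could prefetch or cache the needed expert weights and group tokens by expert before the backbone runs. All names here (PreGatingRouter, DecoupledMoEBlock) and the top-1, one-head-per-layer design are illustrative assumptions, not the Read-ME implementation (see the linked repository for that); attention layers and per-token streaming details are omitted for brevity.

```python
# Illustrative sketch only: names and design choices are assumptions,
# not taken from the Read-ME codebase.
import torch
import torch.nn as nn


class PreGatingRouter(nn.Module):
    """Standalone router: predicts the expert id for every MoE layer
    up front, before the backbone runs (enables lookahead scheduling)."""

    def __init__(self, d_model: int, n_layers: int, n_experts: int):
        super().__init__()
        # One small linear head per MoE layer, all fed by the same features.
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, n_experts) for _ in range(n_layers)]
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: [batch, d_model]  ->  expert ids per layer: [n_layers, batch]
        logits = torch.stack([head(h) for head in self.heads])  # [L, B, E]
        return logits.argmax(dim=-1)                            # top-1 routing


class DecoupledMoEBlock(nn.Module):
    """One MoE FFN layer with no internal router: it consumes the
    expert ids that the pre-gating router already computed."""

    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                           nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, h: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
        # h: [batch, d_model], expert_ids: [batch]
        out = torch.zeros_like(h)
        # Expert-aware batching: tokens routed to the same expert are
        # processed together, so each expert's weights are touched once.
        for e in expert_ids.unique().tolist():
            mask = expert_ids == e
            out[mask] = self.experts[e](h[mask])
        return out


if __name__ == "__main__":
    d_model, d_ff, n_layers, n_experts, batch = 64, 256, 4, 8, 16
    router = PreGatingRouter(d_model, n_layers, n_experts)
    layers = nn.ModuleList(
        [DecoupledMoEBlock(d_model, d_ff, n_experts) for _ in range(n_layers)]
    )

    h = torch.randn(batch, d_model)
    # Routing for all layers is known before any MoE layer executes,
    # so a scheduler could prefetch exactly the experts it will need.
    plan = router(h)                      # [n_layers, batch]
    for layer_idx, layer in enumerate(layers):
        h = layer(h, plan[layer_idx])
    print(h.shape)                        # torch.Size([16, 64])
```

Because the routing plan exists before the backbone runs, memory management and batching decisions can be made per expert rather than per layer at execution time, which is the system-side benefit the abstract refers to.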

