Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

Abstract

The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, because the design choices of the model architecture and the system policies are misaligned. Furthermore, the conventional approach of training MoEs from scratch is increasingly cost-prohibitive. In this paper, we propose Read-ME, a novel framework that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose experts, we examine the widely adopted layer-wise router design and show its redundancy; we therefore introduce a pre-gating router decoupled from the MoE backbone, which facilitates system-friendly pre-computing and lookahead scheduling and enhances expert-aware batching and caching. Our co-design thus addresses critical gaps on both the algorithmic and system fronts, establishing a scalable and efficient alternative for LLM inference in resource-constrained settings. Read-ME outperforms other popular open-source dense models of similar scale, achieving improvements of up to 10.1% on MMLU and reducing mean end-to-end latency by up to 6.1%. Code is available at: https://github.com/VITA-Group/READ-ME.
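To make the idea of a router decoupled from the backbone more concrete, here is a minimal Python sketch of how routing decisions that are known before the forward pass could drive expert-aware batching and lookahead prefetching. It is an illustration under stated assumptions, not the authors' implementation: the function names (pre_gate, expert_aware_batches, prefetch_plan), the per-request routing granularity, and the toy hash-style router are all hypothetical.

```python
from collections import defaultdict


def pre_gate(token_ids, num_experts):
    """Hypothetical stand-in for a pre-gating router.

    Assumption: one expert is chosen per request, before any backbone
    layer runs. A real router would be a learned module; this toy rule
    only makes the example deterministic.
    """
    return sum(token_ids) % num_experts


def expert_aware_batches(requests, num_experts, max_batch_size):
    """Group requests that share an expert into the same batch."""
    groups = defaultdict(list)
    for req_id, token_ids in requests:
        groups[pre_gate(token_ids, num_experts)].append(req_id)

    batches = []
    for expert in sorted(groups):
        ids = groups[expert]
        # Split each expert's group into chunks of at most max_batch_size.
        for start in range(0, len(ids), max_batch_size):
            batches.append((expert, ids[start:start + max_batch_size]))
    return batches


def prefetch_plan(batches):
    """Lookahead: for each batch, name the expert needed by the next batch.

    Because routing is known in advance, the next expert's weights could
    be loaded into the cache while the current batch is still running.
    """
    plan = []
    for i in range(len(batches)):
        next_expert = batches[i + 1][0] if i + 1 < len(batches) else None
        plan.append(next_expert)
    return plan


if __name__ == "__main__":
    requests = [("a", [1, 2]), ("b", [4]), ("c", [3]), ("d", [2, 2])]
    batches = expert_aware_batches(requests, num_experts=2, max_batch_size=2)
    print(batches)
    print(prefetch_plan(batches))
```

The point of the sketch is the ordering of work: since the routing step does not depend on intermediate activations of the backbone, the scheduler can see every expert assignment up front, which is what makes grouping by expert and prefetching the next expert possible at all.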
