

CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference

February 6, 2025
作者: Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu
cs.AI

Abstract

Large language models (LLMs) achieve impressive performance by scaling model parameters, but this comes with significant inference overhead. Feed-forward networks (FFNs), which dominate LLM parameters, exhibit high activation sparsity in hidden neurons. To exploit this, researchers have proposed using a mixture-of-experts (MoE) architecture, where only a subset of parameters is activated. However, existing approaches often require extensive training data and resources, limiting their practicality. We propose CMoE (Carved MoE), a novel framework to efficiently carve MoE models from dense models. CMoE achieves remarkable performance through efficient expert grouping and lightweight adaptation. First, neurons are grouped into shared and routed experts based on activation rates. Next, we construct a routing mechanism without training from scratch, incorporating a differentiable routing process and load balancing. Using modest data, CMoE produces a well-designed, usable MoE from a 7B dense model within five minutes. With lightweight fine-tuning, it achieves high-performance recovery in under an hour. We make our code publicly available at https://github.com/JarvisPei/CMoE.
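To make the expert-carving step concrete, below is a minimal, illustrative sketch (not the authors' implementation) of grouping FFN hidden neurons into a shared expert and routed experts by activation rate on calibration data. The function name `carve_experts`, the parameters `num_shared` and `num_routed_experts`, and the simple balanced split into routed groups are assumptions for illustration only; CMoE's actual grouping and routing construction is more involved.

```python
# Illustrative sketch only: partition FFN hidden neurons into a shared expert
# (highest activation rates) and evenly sized routed experts.
import torch

def carve_experts(hidden_acts: torch.Tensor, num_shared: int, num_routed_experts: int):
    """hidden_acts: [num_tokens, d_ffn] post-activation values from calibration data."""
    # Activation rate: fraction of calibration tokens on which each neuron fires (> 0).
    act_rate = (hidden_acts > 0).float().mean(dim=0)  # [d_ffn]

    # Neurons with the highest activation rates form the always-active shared expert.
    order = torch.argsort(act_rate, descending=True)
    shared_idx = order[:num_shared]
    routed_idx = order[num_shared:]

    # Remaining neurons are split evenly into routed experts (hypothetical balanced split).
    routed_groups = torch.chunk(routed_idx, num_routed_experts)
    return shared_idx, list(routed_groups)

# Example: 4096 calibration tokens, FFN hidden size 11008 (7B-scale model).
acts = torch.relu(torch.randn(4096, 11008))
shared, routed = carve_experts(acts, num_shared=1376, num_routed_experts=7)
```

In this toy setup, the shared expert stays active for every token, while a router (constructed without training from scratch in CMoE) would select among the routed groups per token.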

