
Autonomy-of-Experts Models

January 22, 2025
Authors: Ang Lv, Ruobing Xie, Yining Qian, Songhao Wu, Xingwu Sun, Zhanhui Kang, Di Wang, Rui Yan
cs.AI

Abstract

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only partial parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
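
The mechanism described in the abstract (each expert pre-computing a low-rank internal activation, experts being ranked per token by the norm of that activation, and only the top-ranked experts completing the forward pass) can be illustrated with a short sketch. The code below is a minimal, hypothetical PyTorch illustration, not the authors' implementation: names such as `AoESparseLayer` and `d_low`, and the softmax-over-norms weighting of the surviving experts, are assumptions made for the example.

```python
# Minimal sketch of router-free, activation-norm-based expert selection in the
# spirit of Autonomy-of-Experts (AoE). All module/variable names are illustrative
# assumptions, not taken from the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AoESparseLayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, d_low: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # Each expert's up-projection is factorized into a cheap low-rank part (w_a)
        # and the remaining part (w_b), so w_a can be pre-computed for every expert.
        self.w_a = nn.Parameter(torch.randn(n_experts, d_model, d_low) * 0.02)  # d_model -> d_low
        self.w_b = nn.Parameter(torch.randn(n_experts, d_low, d_ff) * 0.02)     # d_low   -> d_ff
        self.w_o = nn.Parameter(torch.randn(n_experts, d_ff, d_model) * 0.02)   # d_ff    -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        # 1) Every expert pre-computes its low-rank activation for every token.
        low = torch.einsum("td,edr->etr", x, self.w_a)        # (n_experts, n_tokens, d_low)
        # 2) Experts rank themselves per token by the norm of that activation
        #    (self-evaluation replaces the router).
        scores = low.norm(dim=-1)                             # (n_experts, n_tokens)
        top_val, top_idx = scores.topk(self.top_k, dim=0)     # (top_k, n_tokens)
        weights = F.softmax(top_val, dim=0)                   # assumed mixing weights
        # 3) Only the top-ranking experts finish the forward pass; the rest abort.
        out = torch.zeros_like(x)
        tok = torch.arange(x.size(0), device=x.device)
        for k in range(self.top_k):
            e = top_idx[k]                                    # chosen expert id per token
            h = torch.einsum("tr,trf->tf", low[e, tok], self.w_b[e])
            y = torch.einsum("tf,tfd->td", F.silu(h), self.w_o[e])
            out = out + weights[k].unsqueeze(-1) * y
        return out


if __name__ == "__main__":
    layer = AoESparseLayer(d_model=512, d_ff=2048, d_low=128, n_experts=8, top_k=2)
    print(layer(torch.randn(16, 512)).shape)  # -> torch.Size([16, 512])
```

In this sketch the low-rank factor `w_a` plays the role the abstract assigns to the low-rank weight factorization: it is cheap enough to evaluate for every expert on every token, and the norm of its output serves as the expert's self-evaluation score that is then compared against the other experts.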
