Autonomy-of-Experts Models

Abstract

Mixture-of-Experts (MoE) models mostly use a router to assign tokens to specific expert modules, activating only a subset of parameters and often outperforming dense models. We argue that the separation between the router's decision-making and the experts' execution is a critical yet overlooked issue, leading to suboptimal expert selection and ineffective learning. To address this, we propose Autonomy-of-Experts (AoE), a novel MoE paradigm in which experts autonomously select themselves to process inputs. AoE is based on the insight that an expert is aware of its own capacity to effectively process a token, an awareness reflected in the scale of its internal activations. In AoE, routers are removed; instead, experts pre-compute internal activations for inputs and are ranked based on their activation norms. Only the top-ranking experts proceed with the forward pass, while the others abort. The overhead of pre-computing activations is reduced through a low-rank weight factorization. This self-evaluating-then-partner-comparing approach ensures improved expert selection and effective learning. We pre-train language models with 700M to 4B parameters, demonstrating that AoE outperforms traditional MoE models with comparable efficiency.
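The selection mechanism described in the abstract can be illustrated with a minimal, router-free sketch. This is a toy under stated assumptions, not the authors' implementation. The abstract does not specify the expert architecture, which internal activation is measured, the norm, or how selected outputs are combined. Here each hypothetical expert is a ReLU feed-forward block whose first layer is factorized as `w_down @ w_up`, the ranked activation is the low-rank projection `x @ w_down`, the norm is L2, and outputs are mixed by a softmax over the selected norms. All names (`Expert`, `aoe_layer`, `precompute`, `finish`) are invented for illustration.

```python
import math
import random


def matvec(x, w):
    """Multiply row vector x (length n) by matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def l2_norm(v):
    return math.sqrt(sum(a * a for a in v))


def random_matrix(rng, rows, cols):
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class Expert:
    """Toy expert: x -> relu((x @ w_down) @ w_up) @ w_out (assumed layout)."""

    def __init__(self, rng, d_model, d_low, d_hidden):
        self.w_down = random_matrix(rng, d_model, d_low)   # low-rank factor 1
        self.w_up = random_matrix(rng, d_low, d_hidden)    # low-rank factor 2
        self.w_out = random_matrix(rng, d_hidden, d_model)

    def precompute(self, x):
        # Cheap step run by every expert: only the low-rank projection.
        return matvec(x, self.w_down)

    def finish(self, low):
        # Expensive step run only by the selected experts.
        hidden = [max(0.0, h) for h in matvec(low, self.w_up)]
        return matvec(hidden, self.w_out)


def aoe_layer(x, experts, top_k):
    """Rank experts by activation norm; only the top_k finish the forward pass."""
    lows = [e.precompute(x) for e in experts]
    norms = [l2_norm(low) for low in lows]
    ranked = sorted(range(len(experts)), key=lambda i: norms[i], reverse=True)
    chosen = ranked[:top_k]

    # Assumption: mix selected outputs with a softmax over their norms.
    peak = max(norms[i] for i in chosen)
    exps = [math.exp(norms[i] - peak) for i in chosen]
    total = sum(exps)

    out = [0.0] * len(x)
    for e, i in zip(exps, chosen):
        y = experts[i].finish(lows[i])  # reuse the pre-computed activation
        out = [o + (e / total) * yi for o, yi in zip(out, y)]
    return out, chosen, norms


rng = random.Random(0)
experts = [Expert(rng, d_model=8, d_low=2, d_hidden=16) for _ in range(4)]
token = [rng.gauss(0.0, 1.0) for _ in range(8)]
output, chosen, norms = aoe_layer(token, experts, top_k=2)
```

In this sketch the per-token cost of the unselected experts is limited to one `d_model × d_low` projection each, which is the role the abstract assigns to the low-rank factorization; the selected experts reuse that projection rather than recomputing it.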
