Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models

Abstract

Scaling the capacity of language models has consistently proven to be a reliable approach for improving performance and unlocking new capabilities. Capacity can be defined primarily along two dimensions: the number of model parameters and the compute per example. While scaling typically involves increasing both, the precise interplay between these factors and their combined contribution to overall capacity is not fully understood. We explore this relationship in the context of sparse Mixture-of-Experts (MoE) models, which allow scaling the number of parameters without proportionally increasing the FLOPs per example. We investigate how varying the sparsity level, i.e., the fraction of inactive parameters, affects model performance during pretraining and in downstream few-shot evaluation. We find that under different constraints (e.g., parameter size and total training compute), there is an optimal level of sparsity that improves both training efficiency and model performance. These results provide a better understanding of the impact of sparsity in scaling laws for MoEs and complement existing work in this area, offering insights for designing more efficient architectures.
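
To make the two capacity dimensions concrete, the following is a minimal sketch of how total parameters, active parameters, and sparsity relate in an MoE model. It is an illustration under stated assumptions, not the paper's implementation: the class MoEConfig, its field names, and the split into shared and per-expert parameters are hypothetical, and the estimate of 2 FLOPs per active parameter per token is a common rule of thumb for the forward pass rather than a figure taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    """Hypothetical MoE size description (all names are illustrative)."""

    shared_params: int      # parameters used by every token (assumed split)
    params_per_expert: int  # parameters in a single expert
    num_experts: int        # experts available in the model
    active_experts: int     # experts actually used per token

    def __post_init__(self) -> None:
        if not 0 < self.active_experts <= self.num_experts:
            raise ValueError("need 0 < active_experts <= num_experts")

    @property
    def total_params(self) -> int:
        return self.shared_params + self.num_experts * self.params_per_expert

    @property
    def active_params(self) -> int:
        return self.shared_params + self.active_experts * self.params_per_expert

    @property
    def sparsity(self) -> float:
        # Sparsity as defined in the abstract: the fraction of inactive parameters.
        return 1.0 - self.active_params / self.total_params

    @property
    def forward_flops_per_token(self) -> int:
        # Assumption: roughly 2 FLOPs per active parameter per token
        # (forward pass only, ignoring attention over the context).
        return 2 * self.active_params


if __name__ == "__main__":
    # Doubling the number of experts raises total parameters and sparsity
    # while the estimated FLOPs per token stay the same.
    for num_experts in (8, 16):
        cfg = MoEConfig(shared_params=0, params_per_expert=100,
                        num_experts=num_experts, active_experts=2)
        print(num_experts, cfg.total_params, cfg.sparsity,
              cfg.forward_flops_per_token)
```

In this sketch, adding experts while keeping the number of active experts fixed increases the parameter count and the sparsity level without changing the per-token compute estimate, which is the decoupling the abstract refers to.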
