보석: 다면적 스케일링 법칙을 위한 모델 스위트

초록

일반적으로 스케일링 법칙은 좁은 범위의 고정 하이퍼파라미터 선택을 사용하는 모델 패밀리를 이용하여 적합시킵니다. 본 연구에서는 다양한 구조와 하이퍼파라미터 선택을 사용하여 스케일링 법칙을 연구하고, 이들이 결과적으로 어떤 처방에 영향을 미치는지 강조합니다. 우리 연구의 주요 산물로서, 우리는 지금까지 가장 포괄적인 오픈 소스 스케일링 법칙 데이터셋인 Gemstones을 공개합니다. 이 데이터셋은 20억 개의 파라미터까지 가지는 트랜스포머에서 4000개 이상의 체크포인트로 구성되어 있습니다. 이러한 모델들은 서로 다른 학습률, 쿨다운 일정 및 구조적 형태로 훈련되었습니다. 우리의 체크포인트들은 모델 너비와 깊이의 함수로 언어 모델링 성능을 예측하는 법칙과 같은 스케일링에 대한 더 복잡한 연구를 가능하게 합니다. 우리 모델 스위트의 다양한 측면을 조사함으로써, 스케일링 법칙의 처방이 실험 설계 과정 및 적합 중 사용된 특정 모델 체크포인트에 매우 민감할 수 있다는 것을 발견합니다. 코드: https://github.com/mcleish7/gemstone-scaling-laws

English

Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting. Code: https://github.com/mcleish7/gemstone-scaling-laws

보석: 다면적 스케일링 법칙을 위한 모델 스위트

Gemstones: A Model Suite for Multi-Faceted Scaling Laws

초록

Summary

Support