Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization

Abstract

The pre-training phase of language models often begins with randomly initialized parameters. As models continue to scale, training their large number of parameters can be extremely slow and costly. Small language models, by contrast, are less expensive to train but often cannot match the accuracy of large models. In this paper, we explore an idea that connects these two regimes: can we develop a method to initialize large language models using smaller pre-trained models, and will such initialization bring any benefits in training time and final accuracy? We introduce HyperCloning, a method that expands the parameters of a pre-trained language model to those of a larger model with increased hidden dimensions. Our method ensures that the larger model retains the functionality of the smaller model. As a result, the larger model inherits the predictive power and accuracy of the smaller model before training starts. We demonstrate that training such an initialized model yields significant savings in the GPU hours required for pre-training large language models.
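The abstract does not spell out the expansion rule, so the following is only a minimal sketch of one way a function-preserving width expansion can work for a single linear layer. It assumes the weight matrix is tiled and rescaled, and that the wider layer receives the original input repeated. The helper names `matvec` and `expand_linear` are hypothetical and are not taken from the paper's code.

```python
import random


def matvec(weight, x, bias):
    """Compute weight @ x + bias for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def expand_linear(weight, bias, factor=2):
    """Tile a (d_out x d_in) layer into (factor*d_out x factor*d_in).

    Assumption (illustrative, not confirmed by the abstract): every tiled
    weight is divided by `factor`, so an input repeated `factor` times
    produces the original output repeated `factor` times.
    """
    # Widen each row: repeat its scaled entries `factor` times along the input axis.
    wide_rows = [[w / factor for w in row] * factor for row in weight]
    # Stack `factor` copies of the widened rows along the output axis.
    big_weight = [list(row) for _ in range(factor) for row in wide_rows]
    big_bias = list(bias) * factor
    return big_weight, big_bias


random.seed(0)
d_in, d_out = 4, 3
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
b = [random.gauss(0, 1) for _ in range(d_out)]
x = [random.gauss(0, 1) for _ in range(d_in)]

W_big, b_big = expand_linear(W, b)
y_small = matvec(W, x, b)            # output of the small layer
y_big = matvec(W_big, x * 2, b_big)  # wide layer fed the repeated input

# The wide layer reproduces the small layer's output, repeated twice.
print(all(abs(p - q) < 1e-9 for p, q in zip(y_big, y_small * 2)))
```

Because the wide layer starts out computing the same function as the small one, later training can diverge from this starting point rather than from a random one. A full model would need the same kind of care for attention, normalization, and embedding layers, which this sketch does not cover.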

