MiniPLM: Knowledge Distillation for Pre-Training Language Models

Abstract

Knowledge distillation (KD) is widely used to train small, high-performing student language models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training faces challenges in efficiency, flexibility, and effectiveness. Existing methods either incur high computational costs due to online teacher inference, require tokenization matching between teacher and student LMs, or risk losing the difficulty and diversity of the teacher-generated training data. To address these issues, we propose MiniPLM, a KD framework for pre-training LMs by refining the training data distribution with the teacher's knowledge. For efficiency, MiniPLM performs offline teacher LM inference, allowing KD for multiple student LMs without adding training-time costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families. For effectiveness, MiniPLM leverages the differences between large and small LMs to enhance the difficulty and diversity of the training data, helping student LMs acquire versatile and sophisticated knowledge. Extensive experiments demonstrate that MiniPLM boosts the student LMs' performance on 9 widely used downstream tasks, improves their language modeling capabilities, and reduces pre-training computation. The benefit of MiniPLM extends to large pre-training scales, as evidenced by the extrapolation of the scaling curves. Further analysis reveals that MiniPLM supports KD across model families and enhances the utilization of pre-training data. Our model, code, and data are available at https://github.com/thu-coai/MiniPLM.
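
The abstract does not spell out how the training data distribution is refined, so the following is only a minimal sketch of one way the idea could look. It assumes that per-document log-likelihoods under a large LM and a small LM have already been computed offline, and it keeps the documents on which the large LM's log-likelihood exceeds the small LM's by the widest margin. The function names, the log-likelihood difference score, and the keep_ratio parameter are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Sequence


def difference_scores(
    large_lm_logprobs: Sequence[float],
    small_lm_logprobs: Sequence[float],
) -> List[float]:
    """Score each document by (large LM log-likelihood - small LM log-likelihood).

    Both inputs are assumed to be precomputed offline, one value per document.
    """
    if len(large_lm_logprobs) != len(small_lm_logprobs):
        raise ValueError("expected one log-likelihood per document from each LM")
    return [a - b for a, b in zip(large_lm_logprobs, small_lm_logprobs)]


def select_top_fraction(
    docs: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float,
) -> List[str]:
    """Keep the highest-scoring fraction of documents, preserving corpus order."""
    if len(docs) != len(scores):
        raise ValueError("expected one score per document")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(docs) * keep_ratio))
    # Rank document indices by score, highest first, then restore original order.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])
    return [docs[i] for i in kept]


# Toy corpus with made-up log-likelihoods (hypothetical numbers for illustration).
docs = ["a", "b", "c", "d"]
large = [-10.0, -20.0, -5.0, -30.0]
small = [-12.0, -40.0, -5.0, -31.0]

scores = difference_scores(large, small)
refined = select_top_fraction(docs, scores, keep_ratio=0.5)
print(refined)
```

Because the sketch only reads precomputed scores and returns a filtered corpus, the same refined data could be reused for any number of student LMs, regardless of tokenizer, which is the property the abstract attributes to offline, corpus-level KD.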
