Pre-training Distillation for Large Language Models: A Design Space Exploration
October 21, 2024
Authors: Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li
cs.AI
Abstract
Knowledge distillation (KD) aims to transfer knowledge from a large teacher
model to a smaller student model. Previous work applying KD in the field of
large language models (LLMs) typically focused on the post-training phase,
where the student LLM learns directly from instructions and corresponding
responses generated by the teacher model. In this paper, we extend KD to the
pre-training phase of LLMs, named pre-training distillation (PD). We first
conduct a preliminary experiment using GLM-4-9B as the teacher LLM to distill a
1.9B parameter student LLM, validating the effectiveness of PD. Considering the
key impact factors of distillation, we systematically explore the design space
of pre-training distillation across four aspects: logits processing, loss
selection, scaling law, and offline or online logits. We conduct extensive
experiments to explore the design space of pre-training distillation and find
better configurations and interesting conclusions, such as larger student LLMs
generally benefiting more from pre-training distillation, while a larger
teacher LLM does not necessarily guarantee better results. We hope our
exploration of the design space will inform future practices in pre-training
distillation.
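For concreteness, the following is a minimal sketch of what a pre-training distillation objective can look like: the student is trained on the ordinary next-token language-modeling loss plus a divergence between its token distribution and the teacher's logits. The mixing weight `alpha`, the temperature `T`, and the use of a KL-divergence term here are illustrative assumptions, not the paper's final configuration; the paper treats logits processing and loss selection as open design dimensions.

```python
# Minimal sketch of a pre-training distillation loss (illustrative only).
# `alpha`, `T`, and the KL-based distillation term are assumptions for
# illustration; the paper explores logits processing and loss selection
# as design choices rather than fixing them to this form.
import torch
import torch.nn.functional as F


def pretraining_distillation_loss(student_logits, teacher_logits, target_ids,
                                  alpha=0.5, T=1.0):
    """Combine the standard language-modeling loss with a distillation term.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    target_ids: (batch, seq_len) next-token labels from the pre-training corpus
    """
    vocab = student_logits.size(-1)

    # Standard next-token cross-entropy on the ground-truth corpus tokens.
    lm_loss = F.cross_entropy(student_logits.reshape(-1, vocab),
                              target_ids.reshape(-1))

    # Distillation term: KL divergence between the student's and the
    # temperature-scaled teacher's token distributions.
    student_logp = F.log_softmax(student_logits / T, dim=-1)
    teacher_p = F.softmax(teacher_logits / T, dim=-1)
    kd_loss = F.kl_div(student_logp, teacher_p,
                       reduction="batchmean") * (T ** 2)

    # Weighted combination of corpus supervision and teacher supervision.
    return alpha * lm_loss + (1.0 - alpha) * kd_loss
```

In an offline-logits setup, `teacher_logits` would be precomputed and stored for the corpus; in an online setup, they are produced by a forward pass of the (frozen) teacher during student training. This distinction is one of the four design-space aspects the paper studies.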