Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
December 18, 2024
Authors: Pengxiang Li, Lu Yin, Shiwei Liu
cs.AI
Abstract
Large Language Models (LLMs) have achieved remarkable success, yet recent
findings reveal that their deeper layers often contribute minimally and can be
pruned without affecting overall performance. While some view this as an
opportunity for model compression, we identify it as a training shortfall
rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We
demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads
to diminished gradient norms in its deeper layers, reducing their
effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger
gradient norms in deeper layers but suffers from vanishing gradients in earlier
layers. To address this, we introduce Mix-LN, a novel normalization technique
that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN
applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring
more uniform gradients across layers. This allows all parts of the
network--both shallow and deep layers--to contribute effectively to training.
Extensive experiments with various model sizes from 70M to 7B demonstrate that
Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more
balanced, healthier gradient norms throughout the network, and enhancing the
overall quality of LLM pre-training. Furthermore, we demonstrate that models
pre-trained with Mix-LN learn better compared to those using Pre-LN or Post-LN
during supervised fine-tuning (SFT) and reinforcement learning from human
feedback (RLHF), highlighting the critical importance of high-quality deep
layers. By effectively addressing the inefficiencies of deep layers in current
LLMs, Mix-LN unlocks their potential, enhancing model capacity without
increasing model size. Our code is available at
https://github.com/pixeli99/MixLN.
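To make the layer-wise split concrete, below is a minimal PyTorch sketch of the Mix-LN idea as described in the abstract: Post-LN blocks for the earlier layers and Pre-LN blocks for the deeper ones. The helper name `build_mix_ln_stack`, the split fraction `post_ln_fraction` (defaulting to 25% here), and the block internals are illustrative assumptions rather than the authors' implementation; see https://github.com/pixeli99/MixLN for the official code.

```python
# A minimal sketch of the Mix-LN idea from the abstract, not the authors'
# implementation: Post-LN is used for the earlier transformer blocks and
# Pre-LN for the deeper ones. The split fraction `post_ln_fraction` and the
# block internals below are illustrative assumptions; see
# https://github.com/pixeli99/MixLN for the official code.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    """One decoder-style block that runs in either Pre-LN or Post-LN mode."""

    def __init__(self, d_model: int, n_heads: int, use_post_ln: bool):
        super().__init__()
        self.use_post_ln = use_post_ln
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_post_ln:
            # Post-LN: normalize after each residual addition.
            x = self.ln1(x + self.attn(x, x, x, need_weights=False)[0])
            x = self.ln2(x + self.mlp(x))
        else:
            # Pre-LN: normalize before each sub-layer.
            h = self.ln1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.mlp(self.ln2(x))
        return x


def build_mix_ln_stack(n_layers: int, d_model: int, n_heads: int,
                       post_ln_fraction: float = 0.25) -> nn.ModuleList:
    """Post-LN for the first `post_ln_fraction` of layers, Pre-LN for the rest."""
    n_post = int(n_layers * post_ln_fraction)
    return nn.ModuleList(
        [TransformerBlock(d_model, n_heads, use_post_ln=(i < n_post))
         for i in range(n_layers)]
    )


if __name__ == "__main__":
    # Example: a 12-layer stack where the first 3 blocks use Post-LN.
    layers = build_mix_ln_stack(n_layers=12, d_model=256, n_heads=4)
    x = torch.randn(2, 16, 256)  # (batch, sequence, d_model)
    for block in layers:
        x = block(x)
    print(x.shape)  # torch.Size([2, 16, 256])
```

The intent of the hybrid split, per the abstract, is that Post-LN in the early layers avoids their gradients being dominated by the residual stream, while Pre-LN in the deeper layers prevents the vanishing gradients Post-LN suffers from, yielding more uniform gradient norms across depth.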