Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

Abstract

Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in the deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network, both shallow and deep layers, to contribute effectively to training. Extensive experiments with model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn better than those using Pre-LN or Post-LN during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
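
The layer arrangement described in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation (that is in the repository linked above); it only shows the idea of using Post-LN in the earlier layers and Pre-LN in the deeper ones. The names `mix_ln_schedule` and `mix_ln_forward`, the parameter `post_ln_fraction`, the toy sublayer, and the value 0.25 are all assumptions made for illustration, and the layer norm here has no learned scale or shift.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x: Vector, sublayer: Sublayer) -> Vector:
    """Pre-LN: normalize the sublayer input; the residual path stays unnormalized."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]


def post_ln_block(x: Vector, sublayer: Sublayer) -> Vector:
    """Post-LN: add the residual first, then normalize the sum."""
    update = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, update)])


def mix_ln_schedule(num_layers: int, post_ln_fraction: float) -> List[str]:
    """Assign "post" to the earliest layers and "pre" to the remaining deeper ones.

    post_ln_fraction is a hypothetical knob for this sketch, not a value
    taken from the abstract.
    """
    num_post = int(post_ln_fraction * num_layers)
    return ["post" if i < num_post else "pre" for i in range(num_layers)]


def mix_ln_forward(x: Vector, sublayers: List[Sublayer],
                   post_ln_fraction: float) -> Vector:
    """Run a stack of sublayers, choosing the block type per layer index."""
    schedule = mix_ln_schedule(len(sublayers), post_ln_fraction)
    for kind, sublayer in zip(schedule, sublayers):
        block = post_ln_block if kind == "post" else pre_ln_block
        x = block(x, sublayer)
    return x


def toy_sublayer(x: Vector) -> Vector:
    """Stand-in for attention or an MLP: a fixed element-wise scaling."""
    return [0.5 * v for v in x]


if __name__ == "__main__":
    print(mix_ln_schedule(8, 0.25))
    print(mix_ln_forward([1.0, 2.0, 3.0, 4.0], [toy_sublayer] * 8, 0.25))
```

In a real Transformer each sublayer would be an attention or feed-forward module with learned normalization parameters; the only point of the sketch is that the choice between the two residual arrangements depends on the layer's depth.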
