Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models
February 21, 2025
Authors: Ya Wang, Zhijian Zhuo, Yutao Zeng, Xun Zhou, Jian Yang, Xiaoqing Li
cs.AI
Abstract
Training stability is a persistent challenge in the pre-training of large
language models (LLMs), particularly for architectures such as Post-Norm
Transformers, which are prone to gradient explosion and dissipation. In this
paper, we propose Scale-Distribution Decoupling (SDD), a novel approach that
stabilizes training by explicitly decoupling the scale and distribution of the
weight matrix in fully-connected layers. SDD applies a normalization mechanism
to regulate activations and a learnable scaling vector to maintain
well-conditioned gradients, effectively preventing gradient explosion
and dissipation. This separation improves optimization efficiency,
particularly in deep networks, by ensuring stable gradient propagation.
Experimental results demonstrate that our method stabilizes training across
various LLM architectures and outperforms existing techniques in different
normalization configurations. Furthermore, the proposed method is lightweight
and compatible with existing frameworks, making it a practical solution for
stabilizing LLM training. Code is available at https://github.com/kaihemo/SDD.
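The abstract describes SDD as decoupling the scale and distribution of fully-connected layers by combining a normalization mechanism for activations with a learnable scaling vector. Below is a minimal PyTorch sketch of one plausible reading of that description; the module name SDDLinear, the choice of RMS normalization, and the initialization are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

# Hypothetical sketch: normalize the linear layer's pre-activation so its
# distribution is regulated, and reintroduce magnitude through a separate
# learnable per-channel scale. Illustrative only; not the official SDD code.
import torch
import torch.nn as nn


class SDDLinear(nn.Module):
    """Linear layer with output distribution regulated by RMS normalization
    and scale carried by a separate learnable vector (assumed formulation)."""

    def __init__(self, in_features: int, out_features: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=in_features ** -0.5)
        # Learnable scale, decoupled from the normalized activation distribution.
        self.scale = nn.Parameter(torch.ones(out_features))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = x @ self.weight.t()                                   # standard matmul
        rms = y.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return y * rms * self.scale                               # normalized output * learned scale


if __name__ == "__main__":
    layer = SDDLinear(512, 512)
    out = layer(torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 16, 512])

In this reading, the normalization keeps the activation distribution well behaved regardless of how the weight matrix drifts during training, while the scaling vector carries the magnitude separately, which is one way the stated goal of well-conditioned gradients could be realized; consult the paper and repository for the actual formulation.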