
The Curse of Depth in Large Language Models

February 9, 2025
Authors: Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
cs.AI

Abstract

In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) that nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs, such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, both theoretical and empirical, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread use of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance grows exponentially with model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, so that these blocks barely contribute to training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of the output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement carries over seamlessly to supervised fine-tuning. All these gains can be attributed to LayerNorm Scaling enabling deeper layers to contribute more effectively during training.
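
For readers who want the core idea in code, below is a minimal sketch of the scaling rule described in the abstract, written against PyTorch: the layer-normalization output of layer l is multiplied by 1/sqrt(l). The class name ScaledLayerNorm and its constructor arguments are illustrative assumptions for this sketch, not the authors' reference implementation.

```python
# Minimal sketch of LayerNorm Scaling as described in the abstract:
# the output of each layer normalization is multiplied by 1/sqrt(l),
# where l is the 1-indexed depth of the Transformer layer.
# Names here are illustrative, not the paper's official code.
import math
import torch
import torch.nn as nn


class ScaledLayerNorm(nn.Module):
    """LayerNorm whose output is damped by 1/sqrt(layer_index)."""

    def __init__(self, hidden_size: int, layer_index: int, eps: float = 1e-5):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size, eps=eps)
        # The factor shrinks for deeper layers, curbing the exponential
        # growth of output variance under Pre-LN.
        self.scale = 1.0 / math.sqrt(layer_index)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x) * self.scale


# Example: a depth-aware norm for the 12th Transformer layer.
if __name__ == "__main__":
    hidden_size, layer_index = 512, 12
    ln_scaled = ScaledLayerNorm(hidden_size, layer_index)
    x = torch.randn(2, 16, hidden_size)
    print(ln_scaled(x).shape)  # torch.Size([2, 16, 512])
```

In a Pre-LN Transformer, this module would replace the plain LayerNorm applied before the attention and feed-forward sublayers of each block, with layer_index set to that block's depth.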
