
Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

November 4, 2024
Authors: Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-p% sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., 1-sparsity ratio) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
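The abstract defines the activation ratio as 1 minus the sparsity ratio but does not spell out the measurement procedure. As a rough illustration only (not the paper's PPL-p% metric, which selects its threshold in a performance-aware way), the sketch below counts near-zero entries in the intermediate tensor of a SwiGLU-style feed-forward block. The fixed threshold `eps`, the `GatedFFN` layout, and the choice to measure after gating are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of measuring activation sparsity in a gated FFN block.
# NOT the paper's PPL-p% metric: the fixed threshold `eps`, the module
# layout, and the measurement point are illustrative assumptions.

import torch
import torch.nn as nn


class GatedFFN(nn.Module):
    """A SwiGLU-style feed-forward block, as used in many decoder-only LLMs."""

    def __init__(self, d_model: int, d_ff: int, activation: nn.Module):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.activation = activation  # e.g. nn.SiLU() or nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Keep the gated intermediate tensor (the input to the down
        # projection) so its sparsity can be inspected afterwards.
        self.hidden = self.activation(self.gate(x)) * self.up(x)
        return self.down(self.hidden)


def activation_ratio(hidden: torch.Tensor, eps: float = 1e-3) -> float:
    """Fraction of intermediate entries whose magnitude exceeds `eps`.

    Activation sparsity is then 1 - activation_ratio, matching the
    abstract's definition of the activation ratio as 1 - sparsity ratio.
    """
    return (hidden.abs() > eps).float().mean().item()


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(4, 16, 256)  # (batch, sequence, d_model)
    for act in (nn.ReLU(), nn.SiLU()):
        ffn = GatedFFN(d_model=256, d_ff=1024, activation=act)
        ffn(x)
        ratio = activation_ratio(ffn.hidden)
        print(f"{act.__class__.__name__}: activation ratio ≈ {ratio:.3f}, "
              f"sparsity ≈ {1 - ratio:.3f}")
```

Note that with a hard magnitude threshold, a ReLU-activated block produces exact zeros and thus a well-defined sparsity, while SiLU outputs are merely small in magnitude, which is presumably why the paper proposes a precise, performance-aware metric applicable to any activation function.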

