

Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

November 4, 2024
Authors: Yuqi Luo, Chenyang Song, Xu Han, Yingfa Chen, Chaojun Xiao, Zhiyuan Liu, Maosong Sun
cs.AI

Abstract

Activation sparsity denotes the existence of substantial weakly-contributed elements within activation outputs that can be eliminated, benefiting many important applications concerned with large language models (LLMs). Although promoting greater activation sparsity within LLMs deserves deep studies, existing works lack comprehensive and quantitative research on the correlation between activation sparsity and potentially influential factors. In this paper, we present a comprehensive study on the quantitative scaling properties and influential factors of the activation sparsity within decoder-only Transformer-based LLMs. Specifically, we propose PPL-p% sparsity, a precise and performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. Firstly, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., 1-sparsity ratio) evolves as a convergent increasing power-law and decreasing logspace power-law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These demonstrate that ReLU is more efficient as the activation function than SiLU and can leverage more training data to improve activation sparsity. Secondly, the activation ratio linearly increases with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws towards LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
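To make the abstract's central quantity concrete, the sketch below estimates the activation ratio (i.e., 1 - sparsity ratio) of a gated feed-forward layer by counting the fraction of intermediate neurons whose magnitude exceeds a small threshold. This is a minimal illustration only: the toy dimensions, random weights, and the fixed threshold `eps` are assumptions, and this is not the paper's PPL-p% metric, which instead selects thresholds in a performance-aware way so that perplexity degrades by no more than p%.

```python
# Minimal sketch (not the paper's implementation): estimate the activation
# ratio of a gated feed-forward block via magnitude thresholding.
# Dimensions, weights, inputs, and `eps` are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model, d_ff, n_tokens = 512, 2048, 1024
eps = 1e-3  # magnitude below which a neuron is treated as inactive

# Toy gated FFN weights, as in SiLU/ReLU-gated Transformer blocks.
W_gate = torch.randn(d_model, d_ff) / d_model ** 0.5
W_up = torch.randn(d_model, d_ff) / d_model ** 0.5

x = torch.randn(n_tokens, d_model)

def activation_ratio(act_fn):
    """Fraction of intermediate neurons with activation magnitude above eps."""
    hidden = act_fn(x @ W_gate) * (x @ W_up)  # gated intermediate activations
    return (hidden.abs() > eps).float().mean().item()

print("SiLU activation ratio:", activation_ratio(F.silu))
print("ReLU activation ratio:", activation_ratio(F.relu))
```

With ReLU the ratio is well defined even at eps = 0 because of exact zeros, whereas SiLU requires some cutoff choice; this is one reason a precise, performance-aware metric such as the paper's PPL-p% sparsity is preferable to a fixed ad hoc threshold.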

