Sparsing Law: Towards Large Language Models with Greater Activation Sparsity

Abstract

Activation sparsity refers to the presence of many weakly contributing elements in activation outputs that can be eliminated, which benefits many important applications of large language models (LLMs). Although promoting greater activation sparsity in LLMs deserves in-depth study, existing work lacks comprehensive, quantitative research on how activation sparsity correlates with potentially influential factors. In this paper, we present a comprehensive study of the quantitative scaling properties and influential factors of activation sparsity in decoder-only Transformer-based LLMs. Specifically, we propose PPL-p% sparsity, a precise, performance-aware activation sparsity metric that is applicable to any activation function. Through extensive experiments, we find several important phenomena. First, different activation functions exhibit comparable performance but opposite training-time sparsity trends. The activation ratio (i.e., 1 − sparsity ratio) evolves as a convergent increasing power law and a decreasing logspace power law with the amount of training data for SiLU-activated and ReLU-activated LLMs, respectively. These results indicate that ReLU is more efficient than SiLU as an activation function and can leverage more training data to improve activation sparsity. Second, the activation ratio increases linearly with the width-depth ratio below a certain bottleneck point, indicating the potential advantage of a deeper architecture at a fixed parameter scale. Finally, at similar width-depth ratios, we surprisingly find that the limit value of activation sparsity varies only weakly with the parameter scale, i.e., the activation patterns within LLMs are insensitive to the parameter scale. These empirical laws for LLMs with greater activation sparsity have important implications for making LLMs more efficient and interpretable.
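To make the terms "activation ratio" and "performance-aware" more concrete, the following is a minimal sketch and not the paper's implementation. It assumes that an element counts as active when its magnitude exceeds a threshold, and that a PPL-p%-style threshold can be read as the largest threshold whose perplexity stays within p% of the dense model's. The abstract does not spell out the procedure, so all names here (`activation_ratio`, `largest_threshold_within_budget`, `ppl_at_threshold`) are hypothetical.

```python
import math
from typing import Callable, Sequence


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def activation_ratio(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of elements whose magnitude exceeds `threshold`.

    Under the assumption above, this equals 1 - sparsity ratio.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if abs(a) > threshold)
    return active / len(activations)


def largest_threshold_within_budget(
    ppl_at_threshold: Callable[[float], float],
    p_percent: float,
    hi: float,
    steps: int = 40,
) -> float:
    """Bisect for the largest threshold that keeps perplexity within budget.

    `ppl_at_threshold(t)` is a hypothetical callback returning the model's
    perplexity when elements with magnitude <= t are zeroed. It is assumed
    to be non-decreasing in t, with t = 0 giving the dense perplexity.
    """
    budget = ppl_at_threshold(0.0) * (1.0 + p_percent / 100.0)
    if ppl_at_threshold(hi) <= budget:
        return hi  # even the largest candidate threshold is within budget
    lo = 0.0
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if ppl_at_threshold(mid) <= budget:
            lo = mid  # still within budget: try a more aggressive threshold
        else:
            hi = mid  # too much degradation: back off
    return lo


# Toy pre-activations shared by both activation functions.
pre_activations = [-2.0, -0.5, 0.0, 0.5, 2.0, -1.0, 3.0, -3.0]
relu_ratio = activation_ratio([relu(x) for x in pre_activations])
silu_ratio = activation_ratio([silu(x) for x in pre_activations])

# Toy perplexity curve (purely illustrative): PPL grows linearly with t.
toy_threshold = largest_threshold_within_budget(
    lambda t: 10.0 * (1.0 + t), p_percent=1.0, hi=1.0
)
```

With a zero threshold, ReLU outputs contain exact zeros while SiLU outputs are almost never exactly zero, which is one reason a threshold tied to a perplexity budget is useful when comparing activation functions. In a real setting the callback would evaluate an actual model on held-out text; the linear toy curve is only a stand-in.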
