A Refined Analysis of Massive Activations in LLMs

March 28, 2025
Authors: Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra
cs.AI

Abstract

Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e. suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.
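
As context for one of the mitigation components named above, Dynamic Tanh (DyT) is generally described as an element-wise tanh squashing with a learnable steepness scalar and per-channel affine parameters. The sketch below is a minimal, illustrative PyTorch rendering of that idea under those assumptions; the class name, default initialization, and tensor shapes are ours, not taken from the paper or its released code (see the repository linked above for the actual implementation).

```python
import torch
import torch.nn as nn


class DynamicTanh(nn.Module):
    """Element-wise tanh squashing with learnable scale/shift, in the spirit of
    Dynamic Tanh (DyT). Names and initialization values are illustrative assumptions."""

    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init_alpha))  # learnable scalar steepness
        self.gamma = nn.Parameter(torch.ones(dim))            # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))            # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # tanh bounds each activation to (-1, 1), which caps extreme ("massive") values
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


# Example usage: applied where a normalization layer would otherwise sit
x = torch.randn(2, 16, 512)   # (batch, seq_len, hidden_dim) -- hypothetical shape
dyt = DynamicTanh(dim=512)
y = dyt(x)                    # same shape; each channel bounded by |gamma| + |beta|
```

The appeal of such a bounded activation in this context is that the output magnitude is limited by the learned affine parameters rather than by the raw pre-activation values, which is one way a hybrid strategy can keep activations in a range friendly to low-precision arithmetic.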
