Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

Abstract

Structured State Space Models (SSMs) have emerged as alternatives to transformers. While SSMs are often regarded as effective at capturing long-sequence dependencies, we rigorously demonstrate that they are inherently limited by a strong recency bias. Our empirical studies also reveal that this bias impairs the models' ability to recall distant information and introduces robustness issues. Our scaling experiments then show that deeper structures in SSMs can facilitate the learning of long contexts. However, subsequent theoretical analysis reveals that as SSMs increase in depth, they exhibit another inevitable tendency toward over-smoothing, in which token representations become increasingly indistinguishable. This fundamental dilemma between recency and over-smoothing hinders the scalability of existing SSMs. Inspired by our theoretical findings, we propose to polarize two channels of the state transition matrices in SSMs, setting them to zero and one, respectively, which addresses recency bias and over-smoothing simultaneously. Experiments demonstrate that our polarization technique consistently enhances the associative recall accuracy of long-range tokens and allows SSMs to benefit further from deeper architectures. All source code is released at https://github.com/VITA-Group/SSM-Bottleneck.
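The recency bias and the polarization idea described in the abstract can be illustrated with a toy example. The sketch below is not the released implementation: it assumes a simplified diagonal recurrence h_t = a * h_{t-1} + b * x_t per channel, and the names `influence`, `run_channel`, and `polarize`, as well as the choice of which channel positions to pin, are hypothetical. Under that assumption, an input seen `gap` steps ago contributes with weight b * a^gap, so any 0 < a < 1 forgets distant tokens exponentially fast. One plausible reading of the abstract is that a channel pinned to one carries distant information undiminished, while a channel pinned to zero passes the current token through without averaging over the past.

```python
def influence(a, b, gap):
    """Weight of an input seen `gap` steps ago on the current state,
    for the assumed toy recurrence h_t = a * h_{t-1} + b * x_t."""
    return b * a ** gap


def run_channel(a, b, xs):
    """Run the toy recurrence over the sequence xs and return the final state."""
    h = 0.0
    for x in xs:
        h = a * h + b * x
    return h


def polarize(decays):
    """Return a copy of the per-channel decay values with one channel pinned
    to 0.0 and one pinned to 1.0. Pinning the first and last positions is an
    arbitrary choice made for this illustration."""
    if len(decays) < 2:
        raise ValueError("need at least two channels to polarize")
    out = list(decays)
    out[0] = 0.0   # keeps only the current token (no smoothing over history)
    out[-1] = 1.0  # keeps every past token with undiminished weight
    return out


if __name__ == "__main__":
    # A signal at position 0 followed by 100 zeros.
    xs = [5.0] + [0.0] * 100
    for a in polarize([0.9, 0.9, 0.9]):
        print(f"a={a:.1f}  final state={run_channel(a, 1.0, xs):.6f}")
```

In this toy setting the ordinary channel (a = 0.9) has almost entirely lost the early signal after 100 steps, the channel pinned to one still holds it, and the channel pinned to zero reflects only the most recent input.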
