Understanding and Mitigating Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing
December 31, 2024
Authors: Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pan Li
cs.AI
Abstract
Structured State Space Models (SSMs) have emerged as alternatives to
transformers. While SSMs are often regarded as effective in capturing
long-sequence dependencies, we rigorously demonstrate that they are inherently
limited by strong recency bias. Our empirical studies also reveal that this
bias impairs the models' ability to recall distant information and introduces
robustness issues. Our scaling experiments further show that deeper
structures in SSMs can facilitate the learning of long contexts. However,
subsequent theoretical analysis reveals that as SSMs increase in depth, they
exhibit another inevitable tendency toward over-smoothing, i.e., token
representations becoming increasingly indistinguishable. This fundamental
dilemma between recency and over-smoothing hinders the scalability of existing
SSMs. Inspired by our theoretical findings, we propose to polarize two channels
of the state transition matrices in SSMs, setting them to zero and one,
respectively, which simultaneously addresses recency bias and over-smoothing.
Experiments demonstrate that our polarization technique consistently enhances
the associative recall accuracy of long-range tokens and enables SSMs to
benefit further from deeper architectures. All source code is released at
https://github.com/VITA-Group/SSM-Bottleneck.
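To make the polarization idea concrete, below is a minimal sketch, not the authors' implementation (see the linked repository for the official code), of a diagonal SSM recurrence in which one channel of the per-step transition vector is pinned to 0 and another to 1. The tensor shapes, the sigmoid parameterization of the transition values, and the toy scalar input are illustrative assumptions. Intuitively, the zero channel forgets the past at every step and acts purely locally, while the one channel never decays and so can carry information from arbitrarily distant tokens.

```python
import torch

def polarized_diag_ssm(x, A_log, B, C):
    """Toy diagonal SSM scan with two polarized transition channels.

    Recurrence: h_t = a * h_{t-1} + B_t * u_t,  y_t = <C_t, h_t>

    Channel 0 of the transition vector `a` is pinned to 0 (resets every
    step) and channel 1 is pinned to 1 (never decays). Shapes are
    illustrative assumptions:
        x: (T, d) input tokens, A_log: (N,), B: (T, N), C: (T, N)
    """
    T = x.shape[0]
    a = torch.sigmoid(A_log).clone()   # transition values in (0, 1)
    a[0] = 0.0                         # polarized "zero" channel
    a[1] = 1.0                         # polarized "one" channel

    h = torch.zeros_like(a)
    ys = []
    for t in range(T):
        u = x[t].mean()                # collapse features to a toy scalar input
        h = a * h + B[t] * u           # elementwise (diagonal) state update
        ys.append(torch.dot(C[t], h))  # readout
    return torch.stack(ys)             # (T,)


# Tiny usage example with random data.
T, d, N = 16, 8, 4
y = polarized_diag_ssm(torch.randn(T, d), torch.randn(N), torch.randn(T, N), torch.randn(T, N))
print(y.shape)  # torch.Size([16])
```

In this sketch only the two extreme channels are fixed; presumably the remaining channels keep their learned, input-dependent transition values as in a standard SSM layer.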