Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models
November 1, 2024
Authors: Aashiq Muhamed, Mona Diab, Virginia Smith
cs.AI
Abstract
Understanding and mitigating the potential risks associated with foundation
models (FMs) hinges on developing effective interpretability methods. Sparse
Autoencoders (SAEs) have emerged as a promising tool for disentangling FM
representations, but they struggle to capture rare, yet crucial concepts in the
data. We introduce Specialized Sparse Autoencoders (SSAEs), designed to
illuminate these elusive dark matter features by focusing on specific
subdomains. We present a practical recipe for training SSAEs, demonstrating the
efficacy of dense retrieval for data selection and the benefits of Tilted
Empirical Risk Minimization as a training objective to improve concept recall.
Our evaluation of SSAEs on standard metrics, such as downstream perplexity and
L_0 sparsity, shows that they effectively capture subdomain tail concepts,
exceeding the capabilities of general-purpose SAEs. We showcase the practical
utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs
achieve a 12.5% increase in worst-group classification accuracy when applied
to remove spurious gender information. SSAEs provide a powerful new lens for
peering into the inner workings of FMs in subdomains.
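The two ingredients of the training recipe described in the abstract, a sparse autoencoder loss and a Tilted Empirical Risk Minimization (TERM) aggregation that upweights rare, high-loss examples, can be sketched as below. This is a minimal illustrative sketch, not the paper's implementation: all names, dimensions, coefficients, and the toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Minimal sparse autoencoder (SAE) forward pass ---
# Hypothetical sizes: d_model FM activations, d_dict overcomplete dictionary.
d_model, d_dict = 8, 32
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)

def sae_losses(x, l1_coef=1e-3):
    """Per-example SAE loss: reconstruction error + L1 sparsity penalty."""
    f = np.maximum(0.0, x @ W_enc + b_enc)      # sparse features (ReLU)
    x_hat = f @ W_dec + b_dec                   # reconstruction
    recon = ((x - x_hat) ** 2).sum(axis=1)      # per-example squared error
    sparsity = l1_coef * np.abs(f).sum(axis=1)  # encourages low L_0
    return recon + sparsity

def tilted_risk(losses, t):
    """TERM aggregation: for t > 0 this upweights high-loss (tail) examples,
    and as t -> 0 it recovers the ordinary empirical mean.
    Computed in log-sum-exp form for numerical stability."""
    losses = np.asarray(losses, dtype=float)
    if t == 0.0:
        return losses.mean()
    m = (t * losses).max()  # stabilizer
    return (m + np.log(np.mean(np.exp(t * losses - m)))) / t

x = rng.normal(size=(16, d_model))      # toy batch standing in for FM activations
per_example = sae_losses(x)
erm = tilted_risk(per_example, t=0.0)   # standard (mean) objective
term = tilted_risk(per_example, t=2.0)  # tilted objective: emphasizes rare concepts
```

Minimizing `term` instead of `erm` pushes the dictionary to reconstruct the hardest (rarest) examples well, which is the mechanism the abstract credits for improved tail-concept recall.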