
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models

November 1, 2024
Authors: Aashiq Muhamed, Mona Diab, Virginia Smith
cs.AI

Abstract

Understanding and mitigating the potential risks associated with foundation models (FMs) hinges on developing effective interpretability methods. Sparse Autoencoders (SAEs) have emerged as a promising tool for disentangling FM representations, but they struggle to capture rare yet crucial concepts in the data. We introduce Specialized Sparse Autoencoders (SSAEs), designed to illuminate these elusive dark matter features by focusing on specific subdomains. We present a practical recipe for training SSAEs, demonstrating the efficacy of dense retrieval for data selection and the benefits of Tilted Empirical Risk Minimization as a training objective to improve concept recall. Our evaluation of SSAEs on standard metrics, such as downstream perplexity and L_0 sparsity, shows that they effectively capture subdomain tail concepts, exceeding the capabilities of general-purpose SAEs. We showcase the practical utility of SSAEs in a case study on the Bias in Bios dataset, where SSAEs achieve a 12.5% increase in worst-group classification accuracy when applied to remove spurious gender information. SSAEs provide a powerful new lens for peering into the inner workings of FMs in subdomains.
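The Tilted Empirical Risk Minimization (TERM) objective mentioned in the abstract replaces the ordinary average of per-example losses with a log-sum-exp aggregate that up-weights high-loss examples, which is what makes it suited to rare, tail concepts. A minimal sketch of the tilted risk, assuming per-example losses have already been computed (the function name and the tilt parameter `t` are illustrative, not from the paper's code):

```python
import numpy as np

def tilted_risk(losses, t=1.0):
    """Tilted empirical risk: (1/t) * log( mean( exp(t * loss_i) ) ).

    For t > 0, examples with large loss dominate the aggregate, so the
    optimizer is pushed to fit rare / poorly-modeled (tail) examples.
    As t -> 0 the objective recovers the standard average (ERM).
    """
    losses = np.asarray(losses, dtype=float)
    if t == 0:
        return losses.mean()
    # Subtract the max inside the exp for numerical stability (log-sum-exp trick).
    m = (t * losses).max()
    return (m + np.log(np.mean(np.exp(t * losses - m)))) / t
```

With a tilt of zero this reduces to the usual mean loss; increasing `t` interpolates toward the maximum per-example loss, which is one way to improve recall of tail concepts at some cost to average-case fit.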

