Monet: Mixture of Monosemantic Experts for Transformers
December 5, 2024
Authors: Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang
cs.AI
Abstract
Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity -- where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to reliance on post-hoc reconstruction loss. To address this issue, we introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.
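The claim that total parameters grow only with the square root of the expert count follows from a product-style decomposition of experts. The sketch below is a simplified illustration of that idea, not the actual Monet layer: it assumes each of the N = m² virtual experts is formed by pairing one of m "row" sub-experts with one of m "column" sub-experts, so only 2m parameter blocks need to be stored; the class and parameter names are hypothetical.

```python
# Minimal sketch: why N = m**2 virtual experts can be stored with only 2*m
# parameter blocks. This is an illustrative toy, NOT the exact Monet layer.
import math
import torch
import torch.nn as nn


class ProductDecomposedExperts(nn.Module):
    """Toy expert layer: virtual expert (i, j) composes row part i with
    column part j, giving m**2 experts from 2*m parameter blocks."""

    def __init__(self, d_model: int, d_expert: int, num_experts: int):
        super().__init__()
        m = int(math.isqrt(num_experts))
        assert m * m == num_experts, "num_experts must be a perfect square"
        self.m = m
        # "Row" experts project the input down, "column" experts project back up.
        self.row_proj = nn.Parameter(torch.randn(m, d_model, d_expert) * 0.02)
        self.col_proj = nn.Parameter(torch.randn(m, d_expert, d_model) * 0.02)

    def forward(self, x: torch.Tensor, expert_id: int) -> torch.Tensor:
        # Virtual expert k = i * m + j is materialized on the fly from parts i and j.
        i, j = divmod(expert_id, self.m)
        h = x @ self.row_proj[i]                  # (batch, d_expert)
        return torch.relu(h) @ self.col_proj[j]   # (batch, d_model)


# 262,144 = 512**2 virtual experts, but only 2 * 512 parameter blocks are stored.
layer = ProductDecomposedExperts(d_model=64, d_expert=16, num_experts=262_144)
out = layer(torch.randn(4, 64), expert_id=123_456)
print(out.shape)  # torch.Size([4, 64])
```

Under this toy decomposition, storage grows as 2·√N·d_model·d_expert rather than N·d_model·d_expert, which is the square-root scaling the abstract refers to; the real architecture additionally handles sparse routing over the virtual experts.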