Monet: Mixture of Monosemantic Experts for Transformers

December 5, 2024
Authors: Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang
cs.AI

Abstract

Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity, where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to their reliance on post-hoc reconstruction loss. To address this issue, we introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.
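To make the square-root scaling claim concrete, the sketch below shows one generic way such a combinatorial expert decomposition can work: N addressable experts are formed as all (i, j) pairs over two pools of sqrt(N) sub-layers, so stored parameters grow with sqrt(N) while the expert count grows with N. This is a minimal, hedged illustration of the scaling argument only; the class name, layer shapes, and routing-by-index interface are assumptions for the example and do not reproduce Monet's actual decomposition or routing as released in the repository.

```python
import math
import torch
import torch.nn as nn

class CombinatorialExperts(nn.Module):
    """Illustrative sketch (not the exact Monet implementation): N experts are
    addressed as all (i, j) pairs over two pools of sqrt(N) sub-layers, so the
    parameter count grows with sqrt(N) while the expert count grows with N."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.n_side = math.isqrt(num_experts)  # sqrt(N) sub-layers per pool
        assert self.n_side ** 2 == num_experts, "num_experts must be a perfect square"
        # Pool of "bottom" projections (d_model -> d_hidden) and "top" projections
        # (d_hidden -> d_model); expert (i, j) chains bottom_i with top_j.
        self.bottom = nn.Parameter(torch.randn(self.n_side, d_model, d_hidden) * 0.02)
        self.top = nn.Parameter(torch.randn(self.n_side, d_hidden, d_model) * 0.02)

    def forward(self, x: torch.Tensor, expert_idx: int) -> torch.Tensor:
        # Decode the flat expert index into its (bottom, top) pair.
        i, j = divmod(expert_idx, self.n_side)
        h = torch.relu(x @ self.bottom[i])
        return h @ self.top[j]

# 262,144 addressable experts, but only 2 * 512 = 1,024 stored sub-layers.
layer = CombinatorialExperts(d_model=512, d_hidden=16, num_experts=262_144)
print(sum(p.numel() for p in layer.parameters()))  # parameters scale with sqrt(N)
```

Under these assumptions, doubling the number of experts per layer requires only about 1.4x the expert parameters, which is what allows the expert count to be pushed to 262,144 per layer without a proportional blow-up in model size.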
