Mixture of Experts Made Intrinsically Interpretable
March 5, 2025
Authors: Xingyi Yang, Constantin Venhoff, Ashkan Khakzar, Christian Schroeder de Witt, Puneet K. Dokania, Adel Bibi, Philip Torr
cs.AI
Abstract
Neurons in large language models often exhibit polysemanticity,
simultaneously encoding multiple unrelated concepts and obscuring
interpretability. Instead of relying on post-hoc methods, we present
MoE-X, a Mixture-of-Experts (MoE) language model designed to be
intrinsically interpretable. Our approach is motivated by the
observation that, in language models, wider networks with sparse activations
are more likely to capture interpretable factors. However, directly training
such large sparse networks is computationally prohibitive. MoE architectures
offer a scalable alternative by activating only a subset of experts for any
given input, inherently aligning with interpretability objectives. In MoE-X, we
establish this connection by rewriting the MoE layer as an equivalent sparse,
large MLP. This approach enables efficient scaling of the hidden size while
maintaining sparsity. To further enhance interpretability, we enforce sparse
activation within each expert and redesign the routing mechanism to prioritize
experts with the highest activation sparsity. These designs ensure that only
the most salient features are routed and processed by the experts. We evaluate
MoE-X on chess and natural language tasks, showing that it achieves performance
comparable to dense models while significantly improving interpretability.
MoE-X achieves better perplexity than GPT-2, with interpretability surpassing
even sparse autoencoder (SAE)-based approaches.
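To make the abstract's two architectural ideas concrete, here is a minimal NumPy sketch, not the authors' implementation: it routes each input to the experts whose ReLU activations would be sparsest (using the fraction of zero activations as an assumed stand-in for the paper's sparsity criterion) and shows how the selected experts' hidden units can be read as one wide, sparsely activated MLP. All names, shapes, and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): a toy MoE layer that
# (a) routes to the experts with the sparsest ReLU activations, and
# (b) exposes the equivalent view of the layer as one wide, sparse MLP.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_expert, n_experts, top_k = 16, 32, 4, 2

# Per-expert weights. Stacking W1 along the hidden axis yields the weight
# matrix of an equivalent wide MLP with n_experts * d_expert hidden units.
W1 = rng.normal(size=(n_experts, d_model, d_expert)) / np.sqrt(d_model)
W2 = rng.normal(size=(n_experts, d_expert, d_model)) / np.sqrt(d_expert)

def moe_x_layer(x):
    # Hidden activations for every expert (computed for all experts here
    # purely for clarity; a real router must avoid this).
    h = np.maximum(0.0, np.einsum("d,edh->eh", x, W1))   # ReLU experts

    # Sparsity-aware routing: score each expert by the fraction of zeros
    # in its hidden activations, and pick the top_k sparsest experts.
    sparsity = (h == 0.0).mean(axis=1)                   # per-expert score
    chosen = np.argsort(-sparsity)[:top_k]

    # Wide-MLP view: the concatenated hidden vector of all experts, with
    # non-selected experts masked to zero, is sparse by construction.
    mask = np.zeros(n_experts)
    mask[chosen] = 1.0
    wide_hidden = (h * mask[:, None]).reshape(-1)        # (n_experts*d_expert,)
    assert (wide_hidden != 0).sum() <= top_k * d_expert  # sparsity bound

    # Output: the sum over selected experts, identical to multiplying the
    # sparse wide hidden vector by the stacked second-layer weights.
    return np.einsum("eh,ehd->d", h * mask[:, None], W2)

y = moe_x_layer(rng.normal(size=d_model))
print(y.shape)  # (16,)
```

Note that this toy version computes every expert before scoring them, which defeats the efficiency purpose of MoE; an actual system must score experts without running them all, and making such sparsity-prioritizing routing efficient is part of what the paper's redesigned routing mechanism addresses.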