Monet: Mixture of Monosemantic Experts for Transformers

Abstract

Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors such as toxic content generation. However, mechanistic interpretability is hindered by polysemanticity, where individual neurons respond to multiple, unrelated concepts. While Sparse Autoencoders (SAEs) have attempted to disentangle these features through sparse dictionary learning, they have compromised LLM performance due to their reliance on post-hoc reconstruction loss. To address this issue, we introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture, which incorporates sparse dictionary learning directly into end-to-end Mixture-of-Experts pretraining. Our novel expert decomposition method enables scaling the expert count to 262,144 per layer while total parameters scale proportionally to the square root of the number of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and of directly resecting internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.
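The square-root scaling stated in the abstract can be illustrated with a small piece of arithmetic. The sketch below is not the paper's decomposition method, which the abstract does not describe. It assumes, purely for illustration, that each expert is addressed by a pair of indices drawn from two equally sized groups of stored components, so that N experts require only about 2·√N stored components. The function names `components_per_group` and `composed_expert_count` are hypothetical.

```python
import math


def components_per_group(num_experts: int) -> int:
    """Size of each of the two index groups needed to address num_experts pairs.

    Assumption for this sketch: experts are (i, j) pairs, so the expert count
    must be a perfect square.
    """
    root = math.isqrt(num_experts)
    if root * root != num_experts:
        raise ValueError("this sketch only handles perfect-square expert counts")
    return root


def composed_expert_count(group_size: int) -> int:
    """Number of distinct (i, j) pairs formed from two groups of group_size."""
    return group_size * group_size


# Expert count per layer, taken from the abstract.
experts_per_layer = 262_144

group_size = components_per_group(experts_per_layer)

# Under the pairing assumption, only two groups of components are stored,
# so storage grows with the square root of the expert count.
stored_components = 2 * group_size

print(f"experts per layer:      {experts_per_layer}")
print(f"components per group:   {group_size}")
print(f"stored components:      {stored_components}")
print(f"experts recovered:      {composed_expert_count(group_size)}")
```

Under this assumption, quadrupling the number of experts only doubles the number of stored components, which is the sense in which parameters grow with the square root of the expert count.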

