Can sparse autoencoders be used to decompose and interpret steering vectors?

Abstract

Steering vectors are a promising approach for controlling the behaviour of large language models, but their underlying mechanisms remain poorly understood. Sparse autoencoders (SAEs) could offer a way to interpret steering vectors, yet recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why applying SAEs directly to steering vectors yields misleading decompositions and identifies two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections onto feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
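The second reason can be illustrated with a minimal sketch. It is not the paper's method: it assumes a standard SAE with a ReLU encoder (so feature activations are never negative), and the three-feature dictionary `FEATURES`, the zero encoder bias, and the example `steering_vector` are all hypothetical choices made for illustration.

```python
# Toy illustration (assumed setup, not taken from the paper): an SAE whose
# encoder applies a ReLU cannot assign a negative coefficient to a feature.

def relu(values):
    """Clamp every entry to be non-negative."""
    return [max(0.0, v) for v in values]

def matvec(rows, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(r * x for r, x in zip(row, vector)) for row in rows]

# Hypothetical dictionary: three orthonormal feature directions in R^3.
FEATURES = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

def sae_encode(vector):
    """Encoder: project onto each feature direction (zero bias), then ReLU."""
    return relu(matvec(FEATURES, vector))

def sae_decode(activations):
    """Decoder: sum of feature directions weighted by their activations."""
    dim = len(FEATURES[0])
    return [sum(a * f[i] for a, f in zip(activations, FEATURES)) for i in range(dim)]

# Hypothetical steering vector: promotes feature 0 and suppresses feature 1.
steering_vector = [1.0, -1.0, 0.0]

activations = sae_encode(steering_vector)    # [1.0, 0.0, 0.0]
reconstruction = sae_decode(activations)     # [1.0, 0.0, 0.0]

print("activations:   ", activations)
print("reconstruction:", reconstruction)
```

In this toy setting the negative projection onto feature 1 is clipped to zero, so the reconstruction keeps only the "promote feature 0" part of the vector and silently drops the "suppress feature 1" part.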

