Can sparse autoencoders be used to decompose and interpret steering vectors?
November 13, 2024
Authors: Harry Mayne, Yushi Yang, Adam Mahdi
cs.AI
Abstract
Steering vectors are a promising approach to control the behaviour of large
language models. However, their underlying mechanisms remain poorly understood.
While sparse autoencoders (SAEs) may offer a potential method to interpret
steering vectors, recent findings show that SAE-reconstructed vectors often
lack the steering properties of the original vectors. This paper investigates
why directly applying SAEs to steering vectors yields misleading
decompositions, identifying two reasons: (1) steering vectors fall outside the
input distribution for which SAEs are designed, and (2) steering vectors can
have meaningful negative projections in feature directions, which SAEs are not
designed to accommodate. These limitations hinder the direct use of SAEs for
interpreting steering vectors.
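The second reason given in the abstract, negative projections, can be illustrated with a toy example. The sketch below is not the authors' setup: it assumes a two-feature SAE with orthonormal feature directions, tied encoder and decoder weights, zero bias, and a ReLU activation, and it assumes the steering vector is a difference of mean activations between two contrastive prompt sets. All names and numbers are hypothetical.

```python
# Toy illustration only: every value and name here is a hypothetical assumption.

def relu(x):
    """ReLU activation: negative inputs are clipped to zero."""
    return max(0.0, x)

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Assumed dictionary: two orthonormal feature directions in a 2-D activation space.
FEATURES = [(1.0, 0.0), (0.0, 1.0)]

def sae_encode(x):
    """Encoder with tied weights and zero bias (simplifying assumptions)."""
    return [relu(dot(x, f)) for f in FEATURES]

def sae_decode(acts):
    """Decoder: weighted sum of the feature directions."""
    dim = len(FEATURES[0])
    return tuple(sum(a * f[i] for a, f in zip(acts, FEATURES)) for i in range(dim))

# Assumed mean activations over "positive" and "negative" prompt sets.
mean_pos = (3.0, 1.0)
mean_neg = (1.0, 2.0)

# Difference-of-means steering vector: it points *against* the second feature.
steering = tuple(p - n for p, n in zip(mean_pos, mean_neg))

acts = sae_encode(steering)   # the negative projection is clipped by the ReLU
recon = sae_decode(acts)      # the reconstruction has lost that component

print("steering vector:", steering)
print("SAE activations:", acts)
print("reconstruction: ", recon)
```

In this toy case the steering vector suppresses the second feature, but the ReLU encoder reports that feature as inactive, so the decomposition suggests the vector only writes along the first feature. The sketch does not model the first reason (steering vectors lying outside the SAE's input distribution), which in practice also interacts with a learned encoder bias.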