Can sparse autoencoders be used to decompose and interpret steering vectors?

Abstract

Steering vectors are a promising approach to controlling the behaviour of large language models. However, their underlying mechanisms remain poorly understood. While sparse autoencoders (SAEs) may offer a way to interpret steering vectors, recent findings show that SAE-reconstructed vectors often lack the steering properties of the original vectors. This paper investigates why directly applying SAEs to steering vectors yields misleading decompositions, identifying two reasons: (1) steering vectors fall outside the input distribution for which SAEs are designed, and (2) steering vectors can have meaningful negative projections in feature directions, which SAEs are not designed to accommodate. These limitations hinder the direct use of SAEs for interpreting steering vectors.
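The second reason can be illustrated with a small sketch. It is a toy, not the paper's setup: the weights `W_ENC` and `W_DEC`, the zero bias, the ReLU activation, and the two-dimensional `steering_vector` are all assumptions chosen for illustration, and a trained SAE would have learned weights and many more features. The point is only that a non-negative activation such as ReLU cannot represent a negative projection onto a feature direction, so that part of the vector is lost in the reconstruction.

```python
# Toy sketch (hypothetical weights, not a trained SAE): a ReLU encoder
# clips the negative projection of a steering vector onto a feature.

def relu(x):
    """Element-wise ReLU: negative values become 0.0."""
    return [max(0.0, v) for v in x]

def matvec(m, x):
    """Multiply matrix m (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]

# Assumed for illustration: two orthonormal feature directions,
# identity encoder and decoder weights, zero bias.
W_ENC = [[1.0, 0.0], [0.0, 1.0]]
W_DEC = [[1.0, 0.0], [0.0, 1.0]]

def sae_encode(x):
    """Feature activations are non-negative by construction."""
    return relu(matvec(W_ENC, x))

def sae_decode(f):
    """Reconstruct the input as a weighted sum of feature directions."""
    return matvec(W_DEC, f)

# Hypothetical steering vector: positive along feature 0,
# negative along feature 1.
steering_vector = [1.0, -0.8]

features = sae_encode(steering_vector)
reconstruction = sae_decode(features)

print("features:      ", features)
print("reconstruction:", reconstruction)
```

In this toy case the activation for the second feature is zero, so the reconstruction keeps only the positive component. Whatever steering effect the negative component carried is absent from the decomposition.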
