
Analyze Feature Flow to Enhance Interpretation and Steering in Language Models

February 5, 2025
Authors: Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
cs.AI

Abstract
We introduce a new approach to systematically map features discovered by sparse autoencoders across consecutive layers of large language models, extending earlier work that examined inter-layer feature links. By using a data-free cosine similarity technique, we trace how specific features persist, transform, or first appear at each stage. This method yields granular flow graphs of feature evolution, enabling fine-grained interpretability and mechanistic insights into model computations. Crucially, we demonstrate how these cross-layer feature maps facilitate direct steering of model behavior by amplifying or suppressing chosen features, achieving targeted thematic control in text generation. Together, our findings highlight the utility of a causal, cross-layer interpretability framework that not only clarifies how features develop through forward passes but also provides new means for transparent manipulation of large language models.
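The two core operations the abstract describes, matching SAE features across adjacent layers by cosine similarity of their decoder directions without any input data, and steering generation by amplifying or suppressing a chosen feature direction, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, array shapes, and the argmax-based matching rule are assumptions for the sake of the example.

```python
import numpy as np

def match_features(dec_a: np.ndarray, dec_b: np.ndarray):
    """Data-free feature matching between consecutive layers.

    dec_a: (n_features_a, d_model) decoder matrix of the layer-l SAE
    dec_b: (n_features_b, d_model) decoder matrix of the layer-(l+1) SAE

    For each layer-l feature, returns the index of its most similar
    layer-(l+1) feature and the cosine similarity of that match.
    Only the SAE weights are used -- no activations or text needed.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                              # (n_a, n_b) cosine similarities
    best = sims.argmax(axis=1)                  # best layer-(l+1) match per feature
    return best, sims[np.arange(len(best)), best]

def steer(hidden: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Steer a residual-stream activation along a feature direction.

    alpha > 0 amplifies the feature, alpha < 0 suppresses it.
    """
    d = direction / np.linalg.norm(direction)
    return hidden + alpha * d
```

Thresholding the returned similarities would then distinguish features that persist across layers (high similarity) from ones that first appear (no strong match), which is how a flow graph over layers could be assembled.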
