Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
October 21, 2024
Authors: Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Kam-Fai Wong, Pasquale Minervini
cs.AI
Abstract
Large language models (LLMs) can store a significant amount of factual
knowledge in their parameters. However, their parametric knowledge may conflict
with the information provided in the context -- this phenomenon, known as
context-memory knowledge conflicts, can lead to undesirable model
behaviour, such as reliance on outdated or incorrect information. Analysing the
internal activations of LLMs, we find that they can internally register the
signals of knowledge conflict at mid-layers. Such signals allow us to detect
whether a knowledge conflict occurs and use inference-time intervention
strategies to resolve it. In this work, we propose SpARE, a
training-free representation engineering method that uses pre-trained
sparse auto-encoders (SAEs) to control the knowledge selection behaviour of
LLMs. SpARE identifies the functional features that control the
knowledge selection behaviours and applies them to edit the internal
activations of LLMs at inference time. Our experimental results show that
SpARE can effectively control the usage of either knowledge source to
resolve knowledge conflict in open-domain question-answering tasks, surpassing
existing representation engineering methods (+10%) as well as contrastive
decoding methods (+15%).
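The abstract describes SpARE as editing mid-layer activations through the feature space of a pre-trained SAE at inference time. Below is a minimal, hypothetical sketch of that general pattern in PyTorch; the `SparseAutoencoder` class, the `make_steering_hook` helper, the layer index, the feature-index sets, and the scaling factor `alpha` are all illustrative assumptions, and the paper's actual feature-selection and editing rules are not reproduced here.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """A standard sparse auto-encoder over the residual stream: encodes an
    activation into a sparse feature vector and decodes it back."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)
        self.decoder = torch.nn.Linear(d_features, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

def make_steering_hook(sae, context_feats, memory_feats, alpha=1.0):
    """Builds a forward hook that steers a mid-layer activation toward
    contextual knowledge: it amplifies SAE features associated with using
    the context and mutes those associated with parametric memory."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        z = sae.encode(h)
        z_edit = z.clone()
        z_edit[..., context_feats] += alpha   # promote context-following features
        z_edit[..., memory_feats] = 0.0       # suppress memory-reliance features
        # Apply only the delta implied by the edited features, so whatever
        # the SAE fails to reconstruct is left untouched.
        h_new = h + sae.decode(z_edit) - sae.decode(z)
        return (h_new,) + output[1:] if isinstance(output, tuple) else h_new
    return hook

# Illustrative usage with a HuggingFace-style decoder; layer 16 is a guess,
# chosen only because the conflict signals are reported at mid-layers:
# sae = SparseAutoencoder(d_model=4096, d_features=32768)  # pre-trained in practice
# hook = make_steering_hook(sae, context_feats=torch.tensor([12, 345]),
#                           memory_feats=torch.tensor([7, 2048]), alpha=2.0)
# handle = model.model.layers[16].register_forward_hook(hook)
# outputs = model.generate(**inputs)
# handle.remove()
```

Editing the activation as `h + decode(z_edit) - decode(z)` rather than replacing it with `decode(z_edit)` is one way to keep the SAE's reconstruction error out of the edit; whether SpARE uses exactly this rule is an assumption of the sketch.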