Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
October 21, 2024
Authors: Yu Zhao, Alessio Devoto, Giwon Hong, Xiaotang Du, Aryo Pradipta Gema, Hongru Wang, Kam-Fai Wong, Pasquale Minervini
cs.AI
Abstract
Large language models (LLMs) can store a significant amount of factual
knowledge in their parameters. However, their parametric knowledge may conflict
with the information provided in the context -- this phenomenon, known as
context-memory knowledge conflicts, can lead to undesirable model
behaviour, such as reliance on outdated or incorrect information. Analysing the
internal activations of LLMs, we find that they can internally register the
signals of knowledge conflict at mid-layers. Such signals allow us to detect
whether a knowledge conflict occurs and use inference-time intervention
strategies to resolve it. In this work, we propose SpARE, a
training-free representation engineering method that uses pre-trained
sparse auto-encoders (SAEs) to control the knowledge selection behaviour of
LLMs. SpARE identifies the functional features that control the
knowledge selection behaviours and applies them to edit the internal
activations of LLMs at inference time. Our experimental results show that
SpARE can effectively control the usage of either knowledge source to
resolve knowledge conflict in open-domain question-answering tasks, surpassing
existing representation engineering methods (+10%) as well as contrastive
decoding methods (+15%).
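The abstract describes SpARE as editing mid-layer activations through the feature space of a pre-trained SAE at inference time. Below is a minimal, hypothetical sketch of that general pattern in PyTorch; the `SparseAutoencoder` class, the `make_steering_hook` helper, the layer index, the feature-index sets, and the scaling factor `alpha` are all illustrative assumptions, and the paper's actual feature-selection and editing rules are not reproduced here.

```python
import torch

class SparseAutoencoder(torch.nn.Module):
    """A standard sparse auto-encoder over the residual stream: encodes an
    activation into a sparse feature vector and decodes it back."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = torch.nn.Linear(d_model, d_features)
        self.decoder = torch.nn.Linear(d_features, d_model)

    def encode(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(h))

    def decode(self, z: torch.Tensor) -> torch.Tensor:
        return self.decoder(z)

def make_steering_hook(sae, context_feats, memory_feats, alpha=1.0):
    """Builds a forward hook that steers a mid-layer activation toward
    contextual knowledge: it amplifies SAE features associated with using
    the context and mutes those associated with parametric memory."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        z = sae.encode(h)
        z_edit = z.clone()
        z_edit[..., context_feats] += alpha   # promote context-following features
        z_edit[..., memory_feats] = 0.0       # suppress memory-reliance features
        # Apply only the delta implied by the edited features, so whatever
        # the SAE fails to reconstruct is left untouched.
        h_new = h + sae.decode(z_edit) - sae.decode(z)
        return (h_new,) + output[1:] if isinstance(output, tuple) else h_new
    return hook

# Illustrative usage with a HuggingFace-style decoder; layer 16 is a guess,
# chosen only because the conflict signals are reported at mid-layers:
# sae = SparseAutoencoder(d_model=4096, d_features=32768)  # pre-trained in practice
# hook = make_steering_hook(sae, context_feats=torch.tensor([12, 345]),
#                           memory_feats=torch.tensor([7, 2048]), alpha=2.0)
# handle = model.model.layers[16].register_forward_hook(hook)
# outputs = model.generate(**inputs)
# handle.remove()
```

Editing the activation as `h + decode(z_edit) - decode(z)` rather than replacing it with `decode(z_edit)` is one way to keep the SAE's reconstruction error out of the edit; whether SpARE uses exactly this rule is an assumption of the sketch.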