Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
December 10, 2024
Authors: Pedro H. V. Valois, Lincon S. Souza, Erica K. Shimomoto, Kazuhiro Fukui
cs.AI
Abstract
Interpretability is a key challenge in fostering trust in Large Language
Models (LLMs), stemming from the complexity of extracting reasoning from
the model's parameters. We present the Frame Representation Hypothesis, a
theoretically robust framework grounded in the Linear Representation Hypothesis
(LRH), for interpreting and controlling LLMs by modeling multi-token words. Prior
research explored the LRH to connect LLM representations with linguistic concepts,
but was limited to single-token analysis. Since most words are composed of several
tokens, we extend the LRH to multi-token words, enabling its use on any
textual data with thousands of concepts. To this end, we propose that words can be
interpreted as frames, ordered sequences of vectors that better capture
token-word relationships. Concepts can then be represented as the average of the
word frames that share a common concept. We showcase these tools through Top-k
Concept-Guided Decoding, which can intuitively steer text generation using
concepts of choice. We verify these ideas on the Llama 3.1, Gemma 2, and Phi 3
model families, demonstrating gender and language biases, exposing harmful content,
but also the potential to remediate such issues, leading to safer and more
transparent LLMs. Code is available at
https://github.com/phvv-me/frame-representation-hypothesis.git
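The abstract's core constructions can be sketched in a toy NumPy example: a word as a frame (the ordered sequence of its token vectors), a concept as the average of the frames of words sharing it, and concept-guided selection among candidate words. The vocabulary, random embeddings, and mean-cosine similarity below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # toy embedding dimension

# Hypothetical token-embedding table; a stand-in for an LLM's token vectors.
vocab = {tok: rng.normal(size=dim) for tok in ["ki", "ng", "qu", "een", "ro", "yal"]}

def word_frame(tokens):
    """A frame: the ordered sequence of a word's token vectors, stacked row-wise."""
    return np.stack([vocab[t] for t in tokens])

# Concept frame: the average of the frames of words sharing the concept.
# (Both words here happen to have two tokens, which keeps the toy average simple.)
royalty = (word_frame(["ki", "ng"]) + word_frame(["qu", "een"])) / 2

def frame_similarity(f, g):
    """Mean cosine similarity between corresponding rows of two frames."""
    num = np.sum(f * g, axis=1)
    den = np.linalg.norm(f, axis=1) * np.linalg.norm(g, axis=1)
    return float(np.mean(num / den))

# Concept-guided choice among candidate words: pick the word whose frame
# is most similar to the chosen concept frame.
candidates = [["ki", "ng"], ["ro", "yal"]]
best = max(candidates, key=lambda w: frame_similarity(word_frame(w), royalty))
```

In the paper's Top-k Concept-Guided Decoding, an analogous similarity score steers generation toward the chosen concept at each decoding step; here the frame built from "ki"+"ng" scores highest because it contributes to the "royalty" average.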