Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation
December 10, 2024
作者: Pedro H. V. Valois, Lincon S. Souza, Erica K. Shimomoto, Kazuhiro Fukui
cs.AI
Abstract
Interpretability is a key challenge in fostering trust for Large Language Models (LLMs), which stems from the complexity of extracting reasoning from the model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) to interpret and control LLMs by modeling multi-token words. Prior research explored the LRH to connect LLM representations with linguistic concepts, but was limited to single-token analysis. As most words are composed of several tokens, we extend the LRH to multi-token words, thereby enabling its use on any textual data with thousands of concepts. To this end, we propose that words can be interpreted as frames: ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of the word frames sharing a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice. We verify these ideas on the Llama 3.1, Gemma 2, and Phi 3 families, demonstrating gender and language biases and exposing harmful content, but also the potential to remediate them, leading to safer and more transparent LLMs. Code is available at https://github.com/phvv-me/frame-representation-hypothesis.git
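The abstract's core construction can be illustrated with a minimal sketch: a word's frame is the ordered sequence of its token vectors stacked into a matrix, and a concept frame is the elementwise average of the frames of words sharing that concept. The toy vocabulary, random embedding matrix, and zero-padding of shorter words below are illustrative assumptions, not the paper's actual tokenizer or model weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an LLM's unembedding matrix (vocab_size x hidden_dim).
# In the paper's setting these rows would come from a real model
# (e.g. Llama 3.1); here they are random for illustration.
VOCAB = {"qu": 0, "een": 1, "prin": 2, "cess": 3}
E = rng.normal(size=(len(VOCAB), 8))

def word_frame(word_tokens, max_len):
    """A word's frame: the ordered sequence of its token vectors,
    stacked into a (max_len, hidden_dim) matrix.

    Zero-padding shorter words is an assumption made here so frames
    of different lengths can be averaged; the paper's construction
    may handle length mismatches differently."""
    frame = np.zeros((max_len, E.shape[1]))
    for i, tok in enumerate(word_tokens):
        frame[i] = E[VOCAB[tok]]
    return frame

# Two multi-token words grouped under one hypothetical concept.
concept_words = [["qu", "een"], ["prin", "cess"]]
max_len = max(len(w) for w in concept_words)

# Concept frame: the average of the word frames sharing the concept.
concept_frame = np.mean(
    [word_frame(w, max_len) for w in concept_words], axis=0
)
print(concept_frame.shape)  # (2, 8)
```

A concept frame built this way could then score candidate continuations (as in the paper's Top-k Concept-Guided Decoding) by comparing each candidate word's frame against the concept frame, steering generation toward the chosen concept.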