Frame Representation Hypothesis: Multi-Token LLM Interpretability and Concept-Guided Text Generation

Abstract

Interpretability is a key challenge in fostering trust in Large Language Models (LLMs), one that stems from the complexity of extracting reasoning from a model's parameters. We present the Frame Representation Hypothesis, a theoretically robust framework grounded in the Linear Representation Hypothesis (LRH) to interpret and control LLMs by modeling multi-token words. Prior research explored LRH to connect LLM representations with linguistic concepts, but was limited to single-token analysis. As most words are composed of several tokens, we extend LRH to multi-token words, thereby enabling its use on any textual data with thousands of concepts. To this end, we propose that words can be interpreted as frames: ordered sequences of vectors that better capture token-word relationships. Concepts can then be represented as the average of the word frames that share a common concept. We showcase these tools through Top-k Concept-Guided Decoding, which can intuitively steer text generation using concepts of choice. We verify these ideas on the Llama 3.1, Gemma 2, and Phi 3 families, demonstrating gender and language biases and exposing harmful content, but also showing the potential to remediate them, leading to safer and more transparent LLMs. Code is available at https://github.com/phvv-me/frame-representation-hypothesis.git
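The three ideas named in the abstract (words as frames, concepts as averaged frames, and top-k concept-guided decoding) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the function names, the position-wise averaging, the row-wise cosine similarity, and the identity-matrix "embedding" are hypothetical and are not taken from the authors' repository, which may define each step differently.

```python
import math

def word_frame(token_ids, embedding):
    """A word frame: the ordered list of token vectors for one word."""
    return [embedding[t] for t in token_ids]

def concept_frame(frames):
    """Average word frames sharing a concept, position by position.
    Assumption: frames are truncated to the shortest one before averaging."""
    n = min(len(f) for f in frames)
    dim = len(frames[0][0])
    return [[sum(f[i][d] for f in frames) / len(frames) for d in range(dim)]
            for i in range(n)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def frame_similarity(a, b):
    """Mean cosine similarity between aligned rows of two frames (an assumed metric)."""
    n = min(len(a), len(b))
    return sum(cosine(a[i], b[i]) for i in range(n)) / n

def concept_guided_pick(logits, history_ids, concept, embedding, k=3):
    """Among the k most likely next tokens, return the one whose trailing
    window of tokens forms the frame closest to the concept frame."""
    top_k = sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]
    n = len(concept)

    def score(t):
        window = (history_ids + [t])[-n:]
        return frame_similarity(word_frame(window, embedding), concept)

    # max() keeps the first best candidate, so ties fall back to the likelier token.
    return max(top_k, key=score)

# Toy usage: a 4-token vocabulary with one-hot vectors (hypothetical data).
E = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
concept = concept_frame([word_frame([0, 1], E), word_frame([0, 1], E)])
print(concept_guided_pick([0.1, 0.2, 0.9, 0.8], [0], concept, E, k=3))  # prints 1
```

With `k=1` the same call returns the most likely token unchanged, so `k` controls how far the concept is allowed to pull generation away from the model's own ranking.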
