Counterfactual Generation from Language Models

November 11, 2024
Authors: Shauli Ravfogel, Anej Svete, Vésteinn Snæbjarnarson, Ryan Cotterell
cs.AI

Abstract

Understanding and manipulating the causal generation mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery -- e.g., model ablations or manipulation of linear subspaces tied to specific concepts -- to intervene on these models. To understand the impact of interventions precisely, it is useful to examine counterfactuals -- e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl's causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as Generalized Structural-equation Models using the Gumbel-max trick. This allows us to model the joint distribution over original strings and their counterfactuals resulting from the same instantiation of the sampling noise. We develop an algorithm based on hindsight Gumbel sampling that allows us to infer the latent noise variables and generate counterfactuals of observed strings. Our experiments demonstrate that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
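The Gumbel-max reformulation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; the five-token vocabulary and the logit values are invented for illustration. The idea: sampling a token as the argmax of logits plus Gumbel noise is an exact categorical sample, and reusing the *same* noise instantiation under intervened logits yields the counterfactual token.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_max_sample(logits, noise):
    # Gumbel-max trick: argmax(logits + Gumbel noise) is an exact
    # sample from the softmax distribution over the logits.
    return int(np.argmax(logits + noise))

# Toy next-token logits over a 5-token vocabulary for the original
# model and for the model after a hypothetical intervention.
logits_original = np.array([2.0, 0.5, 0.1, -1.0, 0.0])
logits_intervened = np.array([0.1, 2.5, 0.1, -1.0, 0.0])

# One shared instantiation of the sampling noise (one Gumbel per token).
noise = rng.gumbel(size=5)

# Factual and counterfactual tokens are coupled through the same noise,
# so their joint distribution is well defined.
token_factual = gumbel_max_sample(logits_original, noise)
token_counterfactual = gumbel_max_sample(logits_intervened, noise)
```

In the paper's setting the noise is not drawn fresh but inferred in hindsight from an observed string (posterior Gumbel sampling), and the counterfactual is then generated by replaying that inferred noise through the intervened model.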

