Counterfactual Generation from Language Models
November 11, 2024
Authors: Shauli Ravfogel, Anej Svete, Vésteinn Snæbjarnarson, Ryan Cotterell
cs.AI
Abstract
Understanding and manipulating the causal generation mechanisms in language
models is essential for controlling their behavior. Previous work has primarily
relied on techniques such as representation surgery -- e.g., model ablations or
manipulation of linear subspaces tied to specific concepts -- to intervene on
these models. To understand the impact of interventions precisely, it is useful
to examine counterfactuals -- e.g., how a given sentence would have appeared
had it been generated by the model following a specific intervention. We
highlight that counterfactual reasoning is conceptually distinct from
interventions, as articulated in Pearl's causal hierarchy. Based on this
observation, we propose a framework for generating true string counterfactuals
by reformulating language models as Generalized Structural-equation Models
using the Gumbel-max trick. This allows us to model the joint distribution over
original strings and their counterfactuals resulting from the same
instantiation of the sampling noise. We develop an algorithm based on hindsight
Gumbel sampling that allows us to infer the latent noise variables and generate
counterfactuals of observed strings. Our experiments demonstrate that the
approach produces meaningful counterfactuals while at the same time showing
that commonly used intervention techniques have considerable undesired side
effects.
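The abstract names two ingredients, the Gumbel-max trick and hindsight Gumbel sampling, without showing them. The following is a minimal single-token sketch of both, not the authors' implementation. It assumes a toy vector of next-token logits, and the function names (`gumbel_max_sample`, `hindsight_gumbel_noise`, `counterfactual_token`) are hypothetical. The noise is inferred with the standard top-down construction: the maximum of the perturbed logits is drawn first, and the remaining perturbed logits are drawn as Gumbel variables truncated below that maximum.

```python
import numpy as np


def gumbel_max_sample(logits, rng):
    """Sample a token index as argmax(logits + Gumbel noise); also return the noise."""
    logits = np.asarray(logits, dtype=float)
    noise = rng.gumbel(size=logits.shape)
    return int(np.argmax(logits + noise)), noise


def hindsight_gumbel_noise(logits, observed, rng):
    """Draw Gumbel noise conditioned on `observed` having been the argmax.

    Assumption: this is the generic top-down (truncated Gumbel) construction,
    used here only to illustrate the idea of inferring latent sampling noise.
    """
    logits = np.asarray(logits, dtype=float)
    # The maximum perturbed logit is Gumbel(logsumexp(logits)), independent of the argmax.
    max_value = rng.gumbel(loc=np.logaddexp.reduce(logits))
    # Every other perturbed logit is Gumbel(logit_i) truncated to lie below the maximum.
    free = rng.gumbel(loc=logits)
    perturbed = -np.logaddexp(-max_value, -free)
    perturbed[observed] = max_value
    # Subtract the logits to recover the noise itself.
    return perturbed - logits


def counterfactual_token(cf_logits, noise):
    """Re-run the argmax under intervened logits while reusing the same noise."""
    return int(np.argmax(np.asarray(cf_logits, dtype=float) + noise))


rng = np.random.default_rng(0)
original_logits = np.array([1.0, 0.5, -0.3, 2.0])    # toy values
intervened_logits = np.array([1.0, 0.5, 3.0, -1.0])  # toy "post-intervention" values
observed_token = 2

noise = hindsight_gumbel_noise(original_logits, observed_token, rng)
print(counterfactual_token(original_logits, noise))    # reproduces the observed token
print(counterfactual_token(intervened_logits, noise))  # counterfactual token
```

For a full string, one would presumably repeat this per position, using the model's next-token logits before and after the intervention and sharing the inferred noise between the two; that loop is omitted here because it depends on the model interface.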