Counterfactual Generation from Language Models

Abstract

Understanding and manipulating the causal generation mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery -- e.g., model ablations or manipulation of linear subspaces tied to specific concepts -- to intervene on these models. To understand the impact of interventions precisely, it is useful to examine counterfactuals -- e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl's causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as Generalized Structural-equation Models using the Gumbel-max trick. This allows us to model the joint distribution over original strings and their counterfactuals resulting from the same instantiation of the sampling noise. We develop an algorithm based on hindsight Gumbel sampling that allows us to infer the latent noise variables and generate counterfactuals of observed strings. Our experiments demonstrate that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
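
To make the idea of hindsight Gumbel sampling concrete, here is a minimal single-token sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`infer_noise`, `argmax_with_noise`, `truncated_gumbel`) and the example logits are hypothetical, and the posterior sampler is the standard "sample the maximum, then truncate the rest" construction for the Gumbel-max trick, which may differ in detail from the algorithm in the paper. A full string counterfactual would repeat this step for every position of the observed string.

```python
import math
import random


def gumbel(rng, loc=0.0):
    """Draw one Gumbel(loc, 1) sample by inverse-CDF sampling."""
    u = max(rng.random(), 1e-300)  # guard against log(0)
    return loc - math.log(-math.log(u))


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def truncated_gumbel(rng, loc, bound):
    """Draw a Gumbel(loc, 1) sample conditioned to be at most `bound`."""
    g = gumbel(rng, loc)
    # Stable form of -log(exp(-g) + exp(-bound)).
    return min(g, bound) - math.log1p(math.exp(-abs(g - bound)))


def infer_noise(rng, logits, observed):
    """Sample noise such that argmax(logits + noise) == observed (hindsight step)."""
    # The maximum perturbed logit is Gumbel-distributed around logsumexp(logits).
    top = gumbel(rng, logsumexp(logits))
    noise = []
    for i, logit in enumerate(logits):
        value = top if i == observed else truncated_gumbel(rng, logit, top)
        noise.append(value - logit)
    return noise


def argmax_with_noise(logits, noise):
    """Gumbel-max sampling: pick the index with the largest perturbed logit."""
    scores = [l + n for l, n in zip(logits, noise)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical next-token logits before and after an intervention (made-up values).
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
intervened = [0.0, 0.5, 3.0]
observed = 0  # token actually produced by the original model

noise = infer_noise(rng, logits, observed)
counterfactual = argmax_with_noise(intervened, noise)  # same noise, new logits
```

Reusing the inferred noise with the intervened logits is what ties the counterfactual token to the observed one: replaying the noise against the original logits reproduces the observed token, so any change in the output is attributable to the intervention rather than to fresh sampling randomness.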
