

BiasEdit: Debiasing Stereotyped Language Models via Model Editing

March 11, 2025
作者: Xin Xu, Wei Xu, Ningyu Zhang, Julian McAuley
cs.AI

Abstract

Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or to directly alter the models' biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method that removes stereotypical bias from language models through lightweight networks acting as editors to generate parameter updates. BiasEdit employs a debiasing loss that guides the editor networks to conduct local edits on a subset of a language model's parameters, while a retention loss preserves the model's language modeling abilities during editing. Experiments on StereoSet and Crows-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models' general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore the impact of bias editing on different components of language models.
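The abstract's core objective combines two terms: a debiasing loss that equalizes the edited model's scores on stereotyped and anti-stereotyped continuations, and a retention loss that keeps outputs on neutral inputs close to the original model's. The sketch below illustrates that trade-off on a toy linear "module"; the function names, the linear module, and the weighting `alpha` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def edited_score(W, delta, x):
    # Score from the edited module: original weights W plus the
    # editor-generated parameter update delta, applied to input x.
    return (W + delta) @ x

def debias_loss(W, delta, x_stereo, x_anti):
    # Debiasing term: push the edited scores for a stereotyped input
    # and its anti-stereotyped counterpart to be equal.
    s = edited_score(W, delta, x_stereo)
    a = edited_score(W, delta, x_anti)
    return float(np.sum((s - a) ** 2))

def retention_loss(W, delta, x_neutral):
    # Retention term: on neutral inputs, the edited module should
    # behave like the unedited one, preserving language modeling.
    return float(np.sum((edited_score(W, delta, x_neutral) - W @ x_neutral) ** 2))

def total_loss(W, delta, x_stereo, x_anti, x_neutral, alpha=1.0):
    # Combined editing objective; alpha balances debiasing vs. retention
    # (a hypothetical hyperparameter for this sketch).
    return debias_loss(W, delta, x_stereo, x_anti) + alpha * retention_loss(W, delta, x_neutral)
```

In BiasEdit the update `delta` is produced by a lightweight editor network trained against such an objective, rather than optimized directly; the sketch only shows the shape of the loss being minimized.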
