

SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

January 29, 2025
Authors: Bartosz Cywiński, Kamil Deja
cs.AI

Abstract

Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation with the competitive UnlearnCanvas benchmark on object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content, even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.
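The core intervention the abstract describes — encoding a diffusion model's internal activations with a sparse autoencoder, suppressing the features selected for the unwanted concept, and decoding back — can be illustrated with a minimal sketch. All names, dimensions, and the random stand-in weights below are hypothetical; the actual SAE architecture, training, and feature-selection procedure are detailed in the paper and repository, not here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: model activation size and SAE dictionary size.
d_model, d_sae = 16, 64

# Random stand-ins for a pretrained SAE's parameters (illustration only).
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))
b_dec = np.zeros(d_model)

def sae_encode(x):
    """ReLU encoder: maps activations to sparse, non-negative features."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def sae_decode(f):
    """Linear decoder: reconstructs model activations from features."""
    return f @ W_dec + b_dec

def ablate_concept(x, concept_feature_ids, scale=0.0):
    """Intervene on activations at a denoising step: encode, suppress
    the features selected for the target concept, decode back."""
    f = sae_encode(x)
    f[..., concept_feature_ids] *= scale  # scale=0.0 removes the concept entirely
    return sae_decode(f)

# A batch of (fake) diffusion-model activations, edited in feature space.
x = rng.normal(size=(4, d_model))
x_edited = ablate_concept(x, concept_feature_ids=[3, 17])
```

The edited activations `x_edited` would replace the originals in the model's forward pass; activations with no active concept features pass through (up to SAE reconstruction error) unchanged, which is why the intervention can be precise while preserving overall generation quality.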

