SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Abstract

Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method that uses features learned by sparse autoencoders (SAEs) to remove unwanted concepts from text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Evaluation on the competitive UnlearnCanvas benchmark for object and style unlearning highlights SAeUron's state-of-the-art performance. Moreover, we show that a single SAE can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content even under adversarial attack. Code and checkpoints are available at: https://github.com/cywinski/SAeUron.
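The pipeline described in the abstract (encode activations with an SAE, select concept-related features, intervene on those features) can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a plain single-layer ReLU autoencoder with random, untrained weights, a mean-activation-difference score for feature selection, and zeroing of the selected features as the intervention. The names `ToySAE`, `select_features`, and `ablate` are hypothetical, and none of these specifics are stated in the abstract.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ToySAE:
    """Hypothetical single-layer ReLU autoencoder; weights are random, not trained."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        self.W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_features)]
        self.W_dec = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(d_model)]

    def encode(self, x):
        # ReLU keeps feature activations non-negative.
        return [max(0.0, p) for p in matvec(self.W_enc, x)]

    def decode(self, f):
        return matvec(self.W_dec, f)


def select_features(sae, concept_acts, other_acts, k):
    """Assumed scoring rule: mean activation on concept data minus mean on other data."""

    def mean_features(batch):
        feats = [sae.encode(x) for x in batch]
        return [sum(col) / len(feats) for col in zip(*feats)]

    concept_mean = mean_features(concept_acts)
    other_mean = mean_features(other_acts)
    scores = [c - o for c, o in zip(concept_mean, other_mean)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def ablate(sae, x, selected):
    """Zero the selected features and apply only the resulting change to x,
    so the part of x that the SAE does not reconstruct is left untouched."""
    f = sae.encode(x)
    f_ablated = [0.0 if i in selected else v for i, v in enumerate(f)]
    delta = [a - b for a, b in zip(sae.decode(f_ablated), sae.decode(f))]
    return [xi + d for xi, d in zip(x, delta)]


# Toy usage: random vectors stand in for diffusion-model activations.
rng = random.Random(1)
sae = ToySAE(d_model=4, n_features=8)
concept = [[rng.gauss(1, 1) for _ in range(4)] for _ in range(5)]
other = [[rng.gauss(-1, 1) for _ in range(4)] for _ in range(5)]
selected = select_features(sae, concept, other, k=2)
edited = ablate(sae, concept[0], set(selected))
```

Applying only the difference between the ablated and unablated reconstructions is one possible design choice: with an empty selection the activation passes through unchanged, which keeps the intervention local to the chosen features.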
