
SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

April 11, 2025
Authors: Aashiq Muhamed, Jacopo Bonato, Mona Diab, Virginia Smith
cs.AI

Abstract

Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high computational costs, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. While Sparse Autoencoders (SAEs) are well-suited to improve these aspects by enabling targeted activation-based unlearning, prior approaches underperform gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce Dynamic SAE Guardrails (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based approaches for unlearning -- offering enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency including zero-shot settings, and more interpretable unlearning.
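The core idea of an activation-based guardrail can be sketched in a few lines. The toy code below is an illustration only, not the paper's implementation: it assumes a tiny ReLU-encoder SAE with randomly initialized weights, a hypothetical set of pre-selected "forget features" (in the paper, features are chosen by a principled selection procedure), and a simple threshold on total forget-feature activation standing in for the dynamic classifier. When an input's activation looks forget-related, those SAE features are zeroed before the activation is reconstructed; otherwise the activation passes through untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real SAEs over LLM residual streams are far larger.
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model))

def sae_encode(x):
    """ReLU encoder: sparse SAE feature activations for activation vector x."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

# Hypothetical forget-feature indices, assumed to have been selected offline
# because they fire disproportionately on the forget set.
forget_features = np.array([3, 7, 19])

def dynamic_guardrail(x, threshold=1.0):
    """Clamp forget features only when the input looks forget-related."""
    f = sae_encode(x)
    # Stand-in dynamic classifier: total activation mass on forget features.
    score = f[forget_features].sum()
    if score > threshold:
        f = f.copy()
        f[forget_features] = 0.0  # suppress forget-related directions
    return f @ W_dec  # reconstruct the (possibly edited) activation
```

Because the intervention is gated per input, unrelated inputs are reconstructed unchanged, which is one intuition for why a dynamic scheme can preserve utility better than always-on feature clamping.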

