

SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

April 11, 2025
Authors: Aashiq Muhamed, Jacopo Bonato, Mona Diab, Virginia Smith
cs.AI

Abstract

Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high computational costs, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. While Sparse Autoencoders (SAEs) are well-suited to improve these aspects by enabling targeted, activation-based unlearning, prior SAE approaches have underperformed gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce Dynamic SAE Guardrails (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based approaches for unlearning -- offering enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency including zero-shot settings, and more interpretable unlearning.
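
To make the mechanism described in the abstract concrete, below is a minimal sketch, assuming a standard ReLU SAE attached to a single residual stream. Features are selected by how much more strongly they fire on forget-set data than on retain-set data, and a per-input check (the "dynamic classifier") gates the intervention. The function names, tensor shapes, ratio-based score, and threshold tau are all hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch of dynamic SAE guardrails for unlearning.
# All names, shapes, scores, and thresholds are assumptions for
# exposition, not the authors' implementation.
import torch


def select_forget_features(
    f_forget: torch.Tensor,   # SAE features on forget-set data, (n_forget, d_sae)
    f_retain: torch.Tensor,   # SAE features on retain-set data, (n_retain, d_sae)
    k: int = 32,              # number of features to flag (hypothetical default)
) -> torch.Tensor:
    """Feature selection: keep the k features that fire far more
    on the forget set than on the retain set."""
    score = f_forget.mean(dim=0) / (f_retain.mean(dim=0) + 1e-6)
    return score.topk(k).indices


def dynamic_sae_guardrail(
    h: torch.Tensor,            # residual-stream activations, (seq, d_model)
    W_enc: torch.Tensor,        # SAE encoder weights, (d_model, d_sae)
    b_enc: torch.Tensor,        # SAE encoder bias, (d_sae,)
    W_dec: torch.Tensor,        # SAE decoder weights, (d_sae, d_model)
    b_dec: torch.Tensor,        # SAE decoder bias, (d_model,)
    forget_feats: torch.Tensor, # indices from select_forget_features
    tau: float = 0.1,           # firing threshold for the dynamic classifier
) -> torch.Tensor:
    """Clamp forget-related features only on inputs that actually trigger them."""
    f = torch.relu(h @ W_enc + b_enc)  # encode into the sparse feature basis

    # Dynamic classifier: intervene only if any forget feature fires above tau.
    if not (f[:, forget_feats] > tau).any():
        return h  # benign input: pass activations through untouched

    # Intervention: zero the flagged features, decode, and add back the SAE's
    # reconstruction error so unrelated information is preserved.
    f_edit = f.clone()
    f_edit[:, forget_feats] = 0.0
    recon_edit = f_edit @ W_dec + b_dec
    recon_orig = f @ W_dec + b_dec
    return recon_edit + (h - recon_orig)
```

Under these assumptions, the guardrail only rewrites activations, and only when the per-input check fires: no gradient updates touch the model weights. That design choice is consistent with the efficiency, stability, and sequential-unlearning advantages the abstract claims for activation-based unlearning over gradient-based methods.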

