Erasing Conceptual Knowledge from Language Models
October 3, 2024
Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
cs.AI
Abstract
Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased-topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info.
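To make the "targeted low-rank updates" mentioned in the abstract concrete, below is a minimal PyTorch sketch of a LoRA-style low-rank parameterization of a single linear layer: the original weights stay frozen and only a rank-r delta is trained. This is an illustration under generic assumptions; the layer choice, rank, initialization, and the toy objective at the end are hypothetical and not the authors' exact ELM procedure.

```python
# Minimal sketch: a LoRA-style low-rank update on a frozen linear layer.
# Illustrative only; the actual ELM objective and targeted layers differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankUpdate(nn.Module):
    """Wraps a frozen linear layer and adds a trainable rank-r delta: W x + B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # delta starts at zero

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + F.linear(F.linear(x, self.A), self.B)

# Toy usage: only the low-rank factors A and B receive gradients, so an
# erasure-style objective (here a placeholder MSE toward an arbitrary target
# output) can be optimized without modifying the rest of the model.
base = nn.Linear(768, 768)
adapted = LowRankUpdate(base, rank=4)
x = torch.randn(2, 768)
target = torch.randn(2, 768)  # stand-in for a desired output distribution
loss = F.mse_loss(adapted(x), target)
loss.backward()
print([n for n, p in adapted.named_parameters() if p.grad is not None])  # ['A', 'B']
```

Because the update is confined to the low-rank factors, edits of this kind can be targeted at specific layers and concepts while the frozen base weights continue to carry the model's unrelated capabilities, which is the trade-off the abstract's specificity criterion measures.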