Erasing Conceptual Knowledge from Language Models

October 3, 2024
Authors: Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
cs.AI

Abstract

Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info
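
The abstract's mention of "targeted low-rank updates" can be illustrated with a minimal LoRA-style adapter sketch. This is not the authors' implementation: the class name LowRankUpdate, the rank and alpha values, and the choice of wrapping a single linear projection are illustrative assumptions, showing only how a small trainable low-rank delta can alter a frozen layer's output distribution while leaving the base weights untouched.

```python
# Minimal sketch (not the paper's code) of a targeted low-rank update
# applied to one frozen projection matrix, in the spirit of LoRA adapters.
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """Wraps a frozen linear layer and adds a trainable low-rank delta."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # original weights stay frozen
        # Low-rank factors: delta_W = B @ A, with rank << min(in, out)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction applied to x.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage: only A and B receive gradients, so an erasure objective
# (e.g. matching a target distribution on prompts about the erased
# concept) updates a small number of parameters.
layer = LowRankUpdate(nn.Linear(512, 512), rank=8)
x = torch.randn(2, 512)
print(layer(x).shape)  # torch.Size([2, 512])
```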
