Erasing Conceptual Knowledge from Language Models

Abstract

Concept erasure in language models has traditionally lacked a comprehensive evaluation framework, leading to incomplete assessments of the effectiveness of erasure methods. We propose an evaluation paradigm centered on three critical criteria: innocence (complete knowledge removal), seamlessness (maintaining conditional fluent generation), and specificity (preserving unrelated task performance). Our evaluation metrics naturally motivate the development of Erasure of Language Memory (ELM), a new method designed to address all three dimensions. ELM employs targeted low-rank updates to alter output distributions for erased concepts while preserving overall model capabilities, including fluency when prompted for an erased concept. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative analysis shows that ELM achieves superior performance across our proposed metrics, including near-random scores on erased topic assessments, generation fluency, maintained accuracy on unrelated benchmarks, and robustness under adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info.
