MetaSC: Test-Time Safety Specification Optimization for Language Models

Abstract

We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach uses a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also on diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models show that dynamically optimized safety prompts yield significantly higher safety scores than fixed system prompts and static self-critique defenses. Code will be released at https://github.com/vicgalle/meta-self-critique.git.
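
To make the loop described above concrete, here is a minimal sketch of how a critique, revision, and meta-critique cycle might be wired together. It is an illustration under stated assumptions, not the authors' implementation: the `LM` interface, the prompt wording, the function names, and the `toy_lm` stand-in are all hypothetical, and the real prompts and update rule may differ.

```python
from typing import Callable, List, Tuple

# Hypothetical interface (an assumption): a model is any function text -> text.
LM = Callable[[str], str]


def self_critique(lm: LM, spec: str, request: str) -> Tuple[str, str, str]:
    """One critique-and-revise pass guided by the current specification."""
    draft = lm(f"Request: {request}\nRespond.")
    critique = lm(
        f"Specification: {spec}\nResponse: {draft}\nCritique the response."
    )
    revised = lm(
        f"Specification: {spec}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response."
    )
    return draft, critique, revised


def meta_critique(lm: LM, spec: str, request: str, critique: str, revised: str) -> str:
    """Ask the model for an updated specification; model weights are untouched."""
    return lm(
        f"Current specification: {spec}\nRequest: {request}\n"
        f"Critique: {critique}\nRevised response: {revised}\n"
        "Propose an improved specification."
    )


def run_metasc_sketch(
    lm: LM, initial_spec: str, requests: List[str]
) -> Tuple[str, List[str]]:
    """Process requests in order, updating the specification after each one."""
    spec = initial_spec
    outputs: List[str] = []
    for request in requests:
        _, critique, revised = self_critique(lm, spec, request)
        outputs.append(revised)
        # The updated specification steers critique and revision of later requests.
        spec = meta_critique(lm, spec, request, critique, revised)
    return spec, outputs


def toy_lm(prompt: str) -> str:
    """Deterministic stand-in for a real model, for demonstration only."""
    canned = {
        "Respond.": "draft",
        "Critique the response.": "critique",
        "Rewrite the response.": "revised",
        "Propose an improved specification.": "updated spec",
    }
    return canned[prompt.splitlines()[-1]]


final_spec, responses = run_metasc_sketch(toy_lm, "Be safe and honest.", ["q1", "q2"])
```

In this sketch only the specification string changes between requests, which mirrors the idea of optimizing at test time without touching model weights. Swapping `toy_lm` for a call to an actual model would be the natural next step.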
