제어 가능한 안전 정렬: 다양한 안전 요구 사항에 대한 추론 시간 적응

초록

대형 언어 모델(Large Language Models, LLMs)의 안전 정렬에 대한 현재 패러다임은 일반적인 접근 방식을 따릅니다: 모델은 모델 제공 업체가 안전하지 않다고 판단한 콘텐츠와 상호 작용하지 않습니다. 이러한 방식은 문화와 지역에 따라 다양한 사회적 규범을 고려하지 못하여 유연성이 부족합니다. 게다가 사용자들은 다양한 안전 요구를 가질 수 있으며, 정적 안전 기준을 갖는 모델은 유용성이 부족하고 재정렬 비용이 너무 높아질 수 있습니다. 우리는 Controllable Safety Alignment (CoSA)을 제안합니다. 이는 다양한 안전 요구에 모델을 재조정하지 않고 적응시키기 위한 프레임워크로, 고정된 모델을 정렬하는 대신 시스템 프롬프트의 일부로 제공되는 원하는 안전 행동의 자유 형식의 자연어 설명인 안전 구성을 따르도록 모델을 정렬합니다. 모델의 안전 행동을 조정하기 위해 권한이 있는 사용자는 추론 시에 이러한 안전 구성을 수정하기만 하면 됩니다. 이를 위해 우리는 다양한 안전 구성에 쉽게 적응할 수 있도록 LLMs를 정렬하는 데이터 중심 방법인 CoSAlign을 제안합니다. 더불어, 도움이 되는 정도와 구성된 안전을 모두 고려하는 혁신적인 가용성 평가 프로토콜을 고안하여 이를 CoSA-Score로 요약하고, 실제 다양한 안전 요구 사례와 해당 평가 프롬프트로 구성된 CoSApien이라는 인간 작성 벤치마크를 구축합니다. 우리는 CoSAlign이 컨텍스트 정렬을 포함한 강력한 기준에 비해 상당한 가용성 향상을 이끌어낸다는 것을 보여줍니다. 우리의 프레임워크는 LLMs에서 다양한 인간 가치를 더 잘 대표하고 적응시키도록 장려함으로써 그들의 실용성을 높이는 데 기여합니다.

English

The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs -- free-form natural language descriptions of the desired safety behaviors -- that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable that, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, a human-authored benchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign leads to substantial gains of controllability over strong baselines including in-context alignment. Our framework encourages better representation and adaptation to pluralistic human values in LLMs, and thereby increasing their practicality.

제어 가능한 안전 정렬: 다양한 안전 요구 사항에 대한 추론 시간 적응

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

초록

Support