

Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

October 11, 2024
作者: Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, Benjamin Van Durme
cs.AI

Abstract

The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful and too costly to re-align. We propose Controllable Safety Alignment (CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to follow safety configs -- free-form natural language descriptions of the desired safety behaviors -- that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable this, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, a human-authored benchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts. We show that CoSAlign leads to substantial gains in controllability over strong baselines, including in-context alignment. Our framework encourages better representation of and adaptation to pluralistic human values in LLMs, thereby increasing their practicality.
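
To make the mechanism concrete, below is a minimal sketch of how a free-form safety config might be supplied as part of the system prompt and swapped at inference time. The config texts, the `build_messages` helper, and the chat-message layout are illustrative assumptions, not the paper's exact prompts or implementation.

```python
# Minimal sketch: a free-form "safety config" prepended to the system prompt.
# The configs and helper below are hypothetical examples, not CoSA's actual prompts.

# Hypothetical safety configs written by an authorized user (free-form natural language).
GAME_STUDIO_CONFIG = (
    "You may discuss graphic fictional violence for video-game scriptwriting, "
    "but refuse real-world instructions for weapons or harm to real people."
)
CLASSROOM_CONFIG = (
    "Audience is middle-school students: refuse violent, sexual, or drug-related "
    "content and keep explanations age-appropriate."
)

def build_messages(safety_config: str, user_prompt: str) -> list[dict]:
    """Place the chosen safety config inside the system prompt (chat-format messages)."""
    system_prompt = (
        "You are a helpful assistant. Follow this safety configuration:\n"
        f"{safety_config}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

# Adapting safety behavior requires no retraining: only the config string changes.
prompt = "Write a fight scene for our upcoming action game."
messages_permissive = build_messages(GAME_STUDIO_CONFIG, prompt)
messages_strict = build_messages(CLASSROOM_CONFIG, prompt)
print(messages_strict[0]["content"])
```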
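
The abstract states only that CoSA-Score summarizes helpfulness and configured safety per response; the sketch below shows one plausible aggregation under that description, assuming binary safety judgments and normalized helpfulness scores, and should not be read as the paper's exact formula.

```python
# Illustrative aggregation only: the exact CoSA-Score definition is not given in the
# abstract, so the scoring rule below is an assumption for illustration.

def cosa_score_sketch(judgments: list[dict]) -> float:
    """Fold per-response judgments into a single score in [-1, 1].

    Each judgment is assumed to contain:
      - "safe": bool, whether the response adheres to the given safety config
      - "helpful": float in [0, 1], judged helpfulness
    Config-violating responses are penalized; adhering responses contribute
    their helpfulness.
    """
    total = 0.0
    for j in judgments:
        total += j["helpful"] if j["safe"] else -1.0
    return total / len(judgments)

# Example: two config-adhering responses and one violation.
print(cosa_score_sketch([
    {"safe": True, "helpful": 0.9},
    {"safe": True, "helpful": 0.4},
    {"safe": False, "helpful": 1.0},
]))  # -> 0.1
```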
