Almost Surely Safe Alignment of Large Language Models at Inference-Time
February 3, 2025
作者: Xiaotong Ji, Shyam Sundhar Ramesh, Matthieu Zimmer, Ilija Bogunovic, Jun Wang, Haitham Bou Ammar
cs.AI
Abstract
Even highly capable large language models (LLMs) can produce biased or unsafe
responses, and alignment techniques, such as RLHF, aimed at mitigating this
issue, are expensive and prone to overfitting as they retrain the LLM. This
paper introduces a novel inference-time alignment approach that ensures LLMs
generate safe responses almost surely, i.e., with a probability approaching
one. We achieve this by framing the safe generation of inference-time responses
as a constrained Markov decision process within the LLM's latent space.
Crucially, we augment a safety state that tracks the evolution of safety
constraints and enables us to demonstrate formal safety guarantees upon solving
the MDP in the latent space. Building on this foundation, we propose
InferenceGuard, a practical implementation that safely aligns LLMs without
modifying the model weights. Empirically, we demonstrate InferenceGuard
effectively balances safety and task performance, outperforming existing
inference-time alignment methods in generating safe and aligned responses.
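
For intuition only, the sketch below illustrates the general idea described in the abstract: decoding is treated as a constrained decision process whose state is augmented with a safety state tracking cumulative safety cost, and token selection is restricted so that this state stays within a budget. This is a simplified toy under assumed interfaces, not the paper's InferenceGuard implementation; `base_logits_fn` and `safety_cost_fn` are hypothetical stand-ins for a language model head and a learned safety cost model.

```python
# Minimal sketch (assumptions labeled): safety-state-augmented constrained decoding.
# Not the paper's algorithm; it only illustrates tracking an augmented safety state z
# and rejecting continuations that would exceed a safety budget d.

import math
import random

def step_safety_state(z: float, cost: float) -> float:
    """Augmented safety state: cumulative safety cost incurred so far."""
    return z + cost

def constrained_decode(prompt_tokens, base_logits_fn, safety_cost_fn,
                       budget: float = 1.0, max_new_tokens: int = 64, top_k: int = 20):
    """Decode over the augmented state (tokens, z), keeping z <= budget at every step."""
    tokens = list(prompt_tokens)
    z = 0.0  # safety state: must remain within the budget throughout generation
    for _ in range(max_new_tokens):
        logits = base_logits_fn(tokens)  # hypothetical: dict mapping token -> logit
        candidates = sorted(logits, key=logits.get, reverse=True)[:top_k]
        # Keep only candidates whose updated safety state stays within budget.
        feasible = [t for t in candidates
                    if step_safety_state(z, safety_cost_fn(tokens, t)) <= budget]
        if not feasible:
            break  # no safe continuation: stop rather than violate the constraint
        # Sample among feasible candidates proportionally to their softmax weight.
        weights = [math.exp(logits[t]) for t in feasible]
        next_tok = random.choices(feasible, weights=weights, k=1)[0]
        z = step_safety_state(z, safety_cost_fn(tokens, next_tok))
        tokens.append(next_tok)
    return tokens, z
```

Because infeasible candidates are filtered at every step, any completed trajectory satisfies the cumulative safety constraint by construction; the paper's contribution goes further by performing this reasoning in the LLM's latent space with formal almost-sure guarantees.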