Almost Surely Safe Alignment of Large Language Models at Inference-Time

Abstract

Even highly capable large language models (LLMs) can produce biased or unsafe responses. Alignment techniques that aim to mitigate this issue, such as RLHF, retrain the LLM, which makes them expensive and prone to overfitting. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with a probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process (MDP) within the LLM's latent space. Crucially, we augment the MDP with a safety state that tracks the evolution of the safety constraints, which enables us to demonstrate formal safety guarantees upon solving the MDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.
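As a rough illustration of the safety-state idea, the sketch below augments a toy generation state with a remaining safety budget and filters candidate tokens against it. This is not the InferenceGuard implementation: it works on tokens rather than in the latent space, it selects greedily instead of solving the constrained MDP, and the cost table, budget values, and the names `step_cost`, `SafeState`, and `pick_safe` are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical per-token costs; a real system would use a learned cost model.
TOY_COSTS = {"harmful": 10, "risky": 6}


def step_cost(token: str) -> int:
    """Toy per-token safety cost (an assumption, not from the paper)."""
    return TOY_COSTS.get(token, 0)


@dataclass(frozen=True)
class SafeState:
    """Augmented state: tokens generated so far plus the remaining safety budget."""
    tokens: tuple
    budget: int

    def step(self, token: str) -> "SafeState":
        # The safety state evolves by subtracting the cost of each action.
        return SafeState(self.tokens + (token,), self.budget - step_cost(token))

    @property
    def is_safe(self) -> bool:
        return self.budget >= 0


def pick_safe(state, candidates):
    """Return the highest-scoring token whose successor state stays safe.

    `candidates` is a list of (token, task_score) pairs. Returns None when
    every candidate would exhaust the budget.
    """
    feasible = [(score, tok) for tok, score in candidates if state.step(tok).is_safe]
    if not feasible:
        return None
    return max(feasible)[1]


state = SafeState(tokens=(), budget=10)
for candidates in [
    [("risky", 0.9), ("fine", 0.5)],
    [("risky", 0.9), ("okay", 0.4)],
    [("harmful", 0.8), ("done", 0.3)],
]:
    token = pick_safe(state, candidates)
    if token is None:
        break  # no safe continuation exists in this toy setting
    state = state.step(token)

print(state.tokens, state.budget)
```

In this toy run the first "risky" token is accepted because the budget allows it, while the later "risky" and "harmful" candidates are rejected once the remaining budget is too small. The formal almost-sure guarantee described in the abstract comes from solving the augmented MDP, which this greedy filter does not attempt.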

