GuardReasoner: Towards Reasoning-based LLM Safeguards

Abstract

As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs that guides the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks across 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% in F1 score on average. We release the training data, code, and models of GuardReasoner at different scales (1B, 3B, 8B): https://github.com/yueliu1999/GuardReasoner/.
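The headline comparison is stated as an average F1 score over benchmarks. As a rough illustration of that metric only, the sketch below computes binary F1 for a guard model's predictions and then takes an unweighted mean over benchmarks. The label strings, the function names, and the choice of an unweighted mean are assumptions made for this example; the abstract does not say how the average is weighted.

```python
def f1_score(y_true, y_pred, positive="harmful"):
    """Binary F1 for the positive class (label name is an assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(per_benchmark_scores):
    """Unweighted mean over benchmarks (an assumption for this sketch)."""
    return sum(per_benchmark_scores) / len(per_benchmark_scores)
```

With one F1 value per benchmark, `mean_f1` gives a single number that can be compared across guard models; a sample-weighted mean would instead weight each benchmark by its size.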
