
GuardReasoner: Towards Reasoning-based LLM Safeguards

January 30, 2025
Authors: Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, Bryan Hooi
cs.AI

Abstract

As LLMs increasingly impact safety-critical applications, ensuring their safety using guardrails remains a key challenge. This paper proposes GuardReasoner, a new safeguard for LLMs, which guides the guard model to learn to reason. Concretely, we first create the GuardReasonerTrain dataset, which consists of 127K samples with 460K detailed reasoning steps. Then, we introduce reasoning SFT to unlock the reasoning capability of guard models. In addition, we present hard sample DPO to further strengthen their reasoning ability. In this manner, GuardReasoner achieves better performance, explainability, and generalizability. Extensive experiments and analyses on 13 benchmarks across 3 guardrail tasks demonstrate its superiority. Remarkably, GuardReasoner 8B surpasses GPT-4o+CoT by 5.74% and LLaMA Guard 3 8B by 20.84% in average F1 score. We release the training data, code, and GuardReasoner models at different scales (1B, 3B, 8B): https://github.com/yueliu1999/GuardReasoner/.
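
The abstract describes a two-stage training recipe: reasoning SFT to unlock step-by-step reasoning in the guard model, followed by hard sample DPO to strengthen it on ambiguous cases. The sketch below shows what such a pipeline could look like using the Hugging Face trl library. The dataset hub IDs, base model choice, and hyperparameters are illustrative assumptions, not the paper's released configuration; see the linked repository for the authors' actual code.

```python
# Minimal sketch of a reasoning-SFT -> hard-sample-DPO pipeline, assuming
# Hugging Face trl. Dataset IDs, base model, and hyperparameters below are
# hypothetical placeholders, not the paper's released setup.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer, DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.2-1B"  # assumed base; the paper releases 1B/3B/8B variants
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Stage 1: reasoning SFT on (prompt, reasoning steps + moderation label) pairs.
sft_data = load_dataset("yueliu1999/GuardReasonerTrain", split="train")  # hypothetical hub ID
sft_trainer = SFTTrainer(
    model=model,
    train_dataset=sft_data,
    args=SFTConfig(output_dir="guardreasoner-sft", num_train_epochs=1),
)
sft_trainer.train()

# Stage 2: hard sample DPO on preference pairs mined from ambiguous inputs,
# where "chosen" is a correct reasoning trace and "rejected" an incorrect one.
dpo_data = load_dataset("yueliu1999/GuardReasonerHardDPO", split="train")  # hypothetical hub ID
dpo_trainer = DPOTrainer(
    model=sft_trainer.model,
    train_dataset=dpo_data,  # expects "prompt", "chosen", "rejected" columns
    args=DPOConfig(output_dir="guardreasoner-dpo", beta=0.1),
    processing_class=tokenizer,
)
dpo_trainer.train()
```

The key design choice this illustrates is ordering: SFT first teaches the model to emit reasoning traces at all, and DPO then sharpens decisions only on the hard samples where sampled reasoning paths disagree, rather than on the full training set.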
