大型推理模型的潜在风险:R1模型的安全评估
The Hidden Risks of Large Reasoning Models: A Safety Assessment of R1
February 18, 2025
作者: Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, Xin Eric Wang
cs.AI
摘要
大型推理模型,如OpenAI-o3和DeepSeek-R1的快速发展,显著提升了复杂推理能力,超越了非推理型大语言模型(LLMs)。然而,这些模型增强的能力,加之DeepSeek-R1等模型的开源获取,引发了严重的安全担忧,尤其是其潜在滥用风险。本研究针对这些推理模型进行了全面的安全评估,利用既有的安全基准测试其合规性。此外,我们探究了它们对对抗性攻击(如越狱和提示注入)的易感性,以评估其在实际应用中的鲁棒性。通过多维度分析,我们揭示了四个关键发现:(1)开源R1模型与o3-mini模型在安全基准和攻击测试上存在显著安全差距,表明R1模型需加强安全措施。(2)蒸馏后的推理模型相较于其安全对齐的基础模型,表现出更差的安全性能。(3)模型的推理能力越强,回答不安全问题时可能造成的危害越大。(4)R1模型的思考过程比其最终答案带来更大的安全隐患。本研究为推理模型的安全影响提供了洞见,并强调了在R1模型安全性上进一步推进以缩小差距的必要性。
English
The rapid development of large reasoning models, such as OpenAI-o3 and
DeepSeek-R1, has led to significant improvements in complex reasoning over
non-reasoning large language models~(LLMs). However, their enhanced
capabilities, combined with the open-source access of models like DeepSeek-R1,
raise serious safety concerns, particularly regarding their potential for
misuse. In this work, we present a comprehensive safety assessment of these
reasoning models, leveraging established safety benchmarks to evaluate their
compliance with safety regulations. Furthermore, we investigate their
susceptibility to adversarial attacks, such as jailbreaking and prompt
injection, to assess their robustness in real-world applications. Through our
multi-faceted analysis, we uncover four key findings: (1) There is a
significant safety gap between the open-source R1 models and the o3-mini model,
on both safety benchmark and attack, suggesting more safety effort on R1 is
needed. (2) The distilled reasoning model shows poorer safety performance
compared to its safety-aligned base models. (3) The stronger the model's
reasoning ability, the greater the potential harm it may cause when answering
unsafe questions. (4) The thinking process in R1 models pose greater safety
concerns than their final answers. Our study provides insights into the
security implications of reasoning models and highlights the need for further
advancements in R1 models' safety to close the gap.Summary
AI-Generated Summary