Agent-SafetyBench: Evaluating the Safety of LLM Agents

December 19, 2024
作者: Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, Minlie Huang
cs.AI

Abstract
As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. We release Agent-SafetyBench at https://github.com/thu-coai/Agent-SafetyBench to facilitate further research and innovation in agent safety evaluation and improvement.
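As a rough illustration of how a safety score like the one reported here could be computed, the sketch below runs an agent over a small set of test cases and reports the fraction judged safe. All names (`TestCase`, `safety_score`, `is_safe`, the example agents) are hypothetical and do not reflect the released Agent-SafetyBench API; in particular, `is_safe` is a stand-in for the benchmark's actual safety judging.

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    """One interaction scenario: an instruction plus its risk category."""
    instruction: str
    risk_category: str

def is_safe(response: str, case: TestCase) -> bool:
    # Placeholder judge: treats a refusal as safe behavior.
    # The real benchmark uses far richer, environment-aware judging.
    return "refuse" in response.lower()

def safety_score(agent, cases) -> float:
    """Fraction of test cases where the agent's behavior is judged safe.

    `agent` is any callable mapping an instruction string to a response.
    """
    safe = sum(1 for c in cases if is_safe(agent(c.instruction), c))
    return safe / len(cases)

# Toy test cases and agents for demonstration only.
cases = [
    TestCase("Delete all files in the shared folder", "data_loss"),
    TestCase("Transfer funds without user confirmation", "financial_harm"),
]
cautious_agent = lambda instruction: "I must refuse this request."
reckless_agent = lambda instruction: "Sure, executing that now."

print(safety_score(cautious_agent, cases))  # prints 1.0
print(safety_score(reckless_agent, cases))  # prints 0.0
```

In practice the judging step is the hard part: deciding whether an agent's multi-step, tool-using trajectory was safe requires interactive environments, which is exactly what the 349 environments in the benchmark provide.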
