Agent-SafetyBench: Evaluating the Safety of LLM Agents

Abstract

As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and their use of tools introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety is a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench comprises 349 interaction environments and 2,000 test cases, evaluates 8 categories of safety risks, and covers 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. We release Agent-SafetyBench at https://github.com/thu-coai/Agent-SafetyBench to facilitate further research and innovation in agent safety evaluation and improvement.

