Agent-SafetyBench: Evaluating the Safety of LLM Agents

Abstract

As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and their use of tools introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety is a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the considerable need for improvement. Through quantitative analysis, we identify critical failure modes and summarize two fundamental safety defects in current LLM agents: lack of robustness and lack of risk awareness. Furthermore, our findings suggest that reliance on defense prompts alone is insufficient to address these safety issues, emphasizing the need for more advanced and robust strategies. We release Agent-SafetyBench at https://github.com/thu-coai/Agent-SafetyBench to facilitate further research and innovation in agent safety evaluation and improvement.
