SafeArena: Evaluating the Safety of Autonomous Web Agents

March 6, 2025
作者: Ada Defne Tur, Nicholas Meade, Xing Han Lù, Alejandra Zambrano, Arkil Patel, Esin Durmus, Spandana Gella, Karolina Stańczak, Siva Reddy
cs.AI

Abstract

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for malicious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories -- misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io
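To make the reported metric concrete, the sketch below shows how per-task outcomes could be aggregated into harmful-request completion rates, overall and per harm category. This is a minimal, hypothetical illustration: the `TaskResult` record, category labels, and `results` data are assumptions made here for clarity, not the SafeArena codebase or its official evaluation harness.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical record of a single benchmark task outcome (not SafeArena's actual schema).
@dataclass
class TaskResult:
    category: str    # e.g. "misinformation", "illegal activity", "harassment",
                     #      "cybercrime", "social bias", or "safe"
    harmful: bool    # True if the task belongs to the harmful split
    completed: bool  # True if the agent carried the task out end-to-end

def harmful_completion_rates(results):
    """Fraction of harmful tasks the agent completed, overall and per harm category."""
    per_category = defaultdict(lambda: [0, 0])  # category -> [completed, total]
    completed_total, harmful_total = 0, 0
    for r in results:
        if not r.harmful:
            continue
        harmful_total += 1
        completed_total += int(r.completed)
        per_category[r.category][0] += int(r.completed)
        per_category[r.category][1] += 1
    overall = completed_total / harmful_total if harmful_total else 0.0
    by_category = {c: done / total for c, (done, total) in per_category.items()}
    return overall, by_category

# Toy usage with made-up outcomes; a real run would load agent trajectories instead.
results = [
    TaskResult("misinformation", harmful=True, completed=True),
    TaskResult("cybercrime", harmful=True, completed=False),
    TaskResult("safe", harmful=False, completed=True),
]
overall, by_category = harmful_completion_rates(results)
print(f"harmful completion rate: {overall:.1%}", by_category)
```

Under this reading, a figure such as GPT-4o's 34.7% would correspond to the overall rate computed over the 250 harmful tasks.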
