WebGames: Challenging General-Purpose Web-Browsing AI Agents

February 25, 2025
Authors: George Thomas, Alex J. Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, Marvin Purtorab
cs.AI

Abstract

We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models, including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL, against human performance. Results reveal a substantial capability gap: the best AI system achieves a success rate of only 43.1%, compared to 95.7% for humans, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai, offering a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in the development of more capable web-browsing agents.
