The BrowserGym Ecosystem for Web Agent Research
December 6, 2024
Authors: Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste
cs.AI
Abstract
The BrowserGym ecosystem addresses the growing need for efficient evaluation
and benchmarking of web agents, particularly those leveraging automation and
Large Language Models (LLMs) for web interaction tasks. Many existing
benchmarks suffer from fragmentation and inconsistent evaluation methodologies,
making it challenging to achieve reliable comparisons and reproducible results.
BrowserGym aims to solve this by providing a unified, gym-like environment with
well-defined observation and action spaces, facilitating standardized
evaluation across diverse benchmarks. Combined with AgentLab, a complementary
framework that aids in agent creation, testing, and analysis, BrowserGym offers
flexibility for integrating new benchmarks while ensuring consistent evaluation
and comprehensive experiment management. This standardized approach seeks to
reduce the time and complexity of developing web agents, support more
reliable comparisons, and facilitate in-depth analysis of agent behaviors. It
could result in more adaptable, capable agents, ultimately accelerating
innovation in LLM-driven automation. As supporting evidence, we conduct the
first large-scale, multi-benchmark web agent experiment and compare the
performance of 6 state-of-the-art LLMs across all benchmarks currently
available in BrowserGym. Among other findings, our results highlight a large
discrepancy between OpenAI's and Anthropic's latest models, with
Claude-3.5-Sonnet leading the way on almost all benchmarks, except on
vision-related tasks where GPT-4o is superior. Despite these advancements, our
results emphasize that building robust and efficient web agents remains a
significant challenge, due to the inherent complexity of real-world web
environments and the limitations of current models.
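
To make the idea of a "gym-like environment with well-defined observation and action spaces" concrete, the following is a minimal sketch of a reset/step interaction loop and a success-rate evaluation over several tasks. It does not use the BrowserGym or AgentLab APIs: `ToyTask`, `ToyWebEnv`, `run_episode`, and `success_rate` are hypothetical names invented for illustration, the observation is assumed to be a small dictionary, and the action space is assumed to be a single `goto <url>` string command.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyTask:
    """Hypothetical stand-in for a benchmark task (not a BrowserGym class)."""
    goal: str
    target_url: str
    max_steps: int = 5


class ToyWebEnv:
    """Toy environment exposing a gym-like reset()/step() interface."""

    def __init__(self, task: ToyTask):
        self.task = task
        self.url = "about:blank"
        self.steps = 0

    def _obs(self) -> dict:
        # Assumed observation space: the task goal and the current URL.
        return {"goal": self.task.goal, "url": self.url}

    def reset(self) -> tuple[dict, dict]:
        self.url = "about:blank"
        self.steps = 0
        return self._obs(), {}

    def step(self, action: str) -> tuple[dict, float, bool, bool, dict]:
        # Assumed action space: the single text command "goto <url>".
        self.steps += 1
        verb, _, arg = action.partition(" ")
        if verb == "goto" and arg:
            self.url = arg
        success = self.url == self.task.target_url
        reward = 1.0 if success else 0.0
        terminated = success
        truncated = (not success) and self.steps >= self.task.max_steps
        return self._obs(), reward, terminated, truncated, {}


def run_episode(env: ToyWebEnv, agent: Callable[[dict], str]) -> float:
    """Run one episode and return the cumulative reward."""
    obs, _ = env.reset()
    total = 0.0
    while True:
        obs, reward, terminated, truncated, _ = env.step(agent(obs))
        total += reward
        if terminated or truncated:
            return total


def success_rate(tasks: list[ToyTask], agent: Callable[[dict], str]) -> float:
    """Evaluate the same agent on every task through the same interface."""
    return sum(run_episode(ToyWebEnv(t), agent) for t in tasks) / len(tasks)
```

Because every task is driven through the same `reset`/`step` loop, one agent function can be scored on many tasks without task-specific glue code, which is the kind of standardization the abstract describes.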