The BrowserGym Ecosystem for Web Agent Research

December 6, 2024
Authors: Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste
cs.AI

Abstract

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, which could yield more adaptable, capable agents and ultimately accelerate innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
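
To make the gym-like interface concrete, below is a minimal sketch of the observe-act loop against a BrowserGym environment. It assumes the "browsergym/openended" task ID and the noop() placeholder action from BrowserGym's default high-level action set, following the project README; exact identifiers and keyword arguments may vary between versions, so treat this as illustrative rather than canonical.

    # Minimal sketch of the gym-style observe-act loop described in the
    # abstract. The task ID and task_kwargs follow the BrowserGym README;
    # other benchmarks (MiniWoB, WebArena, ...) register their own task IDs.
    import gymnasium as gym
    import browsergym.core  # noqa: F401 -- registers the open-ended task

    env = gym.make(
        "browsergym/openended",
        task_kwargs={"start_url": "https://www.example.com/"},
    )
    obs, info = env.reset()

    for _ in range(10):  # episode capped for this sketch
        # A real agent would map the observation (DOM snapshot,
        # accessibility tree, screenshot, ...) to an action string from
        # the environment's action space; noop() is just a placeholder.
        action = "noop()"
        obs, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:
            break

    env.close()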

