The BrowserGym Ecosystem for Web Agent Research

Abstract

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, support more reliable comparisons, and facilitate in-depth analysis of agent behaviors. It could in turn yield more adaptable, capable agents and ultimately accelerate innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between the latest models from OpenAI and Anthropic, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
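
To make the idea of a "gym-like environment with well-defined observation and action spaces" concrete, the following is a minimal sketch of the reset/step interaction loop that such interfaces typically expose. It is a toy illustration only and does not use the BrowserGym API: `ToyWebEnv`, `goal_following_agent`, `run_episode`, `success_rate`, the dictionary-shaped observation, and the `click('...')` action strings are all hypothetical names and formats assumed for this example.

```python
class ToyWebEnv:
    """Hypothetical stand-in for a single web task (NOT the BrowserGym API)."""

    def __init__(self, target, max_steps=5):
        self.target = target        # name of the button the task wants clicked
        self.max_steps = max_steps  # step budget before the episode is cut off
        self.steps = 0

    def _observe(self):
        # Assumed observation format: a goal string plus the clickable elements.
        return {
            "goal": f"click the '{self.target}' button",
            "elements": ["cancel", "submit", "help"],
        }

    def reset(self):
        self.steps = 0
        return self._observe(), {}

    def step(self, action):
        # Assumed action format: a string such as "click('submit')".
        self.steps += 1
        success = action == f"click('{self.target}')"
        reward = 1.0 if success else 0.0
        terminated = success
        truncated = (not success) and self.steps >= self.max_steps
        return self._observe(), reward, terminated, truncated, {}


def goal_following_agent(obs):
    """Toy agent: click the first element whose name appears in the goal."""
    for name in obs["elements"]:
        if name in obs["goal"]:
            return f"click('{name}')"
    return f"click('{obs['elements'][0]}')"


def run_episode(env, agent):
    """Run one episode and return the total reward collected."""
    obs, _info = env.reset()
    total = 0.0
    done = False
    while not done:
        action = agent(obs)
        obs, reward, terminated, truncated, _info = env.step(action)
        total += reward
        done = terminated or truncated
    return total


def success_rate(envs, agent):
    """Average episode reward over a collection of tasks."""
    return sum(run_episode(env, agent) for env in envs) / len(envs)
```

Because every task in this sketch exposes the same `reset`/`step` contract, one evaluation function can score any agent on any collection of tasks. That is the property a unified interface is meant to provide across benchmarks.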
