The BrowserGym Ecosystem for Web Agent Research

December 6, 2024
Authors: Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste
cs.AI

Abstract

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of six state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge due to the inherent complexity of real-world web environments and the limitations of current models.
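Concretely, the gym-like interface means a web agent is driven through a standard Gymnasium reset/step loop. The sketch below is illustrative rather than taken from the paper: it assumes BrowserGym's registered environment IDs and string-formatted actions, and the start URL and placeholder action are hypothetical examples.

import gymnasium as gym
import browsergym.core  # importing registers BrowserGym tasks such as "browsergym/openended"

# Create an open-ended browsing task; the start URL here is an arbitrary example.
env = gym.make(
    "browsergym/openended",
    task_kwargs={"start_url": "https://www.example.com/"},
)
obs, info = env.reset()

done = False
while not done:
    # A real agent would map the observation (DOM snapshot, accessibility
    # tree, screenshot, ...) to an action string, e.g. via an LLM prompt.
    action = 'click("13")'  # placeholder: click the element with id "13"
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()

Under this interface, switching benchmarks amounts to changing the environment ID while the observation and action spaces stay fixed, which is what enables the standardized cross-benchmark evaluation the abstract describes.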
