Explorer: Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents

February 17, 2025
Authors: Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, Ahmed Awadallah
cs.AI

Abstract

Recent success in large multimodal models (LMMs) has sparked promising applications of agents capable of autonomously completing complex web tasks. While open-source LMM agents have made significant advances in offline evaluation benchmarks, their performance still falls substantially short of human-level capabilities in more realistic online settings. A key bottleneck is the lack of diverse and large-scale trajectory-level datasets across various domains, which are expensive to collect. In this paper, we address this challenge by developing a scalable recipe to synthesize the largest and most diverse trajectory-level dataset to date, containing over 94K successful multimodal web trajectories, spanning 49K unique URLs, 720K screenshots, and 33M web elements. In particular, we leverage extensive web exploration and refinement to obtain diverse task intents. The average cost is 28 cents per successful trajectory, making it affordable to a wide range of users in the community. Leveraging this dataset, we train Explorer, a multimodal web agent, and demonstrate strong performance on both offline and online web agent benchmarks such as Mind2Web-Live, Multimodal-Mind2Web, and MiniWob++. Additionally, our experiments highlight data scaling as a key driver for improving web agent capabilities. We hope this study makes state-of-the-art LMM-based agent research at a larger scale more accessible.
