Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Abstract

Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still fall well short of human performance. Incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, but running tree search directly on live websites poses significant safety risks and practical constraints because some actions, such as confirming a purchase, are irreversible. In this paper, we introduce a paradigm that augments language agents with model-based planning, pioneering the use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate the outcome of each candidate action (e.g., "what would happen if I click this button?") as a natural language description, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction, VisualWebArena and Mind2Web-live, demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.
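The simulate-then-evaluate step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `plan_one_step`, its parameters, and the toy stand-ins `toy_simulate` and `toy_score` are all hypothetical names introduced here for illustration. In a real system, `simulate` and `score` would presumably each be backed by an LLM call; the toy versions below use a lookup table and word overlap only so that the example runs without any model.

```python
from typing import Callable, Sequence


def plan_one_step(
    observation: str,
    goal: str,
    candidate_actions: Sequence[str],
    simulate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate action whose imagined outcome scores highest.

    simulate(observation, action) -> natural language description of the
        imagined next state (assumed to be an LLM call in practice).
    score(goal, imagined_outcome) -> estimated progress toward the goal
        (also assumed to be an LLM call in practice).
    """
    if not candidate_actions:
        raise ValueError("need at least one candidate action")
    best_action = candidate_actions[0]
    best_score = float("-inf")
    for action in candidate_actions:
        # Nothing is executed on the live site here: the outcome is imagined.
        imagined = simulate(observation, action)
        value = score(goal, imagined)
        if value > best_score:
            best_action, best_score = action, value
    return best_action


# Toy stand-ins (hypothetical) so the sketch runs without an LLM.
def toy_simulate(observation: str, action: str) -> str:
    outcomes = {
        "click 'Add to cart'": "The cart page opens and shows the mug was added to the cart.",
        "click 'Reviews'": "The page scrolls to the customer reviews section.",
    }
    return outcomes.get(action, "Nothing visible changes.")


def toy_score(goal: str, imagined: str) -> float:
    # Crude proxy: fraction of goal words that appear in the imagined outcome.
    goal_words = set(goal.lower().split())
    outcome_words = set(imagined.lower().split())
    return len(goal_words & outcome_words) / max(len(goal_words), 1)


chosen = plan_one_step(
    observation="Product page for a blue mug",
    goal="add the mug to the cart",
    candidate_actions=["click 'Add to cart'", "click 'Reviews'"],
    simulate=toy_simulate,
    score=toy_score,
)
print(chosen)
```

Only the chosen action would then be executed on the real website, which is how this style of planning avoids trying irreversible actions during search.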
