Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

Abstract

Language agents have demonstrated promising capabilities in automating web-based tasks, though their current reactive approaches still fall well short of human performance. While incorporating advanced planning algorithms, particularly tree search methods, could enhance these agents' performance, implementing tree search directly on live websites poses significant safety risks and practical constraints because some actions, such as confirming a purchase, are irreversible. In this paper, we introduce a novel paradigm that augments language agents with model-based planning, pioneering the use of large language models (LLMs) as world models in complex web environments. Our method, WebDreamer, builds on the key insight that LLMs inherently encode comprehensive knowledge about website structures and functionalities. Specifically, WebDreamer uses LLMs to simulate the outcome of each candidate action (e.g., "what would happen if I click this button?") as a natural language description, and then evaluates these imagined outcomes to determine the optimal action at each step. Empirical results on two representative web agent benchmarks with online interaction, VisualWebArena and Mind2Web-live, demonstrate that WebDreamer achieves substantial improvements over reactive baselines. By establishing the viability of LLMs as world models in web environments, this work lays the groundwork for a paradigm shift in automated web interaction. More broadly, our findings open exciting new avenues for future research into 1) optimizing LLMs specifically for world modeling in complex, dynamic environments, and 2) model-based speculative planning for language agents.
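
The simulate-then-evaluate step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function `plan_step` and the helpers `toy_simulate` and `toy_score` are hypothetical names introduced here, and the two helpers are simple stand-ins for what would be LLM calls (one acting as the world model, one scoring the imagined outcome against the task). The sketch assumes a single step of lookahead and a numeric score per imagined outcome.

```python
from typing import Callable, Sequence


def plan_step(
    task: str,
    observation: str,
    candidates: Sequence[str],
    simulate: Callable[[str, str], str],
    score: Callable[[str, str], float],
) -> str:
    """Return the candidate action whose imagined outcome scores highest.

    `simulate` plays the role of the world model: given the current
    observation and an action, it returns a natural language description
    of the predicted next state. `score` rates how much that imagined
    state advances the task. Both are assumptions of this sketch.
    """
    if not candidates:
        raise ValueError("at least one candidate action is required")
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        imagined = simulate(observation, action)  # nothing runs on the live site
        value = score(task, imagined)
        if value > best_score:
            best_action, best_score = action, value
    return best_action


def toy_simulate(observation: str, action: str) -> str:
    """Hypothetical stand-in for an LLM world model (lookup table)."""
    outcomes = {
        "click 'Add to cart'": "The cart now contains the item.",
        "click 'Help'": "A help page with FAQs is shown.",
    }
    return outcomes.get(action, "Nothing visible changes.")


def toy_score(task: str, imagined: str) -> float:
    """Hypothetical stand-in for an LLM evaluator (word overlap with the task)."""
    task_words = set(task.lower().split())
    outcome_words = set(imagined.lower().replace(".", "").split())
    return len(task_words & outcome_words) / max(len(task_words), 1)


if __name__ == "__main__":
    chosen = plan_step(
        task="put the item in the cart",
        observation="Product page with buttons: Add to cart, Help",
        candidates=["click 'Help'", "click 'Add to cart'", "scroll down"],
        simulate=toy_simulate,
        score=toy_score,
    )
    print(chosen)
```

Because every candidate is evaluated only in imagination, an irreversible action such as confirming a purchase is never executed during planning; only the single chosen action would be sent to the real website.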
