Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

Abstract

Large language models (LLMs) have recently gained much attention for building autonomous agents. However, the performance of current LLM-based web agents on long-horizon tasks is far from optimal, and they often make errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such irreversible mistakes because we are aware of the potential outcomes of our actions (e.g., losing money), an ability also known as a "world model". Motivated by this, our study begins with preliminary analyses that confirm the absence of world models in current LLMs (e.g., GPT-4o and Claude-3.5-Sonnet). We then present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges of training LLMs as world models that predict next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, in which the prediction objectives are free-form natural language descriptions that exclusively highlight important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents' policy selection without training, and that our agents are more cost- and time-efficient than recent tree-search-based agents.
