Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning
October 2, 2024
Authors: Xiao Yu, Baolin Peng, Vineeth Vajipey, Hao Cheng, Michel Galley, Jianfeng Gao, Zhou Yu
cs.AI
Abstract
Autonomous agents have demonstrated significant potential in automating
complex multistep decision-making tasks. However, even state-of-the-art
vision-language models (VLMs), such as GPT-4o, still fall short of human-level
performance, particularly in intricate web environments and long-horizon
planning tasks. To address these limitations, we introduce Reflective Monte
Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the
ability of AI agents, e.g., powered by GPT-4o, to explore the decision space on the
fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive
reflection, allowing agents to learn from past interactions and dynamically
improve their search efficiency; and 2) using multi-agent debate to provide
reliable state evaluation. Moreover, we improve the agent's performance by
fine-tuning GPT-4o through self-learning, using R-MCTS generated tree
traversals without any human-provided labels. On the challenging VisualWebArena
benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative
improvement across various tasks compared to the previous state-of-the-art.
Additionally, we show that the knowledge gained from test-time search can be
effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o
matches 97% of R-MCTS's performance while reducing compute usage by a factor of
four at test time. Furthermore, qualitative results reveal that the fine-tuned
GPT-4o model demonstrates the ability to explore the environment, evaluate a
state, and backtrack to viable ones when it detects that the current state
cannot lead to success. Moreover, our work demonstrates compute scaling
properties at both training time (data collection with R-MCTS) and test time.
These results suggest a promising research direction to enhance VLMs' reasoning
and planning capabilities for agentic applications via test-time search and
self-learning.
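
The abstract describes R-MCTS only at a high level. Below is a minimal, illustrative Python sketch of an R-MCTS-style search loop (selection, expansion, debate-style evaluation, reflection-adjusted values, backpropagation) on a toy integer environment. It is not the authors' implementation: the VLM calls are replaced by simple stand-ins, and the names `propose_actions`, `debate_value`, and `ReflectionMemory` are hypothetical.

```python
# Illustrative sketch only, assuming a toy environment; in the paper, the policy,
# evaluators, and reflections are all backed by GPT-4o rather than these stand-ins.
from __future__ import annotations

import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    state: int                      # toy state: an integer driven toward a target
    parent: Node | None = None
    action: int | None = None       # action taken from the parent to reach this state
    children: list[Node] = field(default_factory=list)
    visits: int = 0
    value_sum: float = 0.0

    def uct(self, c: float = 1.4) -> float:
        # Standard UCT score; unvisited nodes are explored first.
        if self.visits == 0:
            return float("inf")
        exploit = self.value_sum / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


class ReflectionMemory:
    """Toy stand-in for contrastive reflection: remembers states that led to
    failure in earlier traversals and penalizes them in later value estimates."""

    def __init__(self) -> None:
        self.failed_states: set[int] = set()

    def record(self, state: int, success: bool) -> None:
        if not success:
            self.failed_states.add(state)

    def adjust(self, state: int, value: float) -> float:
        return value - 0.5 if state in self.failed_states else value


def propose_actions(state: int) -> list[int]:
    # Stand-in for the VLM policy proposing candidate actions.
    return [-2, -1, 1, 2]


def debate_value(state: int, target: int, n_evaluators: int = 3) -> float:
    # Stand-in for multi-agent debate: average several noisy value estimates.
    estimates = [1.0 - abs(target - state) / 10 + random.gauss(0, 0.05)
                 for _ in range(n_evaluators)]
    return sum(estimates) / n_evaluators


def r_mcts(root_state: int, target: int, memory: ReflectionMemory,
           iterations: int = 200) -> int | None:
    root = Node(state=root_state)
    for _ in range(iterations):
        # 1) Selection: descend by UCT until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # 2) Expansion: add children for the proposed actions, then pick one.
        if node.visits > 0:
            for a in propose_actions(node.state):
                node.children.append(Node(state=node.state + a, parent=node, action=a))
            node = random.choice(node.children)
        # 3) Evaluation: debate-style estimate, adjusted by the reflection memory.
        value = memory.adjust(node.state, debate_value(node.state, target))
        memory.record(node.state, success=abs(node.state - target) <= 2)
        # 4) Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += value
            node = node.parent
    best = max(root.children, key=lambda n: n.visits) if root.children else None
    return best.action if best is not None else None


if __name__ == "__main__":
    memory = ReflectionMemory()
    print("first action toward the target:", r_mcts(root_state=0, target=5, memory=memory))
```

The sketch mirrors, in toy form, the two extensions named in the abstract: the state value comes from averaging several evaluators (debate), and a persistent reflection memory reshapes values so repeated searches avoid previously failed states.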