

OpenWebVoyager: Building Multimodal Web Agents via Iterative Real-World Exploration, Feedback and Optimization

October 25, 2024
作者: Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Hongming Zhang, Tianqing Fang, Zhenzhong Lan, Dong Yu
cs.AI

Abstract
The rapid development of large language and multimodal models has sparked significant interest in using proprietary models, such as GPT-4o, to develop autonomous agents capable of handling real-world scenarios like web navigation. Although recent open-source efforts have tried to equip agents with the ability to explore environments and continuously improve over time, they build text-only agents in synthetic environments where the reward signals are clearly defined. Such agents struggle to generalize to realistic settings that require multimodal perception abilities and lack ground-truth signals. In this paper, we introduce an open-source framework designed to facilitate the development of multimodal web agents that can autonomously conduct real-world exploration and improve themselves. We first train the base model with imitation learning to acquire basic capabilities. We then let the agent explore the open web and collect feedback on its trajectories. After that, it further improves its policy by learning from well-performing trajectories as judged by another general-purpose model. This exploration-feedback-optimization cycle can continue for several iterations. Experimental results show that our web agent successfully improves itself after each iteration, demonstrating strong performance across multiple test sets.
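The exploration-feedback-optimization cycle described above can be sketched as a simple loop. The sketch below is purely illustrative and not the paper's implementation: `run_agent`, `judge`, and `optimize` are hypothetical stand-ins for the multimodal policy rollout, the general-purpose judge model (e.g. GPT-4o scoring trajectories), and the fine-tuning step on judged-good trajectories; a scalar `skill` crudely models policy quality.

```python
import random

def run_agent(task, skill):
    """Toy rollout: record a trajectory whose success probability tracks skill."""
    success = random.random() < skill
    return {"task": task, "actions": [f"act_{i}" for i in range(3)], "success": success}

def judge(traj):
    """Stand-in for the general-purpose judge model: here we simply read the
    recorded outcome; in the paper a separate model scores each trajectory."""
    return traj["success"]

def optimize(skill, good_trajectories):
    """Stand-in for fine-tuning on judged-good trajectories: each batch of
    accepted data nudges the policy upward, capped at 1.0."""
    return min(1.0, skill + 0.05 * len(good_trajectories))

def exploration_feedback_optimization(tasks, skill=0.3, iterations=3):
    """Run the explore -> feedback -> optimize loop for several iterations."""
    for _ in range(iterations):
        trajectories = [run_agent(t, skill) for t in tasks]  # explore the web
        good = [t for t in trajectories if judge(t)]         # collect feedback
        skill = optimize(skill, good)                        # improve the policy
    return skill
```

Because `optimize` only ever adds to `skill`, the toy policy is monotonically non-decreasing across iterations, mirroring the paper's claim that the agent improves after each cycle.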

