Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery for Foundation Model Internet Agents

December 17, 2024
Authors: Yifei Zhou, Qianlan Yang, Kaixiang Lin, Min Bai, Xiong Zhou, Yu-Xiong Wang, Sergey Levine, Erran Li
cs.AI

Abstract

The vision of a broadly capable, goal-directed agent, such as an Internet-browsing agent in the digital world or a household humanoid in the physical world, has advanced rapidly thanks to the generalization capability of foundation models. Such a generalist agent needs a large and diverse skill repertoire, for example finding directions between two travel locations or buying specific items on the Internet. If each skill must be specified manually through a fixed set of human-annotated instructions, the agent's skill repertoire will necessarily be limited by the quantity and diversity of those instructions. In this work, we address this challenge by proposing Proposer-Agent-Evaluator (PAE), an effective learning system that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice, using context information about the environment such as user demos or, for Internet-browsing agents, even just the name of the website. The agent policy then attempts those tasks with thoughts and actual grounded operations in the real world, and the resulting trajectories are evaluated by an autonomous VLM-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policy through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. To the best of our knowledge, this work represents the first effective learning system to apply autonomous task proposal with RL for agents that generalizes to real-world human-annotated benchmarks with SOTA performance. Our open-source checkpoints and code can be found at https://yanqval.github.io/PAE/
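
The propose, act, evaluate loop described in the abstract can be pictured roughly as follows. This is only an illustrative sketch, not the authors' implementation: the names `Trajectory`, `run_pae_round`, `toy_propose`, `toy_act`, and `toy_evaluate` are hypothetical, the toy functions stand in for the foundation-model proposer, the agent policy, and the VLM-based evaluator, and the binary 0/1 reward is an assumption. The RL update that would consume the collected rewards is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    steps: List[str]      # interleaved thoughts and actions taken by the agent
    reward: float = 0.0   # assumed binary reward: 1.0 on success, 0.0 otherwise


def run_pae_round(
    context: str,
    propose: Callable[[str, int], List[str]],
    act: Callable[[str], List[str]],
    evaluate: Callable[[str, List[str]], bool],
    n_tasks: int,
) -> List[Trajectory]:
    """One round: propose tasks from context, roll out the agent, score each rollout."""
    trajectories = []
    for task in propose(context, n_tasks):
        steps = act(task)                                   # agent attempts the task
        reward = 1.0 if evaluate(task, steps) else 0.0      # evaluator verdict as reward
        trajectories.append(Trajectory(task, steps, reward))
    return trajectories


# Toy stand-ins (hypothetical) so the sketch runs without any model or browser.
def toy_propose(context: str, n: int) -> List[str]:
    return [f"Find item {i} on {context}" for i in range(n)]


def toy_act(task: str) -> List[str]:
    return [f"thought: plan for '{task}'", "action: click", "action: answer"]


def toy_evaluate(task: str, steps: List[str]) -> bool:
    # Counts a rollout as successful if it ends by giving an answer.
    return steps[-1] == "action: answer"


batch = run_pae_round("example.com", toy_propose, toy_act, toy_evaluate, 3)
rewards = [t.reward for t in batch]
```

In a real system, the collected `batch` of rewarded trajectories would then be passed to an RL algorithm to update the agent policy, which is the step the abstract refers to as refining the policy through RL.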
