WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Abstract

Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents rely heavily on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents: the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and that of GLM-4-9B from 6.1% to 43%. These open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform the previous state-of-the-art web agent trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.

