WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

Abstract

Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents rely heavily on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. WebRL addresses three key challenges in building LLM web agents: the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. Specifically, WebRL incorporates 1) a self-evolving curriculum that generates new tasks from unsuccessful attempts, 2) a robust outcome-supervised reward model (ORM), and 3) adaptive reinforcement learning strategies to ensure consistent improvements. We apply WebRL to transform open Llama-3.1 and GLM-4 models into proficient web agents. On WebArena-Lite, WebRL improves the success rate of Llama-3.1-8B from 4.8% to 42.4% and that of GLM-4-9B from 6.1% to 43%. These open models significantly surpass GPT-4-Turbo (17.6%) and GPT-4o (13.9%) and outperform the previous state-of-the-art web agent trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL's effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.
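To put the reported numbers on one scale, the short sketch below computes the gains in percentage points and the margins over the baselines named in the abstract. The success rates are taken from the abstract as stated; the function and variable names (`gains`, `margins_over`, `reported`, `baselines`) are illustrative and not part of WebRL.

```python
# Success rates (%) on WebArena-Lite, copied from the abstract.
# Each entry maps a model to (before WebRL, after WebRL).
reported = {
    "Llama-3.1-8B": (4.8, 42.4),
    "GLM-4-9B": (6.1, 43.0),
}

# Baseline success rates (%) quoted in the abstract.
baselines = {"GPT-4-Turbo": 17.6, "GPT-4o": 13.9, "AutoWebGLM": 18.2}


def gains(before_after):
    """Absolute improvement in percentage points for each model."""
    return {
        model: round(after - before, 1)
        for model, (before, after) in before_after.items()
    }


def margins_over(baseline_scores, score):
    """Margin in percentage points of `score` over each baseline."""
    return {name: round(score - b, 1) for name, b in baseline_scores.items()}


if __name__ == "__main__":
    for model, delta in gains(reported).items():
        print(f"{model}: +{delta} points")
    llama_after = reported["Llama-3.1-8B"][1]
    for name, margin in margins_over(baselines, llama_after).items():
        print(f"Llama-3.1-8B + WebRL vs {name}: +{margin} points")
```

Under these figures, both models gain roughly 37 percentage points, and the WebRL-trained Llama-3.1-8B leads each quoted baseline by more than 24 points.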
