AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
December 12, 2024
Authors: Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu
cs.AI
Abstract
Graphical User Interface (GUI) agents hold great potential for automating
complex tasks across diverse digital environments, from web applications to
desktop software. However, the development of such agents is hindered by the
lack of high-quality, multi-step trajectory data required for effective
training. Existing approaches rely on expensive and labor-intensive human
annotation, making them unsustainable at scale. To address this challenge, we
propose AgentTrek, a scalable data synthesis pipeline that generates
high-quality GUI agent trajectories by leveraging web tutorials. Our method
automatically gathers tutorial-like texts from the internet, transforms them
into task goals with step-by-step instructions, and employs a visual-language
model agent to simulate their execution in a real digital environment. A
VLM-based evaluator ensures the correctness of the generated trajectories. We
demonstrate that training GUI agents with these synthesized trajectories
significantly improves their grounding and planning performance over
current models. Moreover, our approach is more cost-efficient than
traditional human annotation methods. This work underscores the potential of
guided replay with web tutorials as a viable strategy for large-scale GUI agent
training, paving the way for more capable and autonomous digital agents.
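
The four stages named in the abstract (gather tutorial-like text, convert it into a task goal with step-by-step instructions, replay it with an agent, and filter with an evaluator) can be pictured with a minimal Python sketch. This is only an illustration under stated assumptions, not the authors' implementation: every name here (`Task`, `Trajectory`, `looks_like_tutorial`, `to_task`, `synthesize`) is hypothetical, the tutorial filter is a toy line-count heuristic, and the agent and evaluator are plain callables standing in for the VLM agent and the VLM-based evaluator.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Task:
    goal: str          # what the tutorial is trying to accomplish
    steps: List[str]   # step-by-step instructions taken from the tutorial


@dataclass
class Trajectory:
    task: Task
    actions: List[str] = field(default_factory=list)


def looks_like_tutorial(text: str) -> bool:
    # Toy heuristic (an assumption): a title line plus at least two step lines.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return len(lines) >= 3


def to_task(text: str) -> Task:
    # Assumption: the first non-empty line is the goal, the rest are steps.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return Task(goal=lines[0], steps=lines[1:])


def synthesize(
    texts: List[str],
    agent: Callable[[Task], List[str]],
    evaluator: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Keep only trajectories that the evaluator accepts."""
    kept = []
    for text in texts:
        if not looks_like_tutorial(text):
            continue  # stage 1: discard non-tutorial text
        task = to_task(text)  # stage 2: goal plus instructions
        traj = Trajectory(task=task, actions=agent(task))  # stage 3: guided replay
        if evaluator(traj):  # stage 4: correctness filter
            kept.append(traj)
    return kept


# Stand-ins for the VLM agent and VLM-based evaluator (both hypothetical).
def stub_agent(task: Task) -> List[str]:
    return [f"do: {step}" for step in task.steps]


def stub_evaluator(traj: Trajectory) -> bool:
    return len(traj.actions) == len(traj.task.steps)


texts = [
    "How to export a report\nOpen the Reports page\nClick Export",
    "Random blog post without steps",
]
kept = synthesize(texts, stub_agent, stub_evaluator)
```

In this sketch the second text is dropped by the filter, so `kept` holds a single trajectory. Passing the agent and evaluator as callables keeps the control flow visible while leaving the model-dependent parts open.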