AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

December 12, 2024
Authors: Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu
cs.AI

Abstract

Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose AgentTrek, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model (VLM) agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over current models. Moreover, our approach is more cost-efficient than traditional human annotation. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.
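
The pipeline described above (tutorial text, then task with step-by-step instructions, then guided replay, then evaluation) can be pictured with a minimal sketch. This is not the authors' implementation: every name here (`Task`, `Trajectory`, `tutorial_to_task`, `synthesize`, `echo_agent`, `length_evaluator`) is hypothetical, the tutorial parsing is a toy rule, and the agent and evaluator are plain callables standing in for the VLM agent and VLM-based evaluator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    goal: str          # what the tutorial is trying to accomplish
    steps: List[str]   # step-by-step instructions guiding the replay


@dataclass
class Trajectory:
    task: Task
    actions: List[str] = field(default_factory=list)


def tutorial_to_task(text: str) -> Optional[Task]:
    """Toy stand-in for the tutorial-to-task step (assumption):
    the first non-empty line is the goal, the remaining lines are steps."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None  # no steps, so not tutorial-like
    return Task(goal=lines[0], steps=lines[1:])


def synthesize(
    texts: List[str],
    agent: Callable[[Task], List[str]],
    evaluator: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    """Replay each task with the agent; keep only trajectories the evaluator accepts."""
    kept = []
    for text in texts:
        task = tutorial_to_task(text)
        if task is None:
            continue
        trajectory = Trajectory(task=task, actions=agent(task))
        if evaluator(trajectory):
            kept.append(trajectory)
    return kept


# Placeholder agent and evaluator (hypothetical; real ones would call a VLM).
def echo_agent(task: Task) -> List[str]:
    return [f"do: {step}" for step in task.steps]


def length_evaluator(trajectory: Trajectory) -> bool:
    return len(trajectory.actions) == len(trajectory.task.steps)


if __name__ == "__main__":
    texts = ["Create a repo\nClick New\nEnter a name", "not a tutorial"]
    for traj in synthesize(texts, echo_agent, length_evaluator):
        print(traj.task.goal, traj.actions)
```

Passing the agent and evaluator in as callables keeps the sketch independent of any particular model or browser environment; only the filter-after-replay structure is taken from the abstract.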
