UI-TARS: Pioneering Automated GUI Interaction with Native Agents
January 21, 2025
Authors: Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi
cs.AI
Abstract
This paper introduces UI-TARS, a native GUI agent model that perceives only
screenshots as input and performs human-like interactions (e.g., keyboard
and mouse operations). Unlike prevailing agent frameworks that depend on
heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts
and workflows, UI-TARS is an end-to-end model that outperforms these
sophisticated frameworks. Experiments demonstrate its superior performance:
UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating
perception, grounding, and GUI task execution. Notably, in the OSWorld
benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15
steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld,
UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several
key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of
GUI screenshots for context-aware understanding of UI elements and precise
captioning; (2) Unified Action Modeling, which standardizes actions into a
unified space across platforms and achieves precise grounding and interaction
through large-scale action traces; (3) System-2 Reasoning, which incorporates
deliberate reasoning into multi-step decision making, involving multiple
reasoning patterns such as task decomposition, reflective thinking, and
milestone recognition; and (4) Iterative Training with Reflective Online Traces, which
addresses the data bottleneck by automatically collecting, filtering, and
reflectively refining new interaction traces on hundreds of virtual machines.
Through iterative training and reflection tuning, UI-TARS continuously learns
from its mistakes and adapts to unforeseen situations with minimal human
intervention. We also analyze the evolution path of GUI agents to guide the
further development of this domain.
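The unified action modeling in point (2) can be pictured as a small, platform-agnostic action schema into which model outputs from desktop, web, and mobile traces are all mapped. The sketch below is a minimal illustration under assumed names and fields; it is not UI-TARS's actual schema or parser.

```python
# Hypothetical sketch of a unified, cross-platform GUI action space.
# Action names and fields are illustrative assumptions, not the paper's exact format.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class GUIAction:
    """A single platform-agnostic GUI action grounded in screenshot coordinates."""
    kind: str                                   # e.g. "click", "drag", "type", "scroll"
    target: Optional[Tuple[int, int]] = None    # (x, y) in screenshot pixels, if needed
    end: Optional[Tuple[int, int]] = None       # end point for drag/scroll actions
    text: Optional[str] = None                  # payload for "type" actions


def parse_model_output(output: str) -> GUIAction:
    """Map a model-emitted action string like 'click(120, 340)' onto the unified space."""
    name, _, args = output.partition("(")
    args = args.rstrip(")")
    if name == "click":
        x, y = (int(v) for v in args.split(","))
        return GUIAction(kind="click", target=(x, y))
    if name == "type":
        return GUIAction(kind="type", text=args.strip().strip('"'))
    raise ValueError(f"unsupported action: {output}")


print(parse_model_output("click(120, 340)"))
```

Standardizing on one schema like this is what lets action traces collected on different platforms be pooled into a single training corpus for grounding and interaction.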
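Point (3), System-2 reasoning, means the agent emits an explicit deliberation (task decomposition, reflection, milestone tracking) before each grounded action. The following is an assumed, simplified illustration of one such thought-then-act step, not the exact trace format the paper uses.

```python
# Hypothetical thought-then-act step: the agent writes out its reasoning
# before committing to an action in the unified action space.
# The field layout is an assumption made for illustration.
from dataclasses import dataclass


@dataclass
class ReasoningStep:
    thought: str   # deliberate natural-language reasoning
    action: str    # grounded action in the unified action space


step = ReasoningStep(
    thought=(
        "Milestone: the settings page is open. "
        "Reflection: the previous click missed the toggle, so re-ground it. "
        "Next sub-goal: enable dark mode."
    ),
    action="click(412, 233)",
)
print(f"Thought: {step.thought}\nAction: {step.action}")
```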
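Point (4), iterative training with reflective online traces, amounts to a collect-filter-reflect-retrain loop run across many virtual machines. The outline below is a hedged sketch of that loop; every function in it is a simplified stand-in, since the paper does not expose its pipeline at this level of detail.

```python
# Hypothetical outline of the collect -> filter -> reflect -> retrain loop used to
# grow training data with minimal human intervention. All functions are stand-ins.
import random
from typing import List


def run_agent_in_vm(model, task: str) -> List[str]:
    """Stand-in rollout: returns the action trace produced by the agent in a VM."""
    return [f"{model}:{task}:step{i}" for i in range(3)]


def passes_filters(trace: List[str]) -> bool:
    """Stand-in quality filter, e.g. rule-based checks plus success heuristics."""
    return len(trace) > 0 and random.random() > 0.2


def reflect_and_correct(model, trace: List[str]) -> List[str]:
    """Stand-in reflection step: mark erroneous steps and append corrected ones."""
    return trace + ["corrected_final_step"]


def finetune(model, traces: List[List[str]]):
    """Stand-in fine-tuning call; in practice this updates the model weights."""
    return f"{model}+ft({len(traces)})"


def iterative_training(model, tasks: List[str], num_rounds: int = 3):
    for _ in range(num_rounds):
        traces = []
        for task in tasks:
            trace = run_agent_in_vm(model, task)        # roll out on a virtual machine
            if not passes_filters(trace):               # discard low-quality trajectories
                continue
            traces.append(reflect_and_correct(model, trace))
        model = finetune(model, traces)                 # train on the refined online traces
    return model


print(iterative_training("ui-tars", ["open settings", "send email"]))
```

The design point is that each round feeds the model's own corrected mistakes back into training, which is how the abstract's claim of learning from errors with minimal human intervention would be realized.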