Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents

Abstract

Computer use agents automate digital tasks by directly interacting with graphical user interfaces (GUIs) on computers and mobile devices, offering significant potential to enhance human productivity by completing an open-ended space of user queries. However, current agents face three major challenges: imprecise grounding of GUI elements, difficulty with long-horizon task planning, and performance bottlenecks from relying on a single generalist model for diverse cognitive tasks. To address these challenges, we introduce Agent S2, a compositional framework that delegates cognitive responsibilities across various generalist and specialist models. We propose a Mixture-of-Grounding technique to achieve precise GUI localization, and introduce Proactive Hierarchical Planning, which dynamically refines action plans at multiple temporal scales in response to evolving observations. Evaluations demonstrate that Agent S2 establishes new state-of-the-art (SOTA) performance on three prominent computer use benchmarks. Specifically, Agent S2 achieves relative improvements of 18.9% and 32.7% over leading baseline agents such as Claude Computer Use and UI-TARS on the OSWorld 15-step and 50-step evaluations, respectively. Moreover, Agent S2 generalizes effectively to other operating systems and applications, surpassing the previous best methods by a relative 52.8% on WindowsAgentArena and 16.52% on AndroidWorld. Code is available at https://github.com/simular-ai/Agent-S.
