WorldGUI: Dynamic Testing for Comprehensive Desktop GUI Automation

Abstract

Current GUI agents achieve strong performance on GUI element grounding. Planning, however, remains highly challenging, largely because it is sensitive to the initial state of the environment. Slight differences in the initial state, such as the target software not being open or the interface not being in its default state, often lead to planning errors. This issue is widespread in real user scenarios, yet existing benchmarks fail to evaluate it. In this paper, we present WorldGUI, a novel GUI benchmark that designs GUI tasks with various initial states to simulate real computer-user interactions. The benchmark covers a wide range of tasks across 10 popular software applications, including PowerPoint, VSCode, and Adobe Acrobat. To address the challenges of dynamic GUI automation tasks, we also propose GUI-Thinker, a holistic framework that leverages a critique mechanism to manage the unpredictability and complexity of GUI interactions. Experimental results show that GUI-Thinker outperforms Claude-3.5 (Computer Use) by 14.9% in success rate on WorldGUI tasks. This improvement underscores the effectiveness of our critical-thinking-based framework in enhancing GUI automation.
