A3: Android Agent Arena for Mobile GUI Agents

Abstract

AI agents have become increasingly prevalent in recent years, driven by significant advances in large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated, business-level, LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios. It provides a robust foundation for evaluating mobile GUI agents in real-world situations, along with a new autonomous evaluation process that reduces the need for human labor and coding expertise. The project is available at https://yuxiangchai.github.io/Android-Agent-Arena/.
