A3: Android Agent Arena for Mobile GUI Agents

January 2, 2025
Authors: Yuxiang Chai, Hanhao Li, Jiayu Zhang, Liang Liu, Guozhi Wang, Shuai Ren, Siyuan Huang, Hongsheng Li
cs.AI

Abstract

AI agents have become increasingly prevalent in recent years, driven by significant advances in large language models (LLMs). Mobile GUI agents, a subset of AI agents, are designed to autonomously perform tasks on mobile devices. While numerous studies have introduced agents, datasets, and benchmarks to advance mobile GUI agent research, many existing datasets focus on static frame evaluations and fail to provide a comprehensive platform for assessing performance on real-world, in-the-wild tasks. To address this gap, we present Android Agent Arena (A3), a novel evaluation platform. Unlike existing in-the-wild systems, A3 offers: (1) meaningful and practical tasks, such as real-time online information retrieval and operational instructions; (2) a larger, more flexible action space, enabling compatibility with agents trained on any dataset; and (3) an automated, business-level, LLM-based evaluation process. A3 includes 21 widely used general third-party apps and 201 tasks representative of common user scenarios, providing a robust foundation for evaluating mobile GUI agents in real-world situations, along with a new autonomous evaluation process that requires less human labor and coding expertise. The project is available at https://yuxiangchai.github.io/Android-Agent-Arena/.
