AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

Abstract

Autonomous agents have become increasingly important for interacting with the real world. Android agents, in particular, have recently become a widely discussed interaction method. However, existing work on training and evaluating Android agents lacks a systematic study of both open-source and closed-source models. In this work, we propose AndroidLab, a systematic framework for Android agents. It includes an operation environment that supports multiple modalities, an action space, and a reproducible benchmark. It supports both large language models (LLMs) and large multimodal models (LMMs) in the same action space. The AndroidLab benchmark includes predefined Android virtual devices and 138 tasks across nine apps built on these devices. Using the AndroidLab environment, we develop an Android Instruction dataset and train six open-source LLMs and LMMs, raising the average success rate from 4.59% to 21.50% for LLMs and from 1.93% to 13.28% for LMMs. AndroidLab is open source and publicly available at https://github.com/THUDM/Android-Lab.
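The claim that LLMs and LMMs operate in the same action space can be illustrated with a minimal Python sketch. The action names (`tap`, `type`, `back`, `finish`), their call syntax, and the `parse_action` helper below are hypothetical and are not taken from the AndroidLab code base; the sketch only shows how a text-only model reading a textual view of the screen and a multimodal model reading a screenshot could emit actions in one common format that is parsed and executed identically.

```python
import re
from dataclasses import dataclass
from typing import Union

# Hypothetical action types shared by text-only (LLM) and multimodal (LMM) agents.
@dataclass(frozen=True)
class Tap:
    x: int
    y: int

@dataclass(frozen=True)
class TypeText:
    text: str

@dataclass(frozen=True)
class Back:
    pass

@dataclass(frozen=True)
class Finish:
    message: str = ""

Action = Union[Tap, TypeText, Back, Finish]

# Matches a single call such as: tap(120, 480)
_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$")

def parse_action(raw: str) -> Action:
    """Parse one model output into an Action, regardless of which
    observation mode (text view or screenshot) the model was given."""
    m = _CALL.match(raw)
    if m is None:
        raise ValueError(f"unparseable action: {raw!r}")
    name, args = m.group(1).lower(), m.group(2).strip()
    if name == "tap":
        x, y = (int(v) for v in args.split(","))  # int() tolerates surrounding spaces
        return Tap(x, y)
    if name == "type":
        return TypeText(args.strip('"'))
    if name == "back":
        return Back()
    if name == "finish":
        return Finish(args.strip('"'))
    raise ValueError(f"unknown action: {name}")

if __name__ == "__main__":
    # The same parser serves outputs from either model family.
    for out in ["tap(120, 480)", 'type("hello")', "back()", 'finish("done")']:
        print(parse_action(out))
```

Under this assumed design, keeping the parser independent of the observation mode means both model families are scored against the same set of executable actions, so their success rates remain directly comparable.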
