Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration
February 24, 2025
Authors: Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang
cs.AI
Abstract
The rapid increase in mobile device usage necessitates improved automation
for seamless task management. However, many AI-driven frameworks struggle due
to insufficient operational knowledge. Manually written knowledge helps but is
labor-intensive and inefficient. To address these challenges, we introduce
Mobile-Agent-V, a framework that leverages video guidance to provide rich and
cost-effective operational knowledge for mobile automation. Mobile-Agent-V
enhances task execution capabilities by leveraging video inputs without
requiring specialized sampling or preprocessing. Mobile-Agent-V integrates a
sliding window strategy and incorporates a video agent and deep-reflection
agent to ensure that actions align with user instructions. Through this
innovative approach, users can record task processes with guidance, enabling
the system to autonomously learn and execute tasks efficiently. Experimental
results show that Mobile-Agent-V achieves a 30% performance improvement
compared to existing frameworks.
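The abstract mentions a sliding window strategy over the guidance video, so the agent reasons over a small span of frames rather than the full recording at each step. The paper's exact windowing parameters are not given here; the sketch below is only an illustration of the general idea, and the function name `sliding_windows` is hypothetical.

```python
from collections import deque

def sliding_windows(frames, window_size=5, stride=1):
    """Yield fixed-size windows of consecutive video frames.

    Illustrative sketch (not the paper's implementation): instead of
    feeding an entire operation video to the model, only a small
    window of frames around the current step is presented.
    """
    window = deque(maxlen=window_size)
    for i, frame in enumerate(frames):
        window.append(frame)
        # Emit a window once it is full, advancing by `stride` frames.
        if len(window) == window_size and (i - window_size + 1) % stride == 0:
            yield list(window)

# Example: 8 dummy frames, windows of 3 consecutive frames
frames = [f"frame_{i}" for i in range(8)]
windows = list(sliding_windows(frames, window_size=3))
print(len(windows))   # 6
print(windows[0])     # ['frame_0', 'frame_1', 'frame_2']
```

In the framework described, each such window would be handed to the video agent, with the deep-reflection agent checking that the resulting on-device actions stay aligned with the user's instruction.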