InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

January 8, 2025
Authors: Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu
cs.AI

Abstract

Graphical User Interface (GUI) Agents, powered by multimodal large language models (MLLMs), have shown great potential for task automation on computing devices such as computers and mobile phones. However, existing agents face challenges in multi-step reasoning and reliance on textual annotations, limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1 enhances fundamental skills such as GUI understanding and grounding, while Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning skills using synthesized data to enable native reasoning abilities of the agents. InfiGUIAgent achieves competitive performance on several GUI benchmarks, highlighting the impact of native reasoning skills in enhancing GUI interaction for automation tasks. Resources are available at https://github.com/Reallm-Labs/InfiGUIAgent.
