InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
January 8, 2025
Authors: Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, Fei Wu
cs.AI
Abstract
Graphical User Interface (GUI) Agents, powered by multimodal large language
models (MLLMs), have shown great potential for task automation on computing
devices such as computers and mobile phones. However, existing agents face
challenges in multi-step reasoning and reliance on textual annotations,
limiting their effectiveness. We introduce InfiGUIAgent, an MLLM-based
GUI Agent trained with a two-stage supervised fine-tuning pipeline. Stage 1
enhances fundamental skills such as GUI understanding and grounding, while
Stage 2 integrates hierarchical reasoning and expectation-reflection reasoning
skills using synthesized data to enable native reasoning abilities of the
agents. InfiGUIAgent achieves competitive performance on several GUI
benchmarks, highlighting the impact of native reasoning skills in enhancing GUI
interaction for automation tasks. Resources are available at
https://github.com/Reallm-Labs/InfiGUIAgent.
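The expectation-reflection reasoning described in the abstract can be pictured as a simple loop: before each GUI action the agent states what it expects to observe, then compares that expectation against the actual screen to decide whether to continue or replan. The sketch below illustrates this pattern only; all names (`Step`, `run_step`, `reflect`, and the example actions) are hypothetical and do not reflect the paper's actual implementation.

```python
# Hypothetical sketch of an expectation-reflection loop for a GUI agent.
# Names and data shapes are illustrative, not InfiGUIAgent's real API.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    action: str           # low-level GUI action, e.g. "tap(settings_icon)"
    expectation: str      # what the agent expects the next screen to show
    reflection: str = ""  # filled in after observing the real outcome


@dataclass
class AgentState:
    task: str
    history: List[Step] = field(default_factory=list)


def reflect(expected: str, observed: str) -> str:
    """Compare the stated expectation with the observation; a mismatch
    signals that the agent should replan instead of continuing blindly."""
    if expected == observed:
        return "as expected"
    return f"mismatch: expected '{expected}', saw '{observed}'"


def run_step(state: AgentState, action: str,
             expectation: str, observed: str) -> Step:
    """Execute one action, then record the expectation-reflection pair."""
    step = Step(action=action, expectation=expectation)
    step.reflection = reflect(expectation, observed)
    state.history.append(step)
    return step


state = AgentState(task="open Wi-Fi settings")
s1 = run_step(state, "tap(settings_icon)", "settings screen", "settings screen")
s2 = run_step(state, "tap(wifi_entry)", "wifi menu", "home screen")
print(s1.reflection)  # as expected
print(s2.reflection)  # mismatch: expected 'wifi menu', saw 'home screen'
```

Keeping the expectation and reflection alongside each action in the history gives later reasoning steps an explicit record of which assumptions held, which is the kind of signal the paper's Stage-2 training data is built to teach.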