PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

December 23, 2024
Authors: Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, Pengfei Liu
cs.AI

Abstract

Imagine a world where AI can handle your work while you sleep - organizing your research materials, drafting a report, or creating a presentation you need for tomorrow. However, while current digital agents can perform simple tasks, they are far from capable of handling the complex real-world work that humans routinely perform. We present PC Agent, an AI system that demonstrates a crucial step toward this vision through human cognition transfer. Our key insight is that the path from executing simple "tasks" to handling complex "work" lies in efficiently capturing and learning from human cognitive processes during computer use. To validate this hypothesis, we introduce three key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently collects high-quality human-computer interaction trajectories with complete cognitive context; (2) a two-stage cognition completion pipeline that transforms raw interaction data into rich cognitive trajectories by completing action semantics and thought processes; and (3) a multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding. Our preliminary experiments in PowerPoint presentation creation reveal that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data - PC Agent, trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications. This demonstrates the data efficiency of our approach, highlighting that the key to training capable digital agents lies in collecting human cognitive data. By open-sourcing our complete framework, including the data collection infrastructure and cognition completion methods, we aim to lower the barriers for the research community to develop truly capable digital agents.
