PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World

December 23, 2024
Authors: Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, Pengfei Liu
cs.AI

Abstract

Imagine a world where AI can handle your work while you sleep: organizing your research materials, drafting a report, or creating a presentation you need for tomorrow. Current digital agents can perform simple tasks, but they remain far from capable of handling the complex real-world work that humans routinely perform. We present PC Agent, an AI system that demonstrates a crucial step toward this vision through human cognition transfer. Our key insight is that the path from executing simple "tasks" to handling complex "work" lies in efficiently capturing and learning from human cognitive processes during computer use. To validate this hypothesis, we introduce three key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently collects high-quality human-computer interaction trajectories with complete cognitive context; (2) a two-stage cognition completion pipeline that transforms raw interaction data into rich cognitive trajectories by completing action semantics and thought processes; and (3) a multi-agent system combining a planning agent for decision-making with a grounding agent for robust visual grounding. Our preliminary experiments in PowerPoint presentation creation reveal that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data: PC Agent, trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications. This demonstrates the data efficiency of our approach and highlights that the key to training capable digital agents lies in collecting human cognitive data. By open-sourcing our complete framework, including the data collection infrastructure and cognition completion methods, we aim to lower the barriers for the research community to develop truly capable digital agents.
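The division of labor in innovation (3) can be illustrated with a minimal sketch. It is not the authors' implementation. Every name here (`PlannedAction`, `GroundedAction`, `run_episode`, the toy planner and grounder) is hypothetical, and the `max_steps` cap is an assumption for illustration only. The abstract states only that a planning agent makes decisions and a grounding agent resolves them visually. The sketch assumes the planner names a UI element in natural language and the grounder maps that description to screen coordinates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class PlannedAction:
    """Planner output: what to do, described in words (hypothetical schema)."""
    kind: str                      # e.g. "click", "type", or "done"
    target: Optional[str] = None   # natural-language description of a UI element
    text: Optional[str] = None     # text to enter, for "type" actions


@dataclass
class GroundedAction:
    """Executable action: the element description resolved to coordinates."""
    kind: str
    xy: Optional[Tuple[int, int]] = None
    text: Optional[str] = None


Planner = Callable[[str, List[GroundedAction]], PlannedAction]
Grounder = Callable[[str], Tuple[int, int]]


def run_episode(task: str, planner: Planner, grounder: Grounder,
                max_steps: int = 50) -> List[GroundedAction]:
    """Alternate planning and grounding until the planner says "done".

    max_steps is an illustrative safety cap, not a documented parameter.
    """
    history: List[GroundedAction] = []
    for _ in range(max_steps):
        plan = planner(task, history)
        if plan.kind == "done":
            break
        # Only actions that name a UI element need visual grounding.
        xy = grounder(plan.target) if plan.target is not None else None
        history.append(GroundedAction(plan.kind, xy, plan.text))
    return history


# Toy stand-ins for the two agents; real agents would be model calls.
def toy_planner(task: str, history: List[GroundedAction]) -> PlannedAction:
    script = [
        PlannedAction("click", target="New Slide button"),
        PlannedAction("type", text="Title"),
        PlannedAction("done"),
    ]
    return script[min(len(history), len(script) - 1)]


def toy_grounder(target: str) -> Tuple[int, int]:
    return {"New Slide button": (120, 80)}.get(target, (0, 0))


trajectory = run_episode("Create a slide", toy_planner, toy_grounder)
```

Keeping the two roles behind separate callables means that, in a design like this one, the grounding component could be swapped or evaluated without touching the decision logic.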
