PC Agent: While You Sleep, AI Works -- A Cognitive Journey into Digital World
December 23, 2024
Authors: Yanheng He, Jiahe Jin, Shijie Xia, Jiadi Su, Runze Fan, Haoyang Zou, Xiangkun Hu, Pengfei Liu
cs.AI
Abstract
Imagine a world where AI can handle your work while you sleep: organizing
your research materials, drafting a report, or creating a presentation you need
for tomorrow. However, while current digital agents can perform simple tasks,
they are far from capable of handling the complex real-world work that humans
routinely perform. We present PC Agent, an AI system that demonstrates a
crucial step toward this vision through human cognition transfer. Our key
insight is that the path from executing simple "tasks" to handling complex
"work" lies in efficiently capturing and learning from human cognitive
processes during computer use. To validate this hypothesis, we introduce three
key innovations: (1) PC Tracker, a lightweight infrastructure that efficiently
collects high-quality human-computer interaction trajectories with complete
cognitive context; (2) a two-stage cognition completion pipeline that
transforms raw interaction data into rich cognitive trajectories by completing
action semantics and thought processes; and (3) a multi-agent system combining
a planning agent for decision-making with a grounding agent for robust visual
grounding. Our preliminary experiments in PowerPoint presentation creation
reveal that complex digital work capabilities can be achieved with a small
amount of high-quality cognitive data: PC Agent, trained on just 133 cognitive
trajectories, can handle sophisticated work scenarios involving up to 50 steps
across multiple applications. This demonstrates the data efficiency of our
approach, highlighting that the key to training capable digital agents lies in
collecting human cognitive data. By open-sourcing our complete framework,
including the data collection infrastructure and cognition completion methods,
we aim to lower the barriers for the research community to develop truly
capable digital agents.
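
The division of labor between the planning agent and the grounding agent described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Step`, `run_episode`, `toy_plan`, and `toy_ground` are hypothetical, and both agents are replaced by hard-coded stubs where a real system would presumably call models that look at screenshots. The only detail borrowed from the abstract is the 50-step figure, used here as a default cap on episode length.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    """One step of a trajectory (hypothetical structure, for illustration only)."""
    thought: str                              # planner's reasoning for this step
    action: str                               # e.g. "click", "type", or "done"
    target: Optional[str]                     # natural-language description of a UI element
    coords: Optional[Tuple[int, int]] = None  # screen position filled in by the grounder


Planner = Callable[[str, List[Step]], Step]
Grounder = Callable[[str], Tuple[int, int]]


def run_episode(goal: str, plan: Planner, ground: Grounder,
                max_steps: int = 50) -> List[Step]:
    """Alternate planning and grounding until the planner says "done" or the cap is hit."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = plan(goal, history)            # decide what to do next
        if step.action != "done" and step.target is not None:
            step.coords = ground(step.target)  # resolve the description to coordinates
        history.append(step)
        if step.action == "done":
            break
    return history


# Stub agents (assumptions): a fixed script and a lookup table stand in for models.
def toy_plan(goal: str, history: List[Step]) -> Step:
    script = [("click", "New Slide button"), ("type", "title box")]
    if len(history) < len(script):
        action, target = script[len(history)]
        return Step(f"Step {len(history) + 1} toward: {goal}", action, target)
    return Step("Goal reached", "done", None)


def toy_ground(target: str) -> Tuple[int, int]:
    table = {"New Slide button": (40, 120), "title box": (400, 200)}
    return table[target]


if __name__ == "__main__":
    for s in run_episode("create a title slide", toy_plan, toy_ground):
        print(s.action, s.target, s.coords)
```

Keeping the planner's output as a textual description and leaving coordinates to a separate grounding function mirrors the split stated in the abstract: decision-making and visual grounding can then be swapped or improved independently.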