Large Action Models: From Inception to Implementation

Abstract

As AI continues to advance, there is a growing demand for systems that go beyond language-based assistance and move toward intelligent agents capable of performing real-world actions. This evolution requires the transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), designed for action generation and execution within dynamic environments. Enabled by agent systems, LAMs hold the potential to transform AI from passive language understanding to active task completion, marking a significant milestone in the progression toward artificial general intelligence. In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach to their creation, from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and delineating their differences from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide on the key stages of LAM development, including data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities that lie ahead in realizing the full potential of LAMs in real-world applications. The code for the data collection process utilized in this paper is publicly available at: https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.
