Large Action Models: From Inception to Implementation

Abstract

As AI continues to advance, there is growing demand for systems that go beyond language-based assistance toward intelligent agents capable of performing real-world actions. This evolution requires a transition from traditional Large Language Models (LLMs), which excel at generating textual responses, to Large Action Models (LAMs), which are designed to generate and execute actions within dynamic environments. Enabled by agent systems, LAMs have the potential to transform AI from passive language understanding to active task completion, marking a significant milestone on the path toward artificial general intelligence. In this paper, we present a comprehensive framework for developing LAMs, offering a systematic approach from inception to deployment. We begin with an overview of LAMs, highlighting their unique characteristics and how they differ from LLMs. Using a Windows OS-based agent as a case study, we provide a detailed, step-by-step guide to the key stages of LAM development: data collection, model training, environment integration, grounding, and evaluation. This generalizable workflow can serve as a blueprint for creating functional LAMs in various application domains. We conclude by identifying the current limitations of LAMs and discussing directions for future research and industrial deployment, emphasizing the challenges and opportunities in realizing the full potential of LAMs in real-world applications. The code for the data collection process used in this paper is publicly available at https://github.com/microsoft/UFO/tree/main/dataflow, and comprehensive documentation can be found at https://microsoft.github.io/UFO/dataflow/overview/.
