

Scaling Autonomous Agents via Automatic Reward Modeling And Planning

February 17, 2025
Authors: Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
cs.AI

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.
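The training pipeline described above (synthesized triplets of task intent, positive trajectory, and negative trajectory, used to fit a reward model that scores trajectories and guides planning) can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: the `Triplet` structure, the Bradley-Terry-style pairwise loss, and the `select_trajectory` helper are all assumed names, and a real implementation would score trajectories with a learned neural model rather than an arbitrary `score_fn`.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Triplet:
    """One synthesized training example, as described in the abstract."""
    intent: str     # task intent assigned by the second LLM
    positive: str   # correct action trajectory
    negative: str   # synthesized incorrect trajectory

def pairwise_loss(score_pos: float, score_neg: float) -> float:
    """Bradley-Terry-style pairwise loss: trains the reward model to
    score the correct trajectory above the incorrect one. The loss is
    -log(sigmoid(score_pos - score_neg)); it shrinks as the margin
    between the two scores grows."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_pos - score_neg))))

def select_trajectory(candidates: Sequence[str],
                      score_fn: Callable[[str], float]) -> str:
    """At planning time, the trained reward model acts as a heuristic:
    among candidate trajectories, keep the highest-scoring one."""
    return max(candidates, key=score_fn)
```

For example, with a zero margin the loss is log 2 ≈ 0.693, and it decreases toward 0 as the positive trajectory's score pulls ahead of the negative one's; the same scoring function then ranks candidate rollouts during inference-time planning.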

