

Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

January 20, 2025
Authors: Siyu Yuan, Zehui Chen, Zhiheng Xi, Junjie Ye, Zhengyin Du, Jiecao Chen
cs.AI

Abstract

Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly due to the inability to recover from errors. However, step-level critique data is difficult and expensive to collect. Automating and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables language Agents to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from it, we splice it with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, therefore yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
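The splicing mechanism described in the abstract (cut a failed rollout at the first error the actor model can identify, then continue with the adjacent correct path that shares the same MCTS parent node) can be illustrated with a minimal sketch. This is an illustration only, not the authors' implementation: the step representation and all names (`Step`, `build_revision_trajectory`, `first_error_step`, `revision_signal`) are hypothetical, and in the paper the first error step is identified by the actor model itself, which the sketch abstracts as a plain callback.

```python
# Hypothetical sketch of the trajectory-splicing idea from the Agent-R abstract.
# All names are illustrative; this is not the authors' code.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    action: str
    observation: str


def build_revision_trajectory(
    bad: List[Step],         # erroneous rollout from the MCTS tree
    good: List[Step],        # correct rollout sharing a parent node with `bad`
    shared_prefix_len: int,  # number of steps up to the common parent node
    first_error_step: Callable[[List[Step]], int],  # actor model's own error judgment
    revision_signal: str = "My previous action was wrong; let me revise it.",
) -> List[Step]:
    """Splice a failed trajectory with the adjacent correct path.

    Keep the bad trajectory only up to the first error the actor model can
    identify within its current capability, insert a revision signal, then
    continue with the correct path from the shared parent node onward.
    """
    cut = first_error_step(bad)
    # The two branches only diverge after the shared parent, so never cut earlier.
    cut = max(cut, shared_prefix_len)
    reflection = Step(action=revision_signal, observation="")
    return bad[:cut] + [reflection] + good[shared_prefix_len:]
```

Per the abstract, training then fine-tunes the actor on such spliced revision trajectories and repeats the whole construction as the policy improves, so the error-detection point keeps pace with the model's current capability.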
