Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training

Abstract

Large Language Model (LLM) agents are increasingly pivotal for addressing complex tasks in interactive environments. Existing work mainly focuses on enhancing performance through behavior cloning from stronger experts, yet such approaches often falter in real-world applications, mainly because the agent cannot recover from its errors. Step-level critique data would help, but it is difficult and expensive to collect. Automatically and dynamically constructing self-critique datasets is thus crucial to empowering models with intelligent agent capabilities. In this work, we propose an iterative self-training framework, Agent-R, that enables a language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages Monte Carlo Tree Search (MCTS) to construct training data that recover correct trajectories from erroneous ones. A key challenge of agent reflection lies in the necessity for timely revision rather than waiting until the end of a rollout. To address this, we introduce a model-guided critique construction mechanism: the actor model identifies the first error step (within its current capability) in a failed trajectory. Starting from that step, we splice the failed trajectory with the adjacent correct path, which shares the same parent node in the tree. This strategy enables the model to learn reflection based on its current policy, thereby yielding better learning efficiency. To further explore the scalability of this self-improvement paradigm, we investigate iterative refinement of both error correction capabilities and dataset construction. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction. Experiments on three interactive environments show that Agent-R effectively equips agents to correct erroneous actions while avoiding loops, achieving superior performance compared to baseline methods (+5.59%).
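
The splicing step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the Step record, the REVISION_SIGNAL text, the is_error callback (a stand-in for the actor model's judgment of the first error step), and the choice to locate the shared parent node as the longest common prefix of the two trajectories. The sketch does not model how the environment state is handled after the splice, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Step:
    """One agent turn: the action taken and the observation returned."""
    action: str
    observation: str


Trajectory = List[Step]

# Hypothetical reflection text; the abstract does not give the real wording.
REVISION_SIGNAL = "My previous action was a mistake; I will try a different one."


def shared_prefix_len(bad: Trajectory, good: Trajectory) -> int:
    """Number of leading steps the two paths share (assumed to end at the common parent)."""
    n = 0
    for b, g in zip(bad, good):
        if b != g:
            break
        n += 1
    return n


def splice_revision(
    bad: Trajectory,
    good: Trajectory,
    is_error: Callable[[Trajectory, int], bool],
) -> Optional[Trajectory]:
    """Build a revision trajectory from a failed path and a sibling correct path.

    is_error(bad, t) is a stand-in for the actor model deciding whether
    step t of the failed trajectory is an error.
    """
    fork = shared_prefix_len(bad, good)
    # Steps before the fork are identical in both paths, so only later steps are checked.
    for t in range(fork, len(bad)):
        if is_error(bad, t):
            # Keep the failed path up to and including the first flagged step,
            # insert the reflection, then continue along the correct branch.
            return bad[: t + 1] + [Step(REVISION_SIGNAL, "")] + good[fork:]
    return None  # no error step identified; nothing to splice
```

Cutting at the first flagged step, instead of at the end of the failed rollout, is what makes the revision "timely" in the sense the abstract describes: the training example shows the agent switching to the correct branch as soon as the error is detectable.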
