Scaling Autonomous Agents via Automatic Reward Modeling And Planning

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents' limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents' decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.
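
The triplet-based training and trajectory scoring described above can be illustrated with a minimal sketch. The abstract does not state the training objective or how the reward model is used during planning, so two assumptions are made here: a pairwise ranking loss, -log sigmoid(r_pos - r_neg), stands in for the objective, and best-of-N selection stands in for the planning heuristic. The names `Triplet`, `pairwise_loss`, `best_of_n`, and `toy_score` are hypothetical, and `toy_score` is a word-overlap placeholder for a learned reward model, not the authors' model.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Triplet:
    intent: str    # task intent assigned to a trajectory by the labeling LLM
    positive: str  # trajectory that fulfills the intent
    negative: str  # synthesized trajectory that does not fulfill it


# Hypothetical scorer signature: (intent, trajectory) -> scalar reward.
Scorer = Callable[[str, str], float]


def pairwise_loss(score: Scorer, t: Triplet) -> float:
    """Assumed objective: -log(sigmoid(r_pos - r_neg)) for one triplet."""
    margin = score(t.intent, t.positive) - score(t.intent, t.negative)
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def best_of_n(score: Scorer, intent: str, candidates: List[str]) -> str:
    """Assumed planning heuristic: keep the highest-scoring candidate."""
    return max(candidates, key=lambda traj: score(intent, traj))


def toy_score(intent: str, trajectory: str) -> float:
    """Placeholder reward: number of words shared by intent and trajectory."""
    intent_words = set(intent.lower().split())
    trajectory_words = set(trajectory.lower().split())
    return float(len(intent_words & trajectory_words))


if __name__ == "__main__":
    example = Triplet(
        intent="buy red shoes",
        positive="search red shoes click buy",
        negative="search blue hat click back",
    )
    print(pairwise_loss(toy_score, example))
    print(best_of_n(toy_score, example.intent,
                    [example.negative, example.positive]))
```

In a real setup the scorer would be a trainable model and the loss would be minimized over many triplets; the loss falls below log 2 exactly when the positive trajectory is scored above the negative one.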
