Exploring Expert Failures Improves LLM Agent Tuning
April 17, 2025
Authors: Li-Cheng Lan, Andrew Bai, Minhao Cheng, Ruochen Wang, Cho-Jui Hsieh, Tianyi Zhou
cs.AI
Abstract
Large Language Models (LLMs) have shown tremendous potential as agents,
excelling at tasks that require multiple rounds of reasoning and interactions.
Rejection Sampling Fine-Tuning (RFT) has emerged as an effective method for
finetuning LLMs as agents: it first imitates expert-generated successful
trajectories and further improves agentic skills through iterative fine-tuning
on successful, self-generated trajectories. However, since the expert (e.g.,
GPT-4) succeeds primarily on simpler subtasks and RFT inherently favors simpler
scenarios, many complex subtasks remain unsolved and persistently
out-of-distribution (OOD). Upon investigating these challenging subtasks, we
discovered that previously failed expert trajectories can often provide
valuable guidance, e.g., plans and key actions, that can significantly improve
agent exploration efficiency and acquisition of critical skills. Motivated by
these observations, we propose Exploring Expert Failures (EEF), which
identifies beneficial actions from failed expert trajectories and integrates
them into the training dataset. Potentially harmful actions are meticulously
excluded to prevent contamination of the model learning process. By leveraging
the beneficial actions in expert failures, EEF successfully solves some
previously unsolvable subtasks and improves agent tuning performance.
Remarkably, our approach achieved a 62% win rate in WebShop, outperforming RFT
(53.6%) and GPT-4 (35.6%), and, to the best of our knowledge, setting a new
state-of-the-art as the first method to surpass a score of 0.81 in WebShop and
exceed 81 in SciWorld.
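
For intuition, below is a minimal sketch of the data-construction idea the abstract describes: standard RFT keeps successful trajectories, while EEF additionally salvages beneficial actions from failed expert trajectories on subtasks that remain unsolved, excluding actions judged harmful. All names here (Trajectory, build_training_set, is_beneficial) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of RFT + Exploring Expert Failures (EEF) data construction.
# Data structures, function names, and the filtering criterion are assumptions
# made for illustration; they are not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    subtask_id: str
    actions: List[str]   # sequence of agent actions taken in the environment
    success: bool        # whether the trajectory solved the subtask
    from_expert: bool    # expert-generated (e.g., GPT-4) vs. self-generated rollout


def build_training_set(
    trajectories: List[Trajectory],
    is_beneficial: Callable[[str], bool],  # hypothetical per-action filter
) -> List[Tuple[str, List[str]]]:
    """Collect fine-tuning data.

    RFT part: keep every successful trajectory (expert or self-generated).
    EEF part: for subtasks no trajectory has solved, salvage the beneficial
    actions from failed expert trajectories and drop potentially harmful ones.
    """
    solved = {t.subtask_id for t in trajectories if t.success}
    train: List[Tuple[str, List[str]]] = []

    for t in trajectories:
        if t.success:
            # Standard rejection-sampling fine-tuning data.
            train.append((t.subtask_id, t.actions))
        elif t.from_expert and t.subtask_id not in solved:
            # Subtask is still unsolved: keep only actions judged beneficial
            # (e.g., a useful plan or key action) from the failed expert run.
            useful = [a for a in t.actions if is_beneficial(a)]
            if useful:
                train.append((t.subtask_id, useful))
    return train
```

In this sketch the self-generated failures are discarded as in RFT; only expert failures on still-unsolved subtasks contribute extra supervision, which mirrors the abstract's claim that such failures carry plans and key actions the agent could not discover on its own.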