Exploring Expert Failures Improves LLM Agent Tuning
April 17, 2025
Authors: Li-Cheng Lan, Andrew Bai, Minhao Cheng, Ruochen Wang, Cho-Jui Hsieh, Tianyi Zhou
cs.AI
Abstract
Large Language Models (LLMs) have shown tremendous potential as agents,
excelling at tasks that require multiple rounds of reasoning and interactions.
Rejection Sampling Fine-Tuning (RFT) has emerged as an effective method for
finetuning LLMs as agents: it first imitates expert-generated successful
trajectories and further improves agentic skills through iterative fine-tuning
on successful, self-generated trajectories. However, since the expert (e.g.,
GPT-4) succeeds primarily on simpler subtasks and RFT inherently favors simpler
scenarios, many complex subtasks remain unsolved and persistently
out-of-distribution (OOD). Upon investigating these challenging subtasks, we
discovered that previously failed expert trajectories can often provide
valuable guidance, e.g., plans and key actions, that can significantly improve
agent exploration efficiency and acquisition of critical skills. Motivated by
these observations, we propose Exploring Expert Failures (EEF), which
identifies beneficial actions from failed expert trajectories and integrates
them into the training dataset. Potentially harmful actions are meticulously
excluded to prevent contamination of the model learning process. By leveraging
the beneficial actions in expert failures, EEF successfully solves some
previously unsolvable subtasks and improves agent tuning performance.
Remarkably, our approach achieved a 62% win rate in WebShop, outperforming RFT
(53.6%) and GPT-4 (35.6%), and to the best of our knowledge, setting a new
state-of-the-art as the first method to surpass a score of 0.81 in WebShop and
exceed 81 in SciWorld.
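To make the selection step concrete, here is a minimal sketch of the idea the abstract describes, not the paper's actual implementation: the helpers `run_agent`, `Trajectory`, and the prefix-rollout criterion are hypothetical stand-ins for the real agent, environment, and filtering rule used by EEF.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Trajectory:
    subtask_id: str        # identifier of the subtask (e.g., a WebShop instruction)
    actions: List[str]     # sequence of actions taken in the environment
    success: bool          # whether the trajectory solved the subtask

def eef_select(expert_failures: List[Trajectory],
               run_agent: Callable[[str, List[str]], Trajectory],
               max_prefix: int = 5) -> List[Trajectory]:
    """From failed expert trajectories, keep action prefixes that let the agent
    solve a previously unsolved subtask when it resumes from them; prefixes that
    never lead to success are discarded so harmful actions stay out of training."""
    selected: List[Trajectory] = []
    for failure in expert_failures:
        for k in range(1, min(max_prefix, len(failure.actions)) + 1):
            prefix = failure.actions[:k]
            # Hypothetical rollout: the agent continues from the expert's prefix.
            rollout = run_agent(failure.subtask_id, prefix)
            if rollout.success:
                selected.append(rollout)  # beneficial prefix + agent completion
                break  # the shortest helpful prefix suffices for this subtask
    return selected
```

The selected trajectories would then be merged with the successful expert and self-generated trajectories used by RFT-style fine-tuning.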