DynaSaur: Large Language Agents Beyond Predefined Actions

Abstract

Existing LLM agent systems typically select an action from a fixed, predefined set at every step. While this approach is effective in closed, narrowly scoped environments, we argue that it presents two major challenges when LLM agents are deployed in real-world scenarios: (1) selecting from a fixed set of actions significantly restricts the planning and acting capabilities of LLM agents, and (2) it requires substantial human effort to enumerate and implement all possible actions, which becomes impractical in complex environments with a vast number of potential actions. In this work, we propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. In this framework, the agent interacts with the environment by generating and executing programs written in a general-purpose programming language at each step. Furthermore, generated actions are accumulated over time for future reuse. Our extensive experiments on the GAIA benchmark demonstrate that this framework offers significantly greater flexibility and outperforms previous methods. Notably, it allows an LLM agent to recover in scenarios where no relevant action exists in the predefined set or where existing actions fail due to unforeseen edge cases. At the time of writing, we hold the top position on the GAIA public leaderboard. Our code can be found at https://github.com/adobe-research/dynasaur.
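As a rough illustration of the loop the abstract describes (generate a program, execute it, keep any new actions for later steps), here is a minimal sketch. It is not the authors' implementation; the linked repository is the reference for that. It assumes Python as the general-purpose language, and `ActionLibrary`, `execute`, and the convention of reading the observation from a variable named `result` are hypothetical names introduced only for this example.

```python
import inspect
from typing import Any, Callable, Dict


class ActionLibrary:
    """Hypothetical store for actions (plain functions) created in earlier steps."""

    def __init__(self) -> None:
        self.actions: Dict[str, Callable[..., Any]] = {}

    def execute(self, program: str) -> Dict[str, Any]:
        """Run one generated program and keep any functions it defines."""
        # Previously accumulated actions are visible to the new program.
        namespace: Dict[str, Any] = dict(self.actions)
        try:
            # Unsandboxed exec: acceptable for an illustration only.
            exec(program, namespace)
        except Exception as err:
            # The error is returned as the observation, so the agent can
            # respond by writing a different program at the next step.
            return {"ok": False, "observation": repr(err)}
        # Accumulate newly defined functions for future reuse.
        for name, value in namespace.items():
            if inspect.isfunction(value) and not name.startswith("_"):
                self.actions[name] = value
        # Assumed convention: the program leaves its output in `result`.
        return {"ok": True, "observation": namespace.get("result")}


library = ActionLibrary()

# Step 1: the program defines a new action and uses it immediately.
step1 = library.execute(
    "def word_count(text):\n"
    "    return len(text.split())\n"
    "result = word_count('actions beyond a fixed set')\n"
)

# Step 2: a later program reuses the accumulated action without redefining it.
step2 = library.execute("result = word_count('reuse works')")

# Calling an action that does not exist fails without crashing the loop.
failed = library.execute("result = missing_action()")
```

In this sketch the strings passed to `execute` stand in for LLM output. A real system would need sandboxing and a policy for which generated functions are worth keeping; both are omitted here.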
