
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

April 22, 2025
Authors: Thomas Schmied, Jörg Bornschein, Jordi Grau-Moya, Markus Wulfmeier, Razvan Pascanu
cs.AI

Abstract

The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap: the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose to mitigate these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as epsilon-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
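
The abstract refers to epsilon-greedy as a classic exploration mechanism studied alongside RL fine-tuning, and to multi-armed bandits as one of the evaluation settings. As a point of reference only (not the paper's implementation), the minimal sketch below shows epsilon-greedy action selection on a Gaussian multi-armed bandit; the function name, arm means, and hyperparameters are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(arm_means, epsilon=0.1, steps=1000, seed=0):
    """Minimal epsilon-greedy agent on a Gaussian multi-armed bandit.

    With probability `epsilon` the agent explores a uniformly random arm;
    otherwise it greedily picks the arm with the highest empirical mean.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of pulls per arm
    values = [0.0] * n_arms    # running mean reward per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = rng.gauss(arm_means[arm], 1.0)                # noisy reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total_reward += reward

    return total_reward, values

if __name__ == "__main__":
    total, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
    print(f"total reward: {total:.1f}, estimated arm values: {estimates}")
```

A purely greedy agent corresponds to epsilon = 0, which illustrates the greediness failure mode the paper studies: early lucky pulls lock the agent onto one arm and the remaining arms are never re-explored.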

