
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

April 22, 2025
Authors: Thomas Schmied, Jörg Bornschein, Jordi Grau-Moya, Markus Wulfmeier, Razvan Pascanu
cs.AI

Abstract

The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap: the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose to mitigate these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as epsilon-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
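
The abstract refers to epsilon-greedy as a classic exploration mechanism studied alongside RL fine-tuning, and to multi-armed bandits as one of the evaluation settings. As a point of reference only (not the paper's implementation), the minimal sketch below shows epsilon-greedy action selection on a Gaussian multi-armed bandit; the function name, arm means, and hyperparameters are illustrative assumptions.

```python
import random

def epsilon_greedy_bandit(arm_means, epsilon=0.1, steps=1000, seed=0):
    """Minimal epsilon-greedy agent on a Gaussian multi-armed bandit.

    With probability `epsilon` the agent explores a uniformly random arm;
    otherwise it greedily picks the arm with the highest empirical mean.
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms      # number of pulls per arm
    values = [0.0] * n_arms    # running mean reward per arm
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                        # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])  # exploit
        reward = rng.gauss(arm_means[arm], 1.0)                # noisy reward
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean
        total_reward += reward

    return total_reward, values

if __name__ == "__main__":
    total, estimates = epsilon_greedy_bandit([0.2, 0.5, 0.8])
    print(f"total reward: {total:.1f}, estimated arm values: {estimates}")
```

A purely greedy agent corresponds to epsilon = 0, which illustrates the greediness failure mode the paper studies: early lucky pulls lock the agent onto one arm and the remaining arms are never re-explored.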

