Natural Language Reinforcement Learning

Abstract

Reinforcement Learning (RL) mathematically formulates decision-making as a Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper explores a new possibility, Natural Language Reinforcement Learning (NLRL), by extending the traditional MDP to a natural language-based representation space. Specifically, NLRL redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, as their language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement through either pure prompting or gradient-based training. Experiments on Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework across diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.
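
For readers less familiar with the classical objects named above, the following is a minimal sketch of conventional, numeric policy iteration (Bellman-style policy evaluation followed by greedy policy improvement). It is not the NLRL method and is not taken from the authors' code. The 1-D corridor task, the reward of -1 per step, the discount factor, and all function names are assumptions chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical toy task (not from the paper): a 1-D corridor with cells 0..3.
# Cell 3 is the terminal goal; every move costs -1.
N_STATES = 4
GOAL = 3
ACTIONS: Dict[str, int] = {"left": -1, "right": +1}
GAMMA = 0.9  # assumed discount factor


def step(state: int, action: str) -> Tuple[int, float]:
    """Deterministic transition: move, clip to the corridor, pay -1."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    return next_state, -1.0


def evaluate(policy: Dict[int, str], sweeps: int = 100) -> List[float]:
    """Policy evaluation: repeatedly apply the Bellman expectation backup."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES):
            if s == GOAL:
                continue  # terminal state keeps value 0
            next_state, reward = step(s, policy[s])
            values[s] = reward + GAMMA * values[next_state]
    return values


def improve(values: List[float]) -> Dict[int, str]:
    """Policy improvement: act greedily with respect to one-step lookahead."""
    policy: Dict[int, str] = {}
    for s in range(N_STATES):
        if s == GOAL:
            continue

        def q(action: str) -> float:
            next_state, reward = step(s, action)
            return reward + GAMMA * values[next_state]

        policy[s] = max(ACTIONS, key=q)
    return policy


def policy_iteration() -> Tuple[Dict[int, str], List[float]]:
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = {s: "left" for s in range(N_STATES) if s != GOAL}
    while True:
        values = evaluate(policy)
        new_policy = improve(values)
        if new_policy == policy:
            return policy, values
        policy = new_policy


if __name__ == "__main__":
    final_policy, final_values = policy_iteration()
    print(final_policy)
    print(final_values)
```

In this numeric setting the value function is a list of floats and the policy is a lookup table. As described above, NLRL replaces these with language counterparts and relies on an LLM for the evaluation and improvement steps; the sketch makes no attempt to reproduce that.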