Natural Language Reinforcement Learning

Abstract

Reinforcement Learning (RL) mathematically formulates decision-making as a Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper explores a new possibility, Natural Language Reinforcement Learning (NLRL), which extends the traditional MDP to a natural language-based representation space. Specifically, NLRL redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, as their natural language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement through either pure prompting or gradient-based training. Experiments on Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework across diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.
