Natural Language Reinforcement Learning
November 21, 2024
Authors: Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang
cs.AI
Abstract
Reinforcement Learning (RL) mathematically formalizes decision-making with the Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper explores a new possibility, Natural Language Reinforcement Learning (NLRL), by extending the traditional MDP to a natural language-based representation space. Specifically, NLRL innovatively redefines RL principles, including task objectives, the policy, the value function, the Bellman equation, and policy iteration, as their language counterparts. With recent advances in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement through either pure prompting or gradient-based training. Experiments on Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework across diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.
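The abstract describes achieving RL-like policy and value improvement by pure prompting. As a rough illustration only, the sketch below shows what a prompt-based "language value function" and "language policy improvement" step could look like. Every name, prompt, and interface here is our own assumption for illustration; it is not code from the paper or its repository.

```python
# Hypothetical sketch of prompt-based "language" RL in the spirit of NLRL:
# the value function and policy are natural-language judgments produced by
# an LLM rather than numeric tables. Illustrative only.

from typing import Callable, List

LLM = Callable[[str], str]  # any text-in, text-out model endpoint


def language_value(llm: LLM, state: str) -> str:
    """Language analogue of V(s): a textual assessment of a state."""
    prompt = (
        f"Game state:\n{state}\n"
        "Assess this position for the current player: who is ahead, "
        "what are the key threats, and how promising is it overall?"
    )
    return llm(prompt)


def language_policy_improvement(llm: LLM, state: str,
                                actions: List[str],
                                evaluations: List[str]) -> str:
    """Language analogue of greedy improvement: choose the action whose
    successor state received the most favorable textual evaluation."""
    options = "\n".join(
        f"Action {a}: evaluation of resulting state: {e}"
        for a, e in zip(actions, evaluations)
    )
    prompt = (
        f"Current state:\n{state}\n\n"
        f"Candidate actions and evaluations:\n{options}\n\n"
        "Based only on these evaluations, reply with the single best action."
    )
    return llm(prompt)


if __name__ == "__main__":
    # Stub LLM so the sketch runs standalone; swap in a real model call.
    echo = lambda prompt: f"[model response to: {prompt[:40]}...]"
    board = "X | O | .\n. | X | .\n. | . | O"
    print(language_value(echo, board))
```

In this sketch, comparing textual evaluations of successor states plays the role a Bellman backup plays numerically; the gradient-based route mentioned in the abstract would presumably fine-tune the model on such evaluations instead of relying on prompting alone.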