Natural Language Reinforcement Learning
November 21, 2024
Authors: Xidong Feng, Ziyu Wan, Haotian Fu, Bo Liu, Mengyue Yang, Girish A. Koushik, Zhiyuan Hu, Ying Wen, Jun Wang
cs.AI
Abstract
Reinforcement Learning (RL) mathematically formulates decision-making with
Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable
breakthroughs across various domains, including games, robotics, and language
models. This paper seeks a new possibility, Natural Language Reinforcement
Learning (NLRL), by extending traditional MDP to natural language-based
representation space. Specifically, NLRL innovatively redefines RL principles,
including task objectives, policy, value function, Bellman equation, and policy
iteration, into their language counterparts. With recent advancements in large
language models (LLMs), NLRL can be practically implemented to achieve RL-like
policy and value improvement by either pure prompting or gradient-based
training. Experiments over Maze, Breakthrough, and Tic-Tac-Toe games
demonstrate the effectiveness, efficiency, and interpretability of the NLRL
framework among diverse use cases. Our code will be released at
https://github.com/waterhorse1/Natural-language-RL.
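To make the abstract's idea of "RL-like policy and value improvement by pure prompting" concrete, here is a minimal sketch (not the authors' implementation) of how a language value function and language policy improvement might look. The function `query_llm` is a hypothetical stand-in, stubbed with a fixed heuristic so the example is self-contained; a real system would issue these prompts to an actual LLM.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical LLM call, stubbed with a fixed heuristic for illustration.

    A real NLRL system would send the prompt to a large language model and
    parse its free-text response.
    """
    if "evaluate" in prompt:
        # Pretend the model judges actions mentioning "center" as strong.
        return "good move" if "center" in prompt else "weak move"
    return "center"


def language_value(state: str, action: str) -> str:
    """Language value function: a textual evaluation instead of a scalar."""
    return query_llm(f"evaluate: in state '{state}', how good is '{action}'?")


def improve_policy(state: str, actions: list[str]) -> str:
    """Language policy improvement: keep the action the evaluations favor,
    mirroring the greedy step of classical policy iteration."""
    evals = {a: language_value(state, a) for a in actions}
    favored = [a for a, e in evals.items() if "good" in e]
    return favored[0] if favored else actions[0]


best = improve_policy("empty board", ["corner", "center", "edge"])
print(best)  # the stub's heuristic favors "center"
```

The key design point, as the abstract describes, is that both the evaluation and the improvement step operate over natural-language representations rather than numeric value tables, so the same loop can be driven purely by prompting an LLM.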