Natural Language Reinforcement Learning

Abstract

Reinforcement Learning (RL) mathematically formulates decision-making as a Markov Decision Process (MDP). With MDPs, researchers have achieved remarkable breakthroughs across various domains, including games, robotics, and language models. This paper explores a new possibility, Natural Language Reinforcement Learning (NLRL), by extending the traditional MDP to a natural language-based representation space. Specifically, NLRL redefines RL principles, including task objectives, policy, value function, Bellman equation, and policy iteration, as their natural language counterparts. With recent advancements in large language models (LLMs), NLRL can be practically implemented to achieve RL-like policy and value improvement through either pure prompting or gradient-based training. Experiments on Maze, Breakthrough, and Tic-Tac-Toe games demonstrate the effectiveness, efficiency, and interpretability of the NLRL framework across diverse use cases. Our code will be released at https://github.com/waterhorse1/Natural-language-RL.