PokerBench: Training Large Language Models to become Professional Poker Players

Abstract

We introduce PokerBench, a benchmark for evaluating the poker-playing abilities of large language models (LLMs). While LLMs excel at traditional NLP tasks, applying them to complex, strategic games like poker poses a new challenge. Poker, a game of incomplete information, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete against each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus serves both as a unique benchmark for quick and reliable evaluation of the poker-playing ability of LLMs and as a comprehensive benchmark for studying the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench.
