PokerBench: Training Large Language Models to become Professional Poker Players
January 14, 2025
Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli
cs.AI
Abstract
We introduce PokerBench - a benchmark for evaluating the poker-playing
abilities of large language models (LLMs). As LLMs excel in traditional NLP
tasks, their application to complex, strategic games like poker poses a new
challenge. Poker, an incomplete information game, demands a multitude of skills
such as mathematics, reasoning, planning, strategy, and a deep understanding of
game theory and human psychology. This makes poker the ideal next frontier for
large language models. PokerBench consists of a comprehensive compilation of
11,000 of the most important scenarios, split between pre-flop and post-flop play,
developed in collaboration with trained poker players. We evaluate prominent
models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models,
finding that all state-of-the-art LLMs underperform in playing optimal poker.
However, after fine-tuning, these models show marked improvements. We validate
PokerBench by having models with different scores compete with each other,
demonstrating that higher scores on PokerBench lead to higher win rates in
actual poker games. Through gameplay between our fine-tuned model and GPT-4, we
also identify limitations of simple supervised fine-tuning for learning optimal
playing strategy, suggesting the need for more advanced methodologies for
effectively training language models to excel in games. PokerBench thus
presents a unique benchmark for a quick and reliable evaluation of the
poker-playing ability of LLMs as well as a comprehensive benchmark to study the
progress of LLMs in complex game-playing scenarios. The dataset and code will
be made available at: https://github.com/pokerllm/pokerbench.
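As a rough illustration of how a benchmark of this kind can be used, the sketch below scores a model by exact-match accuracy against the benchmark's optimal actions. The field names `prompt` and `optimal_action`, the toy scenarios, and the exact-match metric are assumptions for illustration only; the actual PokerBench dataset schema, scenario phrasing, and scoring procedure are defined in the paper and repository.

```python
def score_model_on_pokerbench(scenarios, model_act):
    """Compare a model's chosen actions against the benchmark's optimal actions.

    `scenarios` is an iterable of dicts with hypothetical fields
    `prompt` (a natural-language description of the spot) and
    `optimal_action` (e.g. "fold", "call", "raise"); the real
    PokerBench schema may differ.
    `model_act` is any callable mapping a prompt string to an action string.
    """
    correct = 0
    total = 0
    for spot in scenarios:
        predicted = model_act(spot["prompt"]).strip().lower()
        if predicted == spot["optimal_action"].strip().lower():
            correct += 1
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    # Toy stand-ins; real scenarios come from the PokerBench dataset.
    demo_scenarios = [
        {"prompt": "You are UTG with AKs, 100bb deep. Action?", "optimal_action": "raise"},
        {"prompt": "You are BB facing a 3x open with 72o. Action?", "optimal_action": "fold"},
    ]
    always_fold = lambda prompt: "fold"
    print(f"Accuracy: {score_model_on_pokerbench(demo_scenarios, always_fold):.2f}")
```

A real evaluation would replace `always_fold` with a call to the LLM under test and would likely need a more forgiving parser for free-form model outputs, but the overall loop of prompting on each scenario and comparing to a reference action is the same.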