PokerBench: Training Large Language Models to become Professional Poker Players

January 14, 2025
Authors: Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli
cs.AI

Abstract

We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete-information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform at playing optimal poker. After fine-tuning, however, these models show marked improvements. We validate PokerBench by having models with different scores compete against each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies to effectively train language models to excel in games. PokerBench thus presents a unique benchmark for quick and reliable evaluation of the poker-playing ability of LLMs, as well as a comprehensive benchmark for studying the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench.
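The abstract describes scoring models against the benchmark's optimal actions per scenario. As a rough illustration of what such scoring could look like, here is a minimal sketch of an exact-match evaluation loop; the field names (`game_state`, `optimal_action`) and the string-matching scoring rule are illustrative assumptions, not PokerBench's actual schema or metric.

```python
# Hypothetical sketch of a PokerBench-style evaluation loop.
# The scenario fields and scoring rule below are assumptions for
# illustration, not the benchmark's real format.

def evaluate(model_act, scenarios):
    """Score a model by exact-match accuracy against the optimal action."""
    correct = 0
    for s in scenarios:
        predicted = model_act(s["game_state"]).strip().lower()
        if predicted == s["optimal_action"].lower():
            correct += 1
    return correct / len(scenarios)

# Toy scenarios and a trivial "model" that always raises.
scenarios = [
    {"game_state": "UTG, 100bb, holding AKs", "optimal_action": "raise"},
    {"game_state": "BB facing 3-bet, holding 72o", "optimal_action": "fold"},
]
always_raise = lambda state: "raise"
print(evaluate(always_raise, scenarios))  # 0.5
```

Accuracy on such scenarios gives a fast proxy metric; the paper's head-to-head games then check that this proxy tracks real win rates.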

