PokerBench: Training Large Language Models to become Professional Poker Players

Abstract

We introduce PokerBench, a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete-information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform at playing optimal poker. After fine-tuning, however, these models show marked improvements. We validate PokerBench by having models with different scores compete against each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus provides a unique benchmark for quick and reliable evaluation of the poker-playing ability of LLMs, as well as a comprehensive benchmark for studying the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench.