PokerBench: Training Large Language Models to Become Professional Poker Players

Abstract

We introduce PokerBench, a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel at traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete-information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of the 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete against each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning an optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for quick and reliable evaluation of the poker-playing ability of LLMs, as well as a comprehensive benchmark for studying the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: https://github.com/pokerllm/pokerbench.
