BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
November 20, 2024
Authors: Davide Paglieri, Bartłomiej Cupiał, Samuel Coward, Ulyana Piterbarg, Maciej Wolczyk, Akbir Khan, Eduardo Pignatelli, Łukasz Kuciński, Lerrel Pinto, Rob Fergus, Jakob Nicolaus Foerster, Jack Parker-Holder, Tim Rocktäschel
cs.AI
Abstract
Large Language Models (LLMs) and Vision Language Models (VLMs) possess
extensive knowledge and exhibit promising reasoning abilities; however, they
still struggle to perform well in complex, dynamic environments. Real-world
tasks require handling intricate interactions, advanced spatial reasoning,
long-term planning, and continuous exploration of new strategies, yet
we lack effective methodologies for comprehensively evaluating these
capabilities. To address this gap, we introduce BALROG, a novel benchmark
designed to assess the agentic capabilities of LLMs and VLMs through a diverse
set of challenging games. Our benchmark incorporates a range of existing
reinforcement learning environments with varying levels of difficulty,
ranging from tasks that are solvable by non-expert humans in seconds to extremely
challenging ones that may take years to master (e.g., the NetHack Learning
Environment). We devise fine-grained metrics to measure performance and conduct
an extensive evaluation of several popular open-source and closed-source LLMs
and VLMs. Our findings indicate that while current models achieve partial
success in the easier games, they struggle significantly with more challenging
tasks. Notably, we observe severe deficiencies in vision-based decision-making,
as models perform worse when visual representations of the environments are
provided. We release BALROG as an open and user-friendly benchmark to
facilitate future research and development in the agentic community.