CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation
March 30, 2025
Authors: Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang
cs.AI
Abstract
Existing reasoning evaluation frameworks for Large Language Models (LLMs) and
Large Vision-Language Models (LVLMs) predominantly either assess text-based
reasoning or vision-language understanding capabilities, with limited dynamic
interplay between textual and visual constraints. To address this limitation,
we introduce CrossWordBench, a benchmark designed to evaluate the reasoning
capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a
task requiring multimodal adherence to semantic constraints from text-based
clues and intersectional constraints from visual grid structures.
CrossWordBench leverages a controllable puzzle generation framework that
produces puzzles in multiple formats (text and image) and offers different
evaluation strategies ranging from direct puzzle solving to interactive modes.
Our extensive evaluation of over 20 models reveals that reasoning LLMs
outperform non-reasoning models substantially by effectively leveraging
crossing-letter constraints. We further demonstrate that LVLMs struggle with
the task, showing a strong correlation between their puzzle-solving performance
and grid-parsing accuracy. Our findings offer insights into the limitations of
the reasoning capabilities of current LLMs and LVLMs, and provide an effective
approach for creating multimodal constrained tasks for future evaluations.
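The crossing-letter constraint the abstract highlights can be sketched minimally: an across answer and a down answer must agree on the letter at every cell where they intersect. The representation below (words placed at row/column coordinates) is an assumption chosen for illustration, not CrossWordBench's actual data format.

```python
def cells(word, row, col, direction):
    """Map each letter of a placed word to the grid cell it occupies."""
    dr, dc = (0, 1) if direction == "across" else (1, 0)
    return {(row + i * dr, col + i * dc): ch for i, ch in enumerate(word)}

def consistent(placements):
    """True iff every pair of placed words agrees at all shared cells."""
    grid = {}
    for word, row, col, direction in placements:
        for cell, ch in cells(word, row, col, direction).items():
            if grid.setdefault(cell, ch) != ch:
                return False  # crossing-letter conflict at this cell
    return True

# "CAT" across and "CAR" down both start at (0,0), sharing the 'C'.
print(consistent([("CAT", 0, 0, "across"), ("CAR", 0, 0, "down")]))  # True
# "BAR" down would demand 'B' where "CAT" already places 'C'.
print(consistent([("CAT", 0, 0, "across"), ("BAR", 0, 0, "down")]))  # False
```

A solver that exploits these constraints can prune candidate answers for one clue using letters already fixed by intersecting answers, which is the behavior the paper credits reasoning LLMs with using effectively.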