

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

March 30, 2025
作者: Jixuan Leng, Chengsong Huang, Langlin Huang, Bill Yuchen Lin, William W. Cohen, Haohan Wang, Jiaxin Huang
cs.AI

Abstract

Existing reasoning evaluation frameworks for Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) predominantly either assess text-based reasoning or vision-language understanding capabilities, with limited dynamic interplay between textual and visual constraints. To address this limitation, we introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of both LLMs and LVLMs through the medium of crossword puzzles, a task requiring multimodal adherence to semantic constraints from text-based clues and intersectional constraints from visual grid structures. CrossWordBench leverages a controllable puzzle generation framework that produces puzzles in multiple formats (text and image) and offers different evaluation strategies ranging from direct puzzle solving to interactive modes. Our extensive evaluation of over 20 models reveals that reasoning LLMs outperform non-reasoning models substantially by effectively leveraging crossing-letter constraints. We further demonstrate that LVLMs struggle with the task, showing a strong correlation between their puzzle-solving performance and grid-parsing accuracy. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
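The crossing-letter constraints highlighted in the abstract can be made concrete with a small sketch. The snippet below is an illustrative example only, not the CrossWordBench implementation: it assumes a hypothetical `Slot` structure for across/down entries and checks that every shared grid cell receives the same letter from all answers passing through it, which is the intersectional constraint the benchmark evaluates.

```python
# Minimal illustrative sketch (hypothetical names, not the paper's code):
# a crossword slot is an ordered set of grid cells, and a candidate solution
# must satisfy both length constraints and crossing-letter constraints.
from dataclasses import dataclass

@dataclass(frozen=True)
class Slot:
    name: str            # e.g. "1-Across" (hypothetical label)
    cells: tuple         # ordered (row, col) coordinates the answer occupies

def crossing_constraints_satisfied(answers: dict) -> bool:
    """Return True if every shared cell gets the same letter from all slots
    that pass through it (the crossing-letter constraint)."""
    cell_letters = {}
    for slot, word in answers.items():
        if len(word) != len(slot.cells):            # answer must fit its slot
            return False
        for cell, letter in zip(slot.cells, word.upper()):
            if cell_letters.setdefault(cell, letter) != letter:
                return False                        # conflicting crossing letter
    return True

# Tiny usage example: two slots sharing cell (0, 0).
across = Slot("1-Across", ((0, 0), (0, 1), (0, 2)))
down = Slot("1-Down", ((0, 0), (1, 0), (2, 0)))
print(crossing_constraints_satisfied({across: "CAT", down: "COW"}))  # True, both start with 'C'
print(crossing_constraints_satisfied({across: "CAT", down: "DOG"}))  # False, 'C' vs 'D' at (0, 0)
```

In this reading, text-based clues supply the semantic constraint on each answer, while the grid geometry supplies the intersectional constraint checked above; reasoning models that propagate letters across intersections can prune candidate answers far more effectively than models that treat each clue in isolation.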
