ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
February 3, 2025
Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
cs.AI
Abstract
We investigate the logical reasoning capabilities of large language models
(LLMs) and their scalability in complex non-monotonic reasoning. To this end,
we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM
reasoning performance on logic grid puzzles derived from constraint
satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with
controllable and quantifiable complexity, facilitating a systematic study of
the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By
encompassing a broad range of search space complexities and diverse logical
constraints, ZebraLogic provides a structured environment to evaluate reasoning
under increasing difficulty.
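To make the setup concrete, the sketch below encodes a tiny zebra-style puzzle as a CSP and solves it by brute force. It is a minimal illustration, not the authors' generator: the names, attributes, and clue set are invented for this example. It also shows why complexity is quantifiable here: with N houses and M attribute categories, a full assignment is one permutation per category, giving (N!)^M candidates.

```python
# Minimal illustrative sketch (not the authors' generator): a zebra-style
# logic grid puzzle encoded as a CSP. With N houses and M attribute
# categories, a full assignment is one permutation per category, so the
# search space holds (N!)^M candidate assignments.
from itertools import permutations
from math import factorial

N = 3                               # number of houses (positions 0..2)
names = ["Alice", "Bob", "Carol"]   # category 1
drinks = ["tea", "coffee", "milk"]  # category 2 (M = 2 categories)

def satisfies(name_pos, drink_pos):
    """Hypothetical clue set, invented for this example:
    (1) Alice lives directly left of the tea drinker;
    (2) Bob drinks milk;
    (3) coffee is drunk in the first house."""
    return (
        name_pos["Alice"] + 1 == drink_pos["tea"]
        and drink_pos["milk"] == name_pos["Bob"]
        and drink_pos["coffee"] == 0
    )

solutions = []
for name_perm in permutations(range(N)):        # house of each name
    for drink_perm in permutations(range(N)):   # house of each drink
        name_pos = dict(zip(names, name_perm))
        drink_pos = dict(zip(drinks, drink_perm))
        if satisfies(name_pos, drink_pos):
            solutions.append((name_pos, drink_pos))

print(f"search space: (N!)^M = {factorial(N)}^2 = {factorial(N) ** 2}")
print(f"solutions found: {len(solutions)}")     # 1 -> puzzle is well-posed
```

Scaling N or M (or weakening the clue set) grows this space combinatorially, which is what lets the framework dial puzzle difficulty in a controlled, measurable way.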
Our results reveal a significant decline in accuracy as problem complexity
grows -- a phenomenon we term the curse of complexity. This limitation persists
even with larger models and increased inference-time computation, suggesting
inherent constraints in current LLM reasoning capabilities. Additionally, we
explore strategies to enhance logical reasoning, including Best-of-N sampling,
backtracking mechanisms, and self-verification prompts. Our findings offer
critical insights into the scalability of LLM reasoning, highlight fundamental
limitations, and outline potential directions for improvement.Summary
AI-Generated Summary
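As a concrete illustration of one strategy named above, here is a minimal Best-of-N sampling sketch. The `generate` and `score` callables are hypothetical stand-ins for an LLM sampling call and an answer checker (for instance, a CSP constraint validator); neither is a real API, and this is not the authors' implementation.

```python
# Minimal Best-of-N sampling sketch. `generate` and `score` are hypothetical
# stand-ins for an LLM sampling call and an answer checker; neither is a
# real API, and this is not the authors' implementation.
import random
from typing import Callable

def best_of_n(
    generate: Callable[[str], str],   # draws one candidate answer
    score: Callable[[str], float],    # higher = more clues satisfied
    prompt: str,
    n: int = 8,
) -> str:
    """Sample n independent candidates and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with dummy stand-ins for the model and the verifier:
if __name__ == "__main__":
    random.seed(0)
    answers = ["house 1: tea", "house 1: coffee", "house 1: milk"]
    gen = lambda _prompt: random.choice(answers)
    check = lambda ans: 1.0 if ans == "house 1: coffee" else 0.0
    print(best_of_n(gen, check, "Solve the puzzle.", n=16))
```

The same harness extends naturally to self-verification by letting `score` re-prompt the model to check its own answer, at the cost of additional inference-time compute.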