ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
February 3, 2025
Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
cs.AI
Abstract
We investigate the logical reasoning capabilities of large language models
(LLMs) and their scalability in complex non-monotonic reasoning. To this end,
we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM
reasoning performance on logic grid puzzles derived from constraint
satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with
controllable and quantifiable complexity, facilitating a systematic study of
the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By
encompassing a broad range of search space complexities and diverse logical
constraints, ZebraLogic provides a structured environment to evaluate reasoning
under increasing difficulty.
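To make the setup concrete, the sketch below encodes a tiny zebra-style puzzle as a CSP and solves it by brute force. It is a minimal illustration, not the authors' generator: the names, attributes, and clue set are invented for this example. It also shows why complexity is quantifiable here: with N houses and M attribute categories, a full assignment is one permutation per category, giving (N!)^M candidates.

```python
# Minimal illustrative sketch (not the authors' generator): a zebra-style
# logic grid puzzle encoded as a CSP. With N houses and M attribute
# categories, a full assignment is one permutation per category, so the
# search space holds (N!)^M candidate assignments.
from itertools import permutations
from math import factorial

N = 3                               # number of houses (positions 0..2)
names = ["Alice", "Bob", "Carol"]   # category 1
drinks = ["tea", "coffee", "milk"]  # category 2 (M = 2 categories)

def satisfies(name_pos, drink_pos):
    """Hypothetical clue set, invented for this example:
    (1) Alice lives directly left of the tea drinker;
    (2) Bob drinks milk;
    (3) coffee is drunk in the first house."""
    return (
        name_pos["Alice"] + 1 == drink_pos["tea"]
        and drink_pos["milk"] == name_pos["Bob"]
        and drink_pos["coffee"] == 0
    )

solutions = []
for name_perm in permutations(range(N)):        # house of each name
    for drink_perm in permutations(range(N)):   # house of each drink
        name_pos = dict(zip(names, name_perm))
        drink_pos = dict(zip(drinks, drink_perm))
        if satisfies(name_pos, drink_pos):
            solutions.append((name_pos, drink_pos))

print(f"search space: (N!)^M = {factorial(N)}^2 = {factorial(N) ** 2}")
print(f"solutions found: {len(solutions)}")     # 1 -> puzzle is well-posed
```

Scaling N or M (or weakening the clue set) grows this space combinatorially, which is what lets the framework dial puzzle difficulty in a controlled, measurable way.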
Our results reveal a significant decline in accuracy as problem complexity
grows -- a phenomenon we term the curse of complexity. This limitation persists
even with larger models and increased inference-time computation, suggesting
inherent constraints in current LLM reasoning capabilities. Additionally, we
explore strategies to enhance logical reasoning, including Best-of-N sampling,
backtracking mechanisms, and self-verification prompts. Our findings offer
critical insights into the scalability of LLM reasoning, highlight fundamental
limitations, and outline potential directions for improvement.Summary
AI-Generated Summary
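As a concrete illustration of one strategy named above, here is a minimal Best-of-N sampling sketch. The `generate` and `score` callables are hypothetical stand-ins for an LLM sampling call and an answer checker (for instance, a CSP constraint validator); neither is a real API, and this is not the authors' implementation.

```python
# Minimal Best-of-N sampling sketch. `generate` and `score` are hypothetical
# stand-ins for an LLM sampling call and an answer checker; neither is a
# real API, and this is not the authors' implementation.
import random
from typing import Callable

def best_of_n(
    generate: Callable[[str], str],   # draws one candidate answer
    score: Callable[[str], float],    # higher = more clues satisfied
    prompt: str,
    n: int = 8,
) -> str:
    """Sample n independent candidates and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy usage with dummy stand-ins for the model and the verifier:
if __name__ == "__main__":
    random.seed(0)
    answers = ["house 1: tea", "house 1: coffee", "house 1: milk"]
    gen = lambda _prompt: random.choice(answers)
    check = lambda ans: 1.0 if ans == "house 1: coffee" else 0.0
    print(best_of_n(gen, check, "Solve the puzzle.", n=16))
```

The same harness extends naturally to self-verification by letting `score` re-prompt the model to check its own answer, at the cost of additional inference-time compute.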