

ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning

February 3, 2025
Authors: Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
cs.AI

Abstract

We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows -- a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
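The logic grid puzzles ZebraLogic evaluates are instances of constraint satisfaction problems: each house position must be assigned a unique value per attribute, subject to logical constraints. The following is a minimal illustrative sketch of such a puzzle, solved by exhaustive search; the puzzle instance and constraints here are hypothetical, not taken from the benchmark, whose generator produces puzzles of controlled complexity at scale.

```python
from itertools import permutations

# Hypothetical 3-house logic grid puzzle in the style ZebraLogic evaluates.
# Each house (position 0..2) has one color and one pet. Constraints:
#   1. The red house is immediately to the left of the blue house.
#   2. The dog owner lives in the green house.
#   3. The cat owner lives in house 0.
def solve():
    for colors in permutations(["red", "green", "blue"]):
        for pets in permutations(["dog", "cat", "bird"]):
            if (colors.index("red") + 1 == colors.index("blue")
                    and pets.index("dog") == colors.index("green")
                    and pets.index("cat") == 0):
                return list(zip(colors, pets))
    return None

solution = solve()
# → [('red', 'cat'), ('blue', 'bird'), ('green', 'dog')]
```

Brute-force enumeration works at this toy size; the paper's "curse of complexity" arises because the search space grows combinatorially with the number of houses and attributes, which is exactly the quantifiable difficulty axis the framework controls.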

