Generative Evaluation of Complex Reasoning in Large Language Models
April 3, 2025
Authors: Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
cs.AI
Abstract
With powerful large language models (LLMs) demonstrating superhuman reasoning
capabilities, a critical question arises: Do LLMs genuinely reason, or do they
merely recall answers from their extensive, web-scraped training datasets?
Publicly released benchmarks inevitably become contaminated once incorporated
into subsequent LLM training sets, undermining their reliability as faithful
assessments. To address this, we introduce KUMO, a generative evaluation
framework designed specifically for assessing reasoning in LLMs. KUMO
synergistically combines LLMs with symbolic engines to dynamically produce
diverse, multi-turn reasoning tasks that are partially observable and
adjustable in difficulty. Through an automated pipeline, KUMO continuously
generates novel tasks across open-ended domains, compelling models to
demonstrate genuine generalization rather than memorization. We evaluated 23
state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO,
benchmarking their reasoning abilities against university students. Our
findings reveal that many LLMs have surpassed university-level performance
on easy reasoning tasks, and that reasoning-scaled LLMs reach university-level
performance on complex reasoning challenges. Moreover, LLM performance on KUMO
tasks correlates strongly with results on newly released real-world reasoning
benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for
genuine LLM reasoning capabilities.
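The abstract describes KUMO's task generator only at a high level: an LLM paired with a symbolic engine that produces partially observable, multi-turn tasks of adjustable difficulty. As a rough illustration of that idea only, the minimal Python sketch below builds toy deduction tasks in which a hidden truth sits among candidate answers, probe actions reveal observations that eliminate wrong candidates, and an agent interacts over multiple turns until it can identify the truth. All names (`Task`, `generate_task`, `run_episode`) and the elimination-based task structure are illustrative assumptions, not KUMO's actual design or API.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    """A toy partially observable deduction task (illustrative, not KUMO's format).

    candidates: possible answers; exactly one is the hidden truth.
    eliminates: action name -> set of candidates that action's observation rules out.
    """
    candidates: list
    truth: str
    eliminates: dict


def generate_task(domain_terms, n_candidates=4, n_actions=6, seed=None):
    """Sample a fresh task instance; difficulty grows with n_candidates."""
    rng = random.Random(seed)
    candidates = rng.sample(domain_terms, n_candidates)
    truth = rng.choice(candidates)
    eliminates = {}
    for i in range(n_actions):
        # Each action rules out a random subset of the *wrong* candidates,
        # so the truth is never eliminated and the task remains solvable.
        wrong = [c for c in candidates if c != truth]
        eliminates[f"action_{i}"] = set(rng.sample(wrong, rng.randint(1, len(wrong))))
    return Task(candidates, truth, eliminates)


def run_episode(task, agent_policy, max_turns=10):
    """Multi-turn loop: the agent picks probe actions, observes which candidates
    are ruled out, and succeeds once only the hidden truth remains."""
    remaining = set(task.candidates)
    for _ in range(max_turns):
        if len(remaining) == 1:
            break
        action = agent_policy(sorted(remaining), sorted(task.eliminates))
        remaining -= task.eliminates[action]
    guess = sorted(remaining)[0] if len(remaining) == 1 else None
    return guess == task.truth


if __name__ == "__main__":
    # Hypothetical domain terms for illustration only.
    terms = ["flu", "cold", "allergy", "migraine", "anemia", "asthma"]
    task = generate_task(terms, n_candidates=4, n_actions=5, seed=0)
    # A trivial random policy standing in for an LLM agent.
    policy = lambda remaining, actions: random.choice(actions)
    print("solved:", run_episode(task, policy))
```

In this sketch, difficulty can be scaled by increasing `n_candidates` or capping `max_turns`, loosely mirroring the "adjustable in difficulty" property described in the abstract.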