Generative Evaluation of Complex Reasoning in Large Language Models
April 3, 2025
Authors: Haowei Lin, Xiangyu Wang, Ruilin Yan, Baizhou Huang, Haotian Ye, Jianhua Zhu, Zihao Wang, James Zou, Jianzhu Ma, Yitao Liang
cs.AI
Abstract
With powerful large language models (LLMs) demonstrating superhuman reasoning
capabilities, a critical question arises: Do LLMs genuinely reason, or do they
merely recall answers from their extensive, web-scraped training datasets?
Publicly released benchmarks inevitably become contaminated once incorporated
into subsequent LLM training sets, undermining their reliability as faithful
assessments. To address this, we introduce KUMO, a generative evaluation
framework designed specifically for assessing reasoning in LLMs. KUMO
synergistically combines LLMs with symbolic engines to dynamically produce
diverse, multi-turn reasoning tasks that are partially observable and
adjustable in difficulty. Through an automated pipeline, KUMO continuously
generates novel tasks across open-ended domains, compelling models to
demonstrate genuine generalization rather than memorization. We evaluated 23
state-of-the-art LLMs on 5,000 tasks across 100 domains created by KUMO,
benchmarking their reasoning abilities against university students. Our
findings reveal that many LLMs have surpassed university-level performance
on easy reasoning tasks, and that reasoning-scaled LLMs reach university-level
performance on complex reasoning challenges. Moreover, LLM performance on KUMO
tasks correlates strongly with results on newly released real-world reasoning
benchmarks, underscoring KUMO's value as a robust, enduring assessment tool for
genuine LLM reasoning capabilities.
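The abstract describes KUMO's task generator only at a high level: an LLM paired with a symbolic engine that produces partially observable, multi-turn tasks of adjustable difficulty. As a rough illustration of that idea only, the minimal Python sketch below builds toy deduction tasks in which a hidden truth sits among candidate answers, probe actions reveal observations that eliminate wrong candidates, and an agent interacts over multiple turns until it can identify the truth. All names (`Task`, `generate_task`, `run_episode`) and the elimination-based task structure are illustrative assumptions, not KUMO's actual design or API.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    """A toy partially observable deduction task (illustrative, not KUMO's format).

    candidates: possible answers; exactly one is the hidden truth.
    eliminates: action name -> set of candidates that action's observation rules out.
    """
    candidates: list
    truth: str
    eliminates: dict


def generate_task(domain_terms, n_candidates=4, n_actions=6, seed=None):
    """Sample a fresh task instance; difficulty grows with n_candidates."""
    rng = random.Random(seed)
    candidates = rng.sample(domain_terms, n_candidates)
    truth = rng.choice(candidates)
    eliminates = {}
    for i in range(n_actions):
        # Each action rules out a random subset of the *wrong* candidates,
        # so the truth is never eliminated and the task remains solvable.
        wrong = [c for c in candidates if c != truth]
        eliminates[f"action_{i}"] = set(rng.sample(wrong, rng.randint(1, len(wrong))))
    return Task(candidates, truth, eliminates)


def run_episode(task, agent_policy, max_turns=10):
    """Multi-turn loop: the agent picks probe actions, observes which candidates
    are ruled out, and succeeds once only the hidden truth remains."""
    remaining = set(task.candidates)
    for _ in range(max_turns):
        if len(remaining) == 1:
            break
        action = agent_policy(sorted(remaining), sorted(task.eliminates))
        remaining -= task.eliminates[action]
    guess = sorted(remaining)[0] if len(remaining) == 1 else None
    return guess == task.truth


if __name__ == "__main__":
    # Hypothetical domain terms for illustration only.
    terms = ["flu", "cold", "allergy", "migraine", "anemia", "asthma"]
    task = generate_task(terms, n_candidates=4, n_actions=5, seed=0)
    # A trivial random policy standing in for an LLM agent.
    policy = lambda remaining, actions: random.choice(actions)
    print("solved:", run_episode(task, policy))
```

In this sketch, difficulty can be scaled by increasing `n_candidates` or capping `max_turns`, loosely mirroring the "adjustable in difficulty" property described in the abstract.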