InductionBench: LLMs Fail in the Simplest Complexity Class
February 20, 2025
Authors: Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang
cs.AI
Abstract
Large language models (LLMs) have shown remarkable improvements in reasoning
and many existing benchmarks have been addressed by models such as o1 and o3
either fully or partially. However, a majority of these benchmarks emphasize
deductive reasoning, including mathematical and coding tasks in which rules
such as mathematical axioms or programming syntax are clearly defined, based on
which LLMs can plan and apply these rules to arrive at a solution. In contrast,
inductive reasoning, where one infers the underlying rules from observed data,
remains less explored. Such inductive processes lie at the heart of scientific
discovery, as they enable researchers to extract general principles from
empirical observations. To assess whether LLMs possess this capacity, we
introduce InductionBench, a new benchmark designed to evaluate the inductive
reasoning ability of LLMs. Our experimental findings reveal that even the most
advanced models available struggle to master the simplest complexity classes
within the subregular hierarchy of functions, highlighting a notable deficiency
in current LLMs' inductive reasoning capabilities. Code and data are available
at https://github.com/Wenyueh/inductive_reasoning_benchmark.
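To make the task format concrete, the following is a hypothetical illustration (not taken from the benchmark itself) of the kind of inductive-inference problem the abstract describes: a hidden string-to-string function from one of the simplest subregular classes generates input-output pairs, and the model must infer the underlying rule from those observations alone. The specific rule and examples here are invented for illustration.

```python
# Hypothetical example of an inductive-inference task in the spirit of
# InductionBench. The hidden rule (unknown to the model) rewrites 'a' as 'b'
# whenever the immediately preceding input symbol is 'b' -- a simple
# locally determined string-to-string function.

def hidden_rule(s: str) -> str:
    out = []
    prev = ""
    for ch in s:
        # Rewrite depends only on the current symbol and the previous one.
        out.append("b" if ch == "a" and prev == "b" else ch)
        prev = ch
    return "".join(out)

# The model is shown only observation pairs like these, and must induce
# the rule that maps inputs to outputs:
observations = {x: hidden_rule(x) for x in ["aba", "baa", "abba", "aaab"]}
print(observations)  # e.g. "aba" maps to "abb", "baa" maps to "bba"
```

Even though such a rule is trivial to state once known, inferring it from a finite set of observations is an inductive problem, which is the capability the benchmark probes.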