HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
December 30, 2024
Authors: Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang
cs.AI
Abstract
We introduce self-invoking code generation, a new task designed to evaluate
the progressive reasoning and problem-solving capabilities of LLMs. In this
task, models are presented with a base problem and a related, more complex
problem. They must solve the base problem and then utilize its solution to
address the more complex one. This work features three key contributions.
First, we propose a general recipe for generating more challenging versions of
existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP
Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on
self-invoking code generation. Second, from an analysis of experimental
results across twenty LLMs on our benchmarks, we make two important observations:
(i) Most LLMs excel in traditional code generation benchmarks like HumanEval
and MBPP, but their performance declines on self-invoking tasks. For example,
o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.
(ii) On self-invoking code generation tasks, instruction-tuned models
demonstrate only marginal improvements over their base models. Third, we
characterize the failure modes observed in our evaluation results. All
these results underscore the need for further advancements in self-invoking
code generation tasks and provide a new direction for future research on
enhancing LLMs' code reasoning capabilities.
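To make the task structure concrete, here is a minimal sketch of what a base/self-invoking problem pair could look like. The problem statements and function names (sort_numbers, sort_rows) are invented for illustration and are not drawn from HumanEval Pro or MBPP Pro.

```python
# A hypothetical base / self-invoking problem pair, assuming the task
# structure described in the abstract: solve the base problem, then
# reuse that solution inside the more complex problem.

def sort_numbers(numbers: list[int]) -> list[int]:
    """Base problem: return the list sorted in ascending order."""
    return sorted(numbers)


def sort_rows(matrix: list[list[int]]) -> list[list[int]]:
    """Self-invoking problem: sort every row of a matrix, reusing the
    solution to the base problem instead of re-deriving it."""
    return [sort_numbers(row) for row in matrix]


# The model is scored on solving both: the base function first, then
# the more complex function that invokes it.
assert sort_rows([[3, 1, 2], [9, 7]]) == [[1, 2, 3], [7, 9]]
```

The second problem is deliberately simple once the base solution exists; the difficulty the benchmarks probe is whether the model can carry its own solution forward correctly rather than solve each problem in isolation.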
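A note on the metric: the pass@1 figures quoted above are presumably the standard unbiased pass@k estimator of Chen et al. (2021) with k = 1. Under that assumption, with n samples generated per problem, of which c pass all unit tests,

$$\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right],$$

which for k = 1 reduces to the mean fraction of passing samples, $\mathbb{E}[c/n]$.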