HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation
December 30, 2024
Authors: Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang
cs.AI
Abstract
We introduce self-invoking code generation, a new task designed to evaluate
the progressive reasoning and problem-solving capabilities of LLMs. In this
task, models are presented with a base problem and a related, more complex
problem. They must solve the base problem and then utilize its solution to
address the more complex one. This work features three key contributions.
First, we propose a general recipe for generating more challenging versions of
existing benchmarks, resulting in three new benchmarks: HumanEval Pro, MBPP
Pro, and BigCodeBench-Lite Pro, specifically designed to assess LLMs on
self-invoking code generation. Second, from analyzing experimental results of
over twenty LLMs on our benchmarks, we make two important observations:
(i) Most LLMs excel in traditional code generation benchmarks like HumanEval
and MBPP, but their performance declines on self-invoking tasks. For example,
o1-mini achieves 96.2% pass@1 on HumanEval but only 76.2% on HumanEval Pro.
(ii) On the self-invoking code generation task, instruction-tuned models
demonstrate only marginal improvements compared to the base models. Third, we
disclose the types of failure modes that exist in our evaluation results. All
these results underscore the need for further advancements in self-invoking
code generation tasks and provide a new direction for future research on
enhancing LLMs' code reasoning capabilities.
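To make the task format concrete, below is a minimal, hypothetical sketch of a self-invoking problem pair; the function names and tasks are invented for illustration and are not drawn from HumanEval Pro or MBPP Pro themselves.

```python
def count_vowels(s: str) -> int:
    """Base problem: count the vowels in a single string."""
    return sum(1 for ch in s.lower() if ch in "aeiou")

def max_vowel_word(words: list[str]) -> str:
    """Self-invoking problem: return the word with the most vowels,
    reusing the base solution as a subroutine."""
    return max(words, key=count_vowels)

# The harder problem is solved by invoking the base solution, which is
# what distinguishes self-invoking generation from isolated snippet
# generation.
assert max_vowel_word(["code", "evaluation", "LLM"]) == "evaluation"
```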
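For reference, pass@k (of which pass@1 is the k = 1 case) is conventionally computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021); the abstract does not state whether this paper samples multiple generations or decodes greedily. A minimal sketch, assuming n samples per problem of which c pass the unit tests:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: estimated probability that at least one of k
    samples, drawn from n generations of which c are correct, passes
    all unit tests."""
    if n - c < k:
        return 1.0  # every size-k draw contains a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# With a single greedy sample per problem (n = 1), pass@1 is simply
# the fraction of problems whose one generation passes.
assert pass_at_k(1, 1, 1) == 1.0
assert math.isclose(pass_at_k(10, 3, 1), 0.3)
```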