ChatPaper.ai

Learning to Generate Unit Tests for Automated Debugging

February 3, 2025
Authors: Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
cs.AI

Abstract
Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to a large language model (LLM) as it iteratively debugs faulty code, motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions and candidate code. We integrate UTGen into UTDebug, a robust debugging pipeline that uses generated tests to help LLMs debug effectively. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting. We show that UTGen outperforms UT generation baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen's unit tests improves pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3% and 12.35% (respectively) over other LLM-based UT generation baselines.
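The validate-and-backtrack idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual UTDebug implementation: all function names (`run_tests`, `debug_loop`, `propose_edit`) are hypothetical, and each generated unit test is modeled as an (input, predicted output) pair. An edit is kept only if it passes strictly more of the generated tests than the current candidate; otherwise the pipeline backtracks to the previous version, which limits overfitting to noisy test signals.

```python
# Illustrative sketch of a validate-and-backtrack debugging loop over
# model-generated unit tests. Names and structure are hypothetical, not
# the UTDebug authors' API.

def run_tests(code_fn, unit_tests):
    """Count how many (input, expected_output) pairs the function passes."""
    passed = 0
    for test_input, expected in unit_tests:
        try:
            if code_fn(test_input) == expected:
                passed += 1
        except Exception:
            pass  # a crashing candidate counts as failing that test
    return passed

def debug_loop(candidate, propose_edit, unit_tests, max_rounds=3):
    """Iteratively accept only edits that improve on the generated tests."""
    best_score = run_tests(candidate, unit_tests)
    for _ in range(max_rounds):
        edited = propose_edit(candidate)       # e.g., an LLM repair call
        score = run_tests(edited, unit_tests)
        if score > best_score:                 # validate: edit helps, keep it
            candidate, best_score = edited, score
        # else: backtrack, i.e., stay with the previous candidate
        if best_score == len(unit_tests):      # all generated tests pass
            break
    return candidate
```

Because predicted outputs can be wrong, validating against several tests at once (rather than a single one) is what makes the accept/backtrack decision robust to individual noisy tests.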

