
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

November 4, 2024
Authors: André Storhaug, Jingyue Li
cs.AI

Abstract

The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.
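For readers unfamiliar with the PEFT methods compared in the abstract, the sketch below shows how LoRA can be attached to a causal language model with the Hugging Face peft library. This is a minimal illustration under assumed settings, not the authors' experimental setup: the model checkpoint, target module names, and hyperparameters (r, lora_alpha, dropout) are placeholders chosen for the example.

```python
# Minimal LoRA setup via the Hugging Face peft library.
# Illustrative only: checkpoint, target modules, and hyperparameters
# are assumptions, not the configuration used in the paper.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Any causal code LLM works here; CodeGen-350M is just a small example.
model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                # rank of the low-rank update matrices
    lora_alpha=16,      # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["qkv_proj"],  # module names are architecture-specific
)

# Wrap the base model; only the small adapter matrices are trainable,
# while the original weights stay frozen.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```

Swapping LoraConfig for peft's IA3Config or PromptTuningConfig would switch to the other two methods studied, with the base model weights left untouched in all three cases.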

