Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
March 3, 2025
Authors: Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, David Lo
cs.AI
Abstract
Recent advancements in generative AI have led to the widespread adoption of
large language models (LLMs) in software engineering, addressing numerous
long-standing challenges. However, a comprehensive study examining the
capabilities of LLMs in software vulnerability detection (SVD), a crucial
aspect of software security, is currently lacking. Existing research primarily
focuses on evaluating LLMs using C/C++ datasets. It typically explores only one
or two strategies among prompt engineering, instruction tuning, and sequence
classification fine-tuning for open-source LLMs. Consequently, there is a
significant knowledge gap regarding the effectiveness of diverse LLMs in
detecting vulnerabilities across various programming languages. To address this
knowledge gap, we present a comprehensive empirical study evaluating the
performance of LLMs on the SVD task. We have compiled a comprehensive dataset
comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in
JavaScript. We assess five open-source LLMs using multiple approaches,
including prompt engineering, instruction tuning, and sequence classification
fine-tuning. These LLMs are benchmarked against five fine-tuned small language
models and two open-source static application security testing tools.
Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data
perspective: Retraining models using downsampled balanced datasets. b) Model
perspective: Investigating ensemble learning methods that combine predictions
from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains
a challenging task for LLMs. This study provides a thorough understanding of
the role of LLMs in SVD and offers practical insights for future advancements
in leveraging generative AI to enhance software security practices.
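The two improvement avenues described in the abstract, downsampling to a balanced training set and combining predictions from multiple models by majority vote, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the helper names `downsample_balance` and `majority_vote` and the tie-breaking choice are assumptions.

```python
import random
from collections import Counter

def downsample_balance(samples, seed=0):
    """Randomly drop majority-class samples so both classes are equal size.

    Each sample is a (function_code, label) pair, where label 1 marks a
    vulnerable function and 0 a non-vulnerable one.
    """
    rng = random.Random(seed)
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

def majority_vote(predictions):
    """Combine binary predictions from several models.

    Ties are resolved toward 1 (vulnerable), a conservative choice for a
    security task; the paper may use a different combination rule.
    """
    votes = Counter(predictions)
    return 1 if votes[1] >= votes[0] else 0
```

For example, `majority_vote([1, 0, 1])` flags the function as vulnerable because two of the three models predict class 1.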