Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
March 3, 2025
Authors: Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, David Lo
cs.AI
Abstract
Recent advancements in generative AI have led to the widespread adoption of
large language models (LLMs) in software engineering, addressing numerous
long-standing challenges. However, a comprehensive study examining the
capabilities of LLMs in software vulnerability detection (SVD), a crucial
aspect of software security, is currently lacking. Existing research primarily
focuses on evaluating LLMs using C/C++ datasets. It typically explores only one
or two strategies among prompt engineering, instruction tuning, and sequence
classification fine-tuning for open-source LLMs. Consequently, there is a
significant knowledge gap regarding the effectiveness of diverse LLMs in
detecting vulnerabilities across various programming languages. To address this
knowledge gap, we present a comprehensive empirical study evaluating the
performance of LLMs on the SVD task. We have compiled a comprehensive dataset
comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in
JavaScript. We assess five open-source LLMs using multiple approaches,
including prompt engineering, instruction tuning, and sequence classification
fine-tuning. These LLMs are benchmarked against five fine-tuned small language
models and two open-source static application security testing tools.
Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data
perspective: Retraining models using downsampled balanced datasets. b) Model
perspective: Investigating ensemble learning methods that combine predictions
from multiple LLMs. Our comprehensive experiments demonstrate that SVD remains
a challenging task for LLMs. This study provides a thorough understanding of
the role of LLMs in SVD and offers practical insights for future advancements
in leveraging generative AI to enhance software security practices.
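The two improvement avenues described in the abstract, downsampling to a balanced training set and combining predictions from multiple models by majority vote, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the helper names `downsample_balance` and `majority_vote` and the tie-breaking choice are assumptions.

```python
import random
from collections import Counter

def downsample_balance(samples, seed=0):
    """Randomly drop majority-class samples so both classes are equal size.

    Each sample is a (function_code, label) pair, where label 1 marks a
    vulnerable function and 0 a non-vulnerable one.
    """
    rng = random.Random(seed)
    pos = [s for s in samples if s[1] == 1]
    neg = [s for s in samples if s[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced

def majority_vote(predictions):
    """Combine binary predictions from several models.

    Ties are resolved toward 1 (vulnerable), a conservative choice for a
    security task; the paper may use a different combination rule.
    """
    votes = Counter(predictions)
    return 1 if votes[1] >= votes[0] else 0
```

For example, `majority_vote([1, 0, 1])` flags the function as vulnerable because two of the three models predict class 1.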