Multitask-based Evaluation of Open-Source LLM on Software Vulnerability

Xin Yin, Chao Ni, and Shaohua Wang

Both Xin Yin and Chao Ni are with the School of Software Technology, Zhejiang University, China. E-mail: xyin,[email protected]. Shaohua Wang is with Central University of Finance and Economics, China. E-mail: [email protected]. Chao Ni is the corresponding author.

Abstract

This paper proposes a pipeline for quantitatively evaluating interactive LLMs using publicly available datasets. We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks. We evaluate the multitask and multilingual aspects of LLMs based on this dataset. We find that the existing state-of-the-art methods are generally superior to LLMs in software vulnerability detection. Although LLMs improve accuracy when providing context information, they still have limitations in accurately predicting severity ratings for certain CWE types. In addition, LLMs demonstrate some ability to locate vulnerabilities for certain CWE types, but their performance varies among different CWE types. Finally, LLMs show uneven performance in generating CVE descriptions for various CWE types, with limited accuracy in a few-shot setting. Overall, though LLMs perform well in some aspects, they still need improvement in understanding the subtle differences in code vulnerabilities and the ability to describe vulnerabilities to fully realize their potential. Our evaluation pipeline provides valuable insights for further enhancing LLMs’ software vulnerability handling capabilities.

Index Terms:
Software Vulnerability Analysis, Large Language Model.

1 Introduction

Software Vulnerabilities (SVs) can expose software systems to risk and eventually cause huge economic losses or even threaten people’s lives. Therefore, addressing software vulnerabilities is an important task for software quality assurance (SQA). Generally, there are many important software quality activities for software vulnerabilities, such as SV detection, SV assessment, SV location, and SV description. The relationship among these SQA activities is intricate and interdependent, as illustrated in Fig. 1. SV detection serves as the initial phase, employing various tools and techniques to identify potential vulnerabilities within the software. Once vulnerabilities are detected, the focus shifts to SV assessment, where the severity and potential impact of each vulnerability are evaluated. This evaluation informs the subsequent steps in the process. SV location follows the assessment, pinpointing the exact areas within the software’s code or architecture where vulnerabilities exist. This step is crucial for precise remediation and for preventing the recurrence of similar vulnerabilities. The results of SV location feed into the SV description, which encapsulates detailed information about each vulnerability, including its origin, characteristics, and potential exploits. In essence, SV detection, SV assessment, SV location, and SV description together form a pipeline for addressing software vulnerabilities comprehensively. This systematic approach not only enhances the overall quality of the software but also fortifies it against potential threats, thereby safeguarding against economic losses and potential harm to individuals. As a cornerstone of software quality assurance, the integration of these activities underscores the importance of a proactive and thorough approach to managing software vulnerabilities.

Figure 1: The relationship among software vulnerability analysis activities

Recently, Large Language Models (LLMs) [1] have been widely adopted since the advances in Natural Language Processing (NLP) which enable LLM to be well-trained with both billions of parameters and billions of training samples, consequently bringing a large performance improvement on tasks adopted by LLMs. LLMs can be easily used for a downstream task by being fine-tuned [2] or being prompted [3] since they are trained to be general and they can capture different knowledge from various domain data. Fine-tuning is used to update model parameters for a particular downstream task by iterating the model on a specific dataset while prompting can be directly used by providing natural language descriptions or a few examples of the downstream task. Compared to prompting, fine-tuning is expensive since it requires additional model training and has limited usage scenarios, especially in cases where sufficient training datasets are unavailable. LLMs have demonstrated remarkable language comprehension and generation capabilities, and have been able to perform well on a variety of natural language processing tasks, such as text summarization [4]. Given the outstanding performance of LLMs, there is a growing focus on exploring their potential in software engineering tasks and seeking new opportunities to address them. Currently, as more and more LLMs designed for software engineering tasks are deployed [5, 6, 7, 8, 9, 10, 11], many research works focused on the application of LLMs in the software engineering domain [12, 13, 14, 15, 16]. However, in the existing literature, adequate systematic reviews and surveys have been conducted on LLMs in areas such as generating high-quality code and high-coverage test cases [17, 18], but a systematic review and evaluation of LLMs in the field of software vulnerability is still missing.

In this paper, we focus on evaluating LLMs’ performance in various software vulnerability (SV)-related tasks in few-shot and fine-tuning settings to obtain a basic, comprehensive, and better understanding of their multi-task ability, and we aim to answer the following research questions.

  • RQ-1: How do LLMs perform on vulnerability detection? Software Vulnerabilities (SVs) can expose software systems to risk and consequently cause software function failures. Therefore, detecting these SVs is an important task for software quality assurance. We want to explore the ability of LLMs on vulnerability detection as well as the performance difference compared with state-of-the-art approaches.

  • RQ-2: How do LLMs perform on vulnerability assessment? In practice, due to the limitation of SQA resources [19], it is impossible to treat all detected SVs equally and fix all SVs simultaneously. Thus, it is necessary to prioritize the detected software vulnerabilities. An effective way to prioritize SVs is to use one of the most widely known SV assessment frameworks, CVSS (Common Vulnerability Scoring System) [20], which characterizes SVs by considering three metric groups: Base, Temporal, and Environmental. The metrics in these groups can be used as criteria for selecting serious SVs to fix early. Therefore, we want to explore the ability of LLMs to assess vulnerabilities.

  • RQ-3: How do LLMs perform on vulnerability location? Identifying the precise location of vulnerabilities in software systems is of critical importance for mitigating risks and improving software quality. The vulnerability location task involves pinpointing these weaknesses accurately and helps to narrow the scope for developers to fix problems. Therefore, we aim to investigate LLMs’ capability in effectively identifying the precise location of vulnerabilities in software systems.

  • RQ-4: How do LLMs perform on vulnerability description? Understanding the intricacies of vulnerabilities in software systems plays a pivotal role in alleviating risks and bolstering software quality. The vulnerability description task focuses on conveying a detailed explanation of the identified issues in the source code and helps participants better understand each risk and its impact. We aim to assess LLMs’ capacity to effectively generate descriptions of vulnerabilities within software systems.

To comprehensively analyze LLMs’ ability, we use a large-scale dataset containing real-world project vulnerabilities (named Big-Vul [21]). We carefully design experiments to answer the four RQs, and the takeaway findings are shown in Table I. We also present the comparison of LLMs across the four software vulnerability tasks under different settings, as well as the impact of varying model sizes on performance, as depicted in Fig. LABEL:fig:radar1 and Fig. LABEL:fig:radar2. In summary, the key contributions of this paper include:

  • We extensively evaluate the performance of LLMs on different software vulnerability tasks and conduct an extensive comparison among LLMs and learning-based approaches to software vulnerability.

  • We design four RQs to comprehensively understand LLMs from different dimensions, and provide detailed results with examples.

  • We release our replication package for further study [22].

TABLE I: Insights and Takeaways: Evaluating LLMs on Software Vulnerability
Dimension | Findings or Insights
Vulnerability Detection | 1. Fine-tuned LLMs perform weaker than transformer-based methods, yet comparably to graph-based methods. Moreover, LLMs in the few-shot setting show lower performance than existing methods. 2. After fine-tuning, the detection capability of LLMs improves, except for Mistral. Larger models usually perform better, but performance can also be influenced by model design and pre-training data. 3. WizardCoder has the best vulnerability detection capability, while Mistral has the worst.
Vulnerability Assessment | 4. Larger parameter counts did not enhance the vulnerability assessment performance of LLMs, so smaller models should be prioritized for a better cost-performance balance. 5. LLMs have a limited capacity to assess vulnerability severity based on source code only, but in most cases their performance improves substantially when more context information is provided.
Vulnerability Localization | 6. The few-shot setting exposes LLM limitations, but fine-tuning enhances caution. 7. Mistral’s significant improvement after fine-tuning showcases its potential.
Vulnerability Description | 8. CodeLlama, StarCoder, WizardCoder, and Mistral excel at learning from historical description data.

2 Background and Related Work

2.1 Large Language Model

Since the advancements in Natural Language Processing, Large Language Models (LLMs) [1] have seen widespread adoption due to their capacity to be effectively trained with billions of parameters and training samples, resulting in significant performance enhancements. LLMs can readily be applied to downstream tasks through either fine-tuning [2] or prompting [3]. Their versatility stems from being trained to possess a broad understanding, enabling them to capture diverse knowledge across various domains. Fine-tuning involves updating the model parameters specifically for a given downstream task through iterative training on a specific dataset. In contrast, prompting allows for direct utilization by providing natural language descriptions or a few examples of the downstream task. Compared to prompting, fine-tuning is resource-intensive as it necessitates additional model training and is applicable in limited scenarios, particularly when adequate training datasets are unavailable.

LLMs are usually built on the transformer architecture [23] and can be classified into three types of architectures: encoder-only, encoder-decoder, and decoder-only. Encoder-only models (e.g., CodeBERT [24], GraphCodeBERT [25], and UniXcoder [26]) and encoder-decoder models (e.g., PLBART [27], CodeT5 [7], and CodeT5+ [8]) are trained using the Masked Language Modeling (MLM) or Masked Span Prediction (MSP) objective, respectively: a small portion (e.g., 15%) of the tokens is replaced with either masked tokens or masked span tokens, and the model is trained to recover them. These models are trained as general models on code-related data and are then fine-tuned for downstream tasks to achieve superior performance. Decoder-only models have also attracted attention; they are trained with the Causal Language Modeling objective to predict the probability of the next token given all previous tokens. GPT [2] and its variants are the most representative models and have brought large language models into practical usage.
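To make the difference between the two pre-training objectives concrete, the following minimal sketch builds training examples for each from a tokenized code line. The mask token string, the whitespace tokenization, and the function names are illustrative assumptions, not details of any specific model discussed here.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol; real tokenizers define their own


def mlm_example(tokens, mask_ratio=0.15, seed=0):
    """Masked Language Modeling: hide ~15% of tokens and keep them as targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}  # what the model must recover
    return corrupted, targets


def clm_examples(tokens):
    """Causal Language Modeling: predict each token from all previous tokens."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "int main ( ) { char buf [ 8 ] ; gets ( buf ) ; return 0 ; }".split()
corrupted, targets = mlm_example(tokens)
pairs = clm_examples(tokens)
```

In the MLM case the model sees the whole (corrupted) sequence at once, whereas in the CLM case each prediction is conditioned only on the prefix, which is why decoder-only models can be used directly for generation.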

Recently, ChatGPT has attracted wide attention. It is the successor of the large language model InstructGPT [28] with a dialog interface, and it is fine-tuned using the Reinforcement Learning with Human Feedback (RLHF) approach [29, 28, 30]. RLHF initially fine-tunes the base model using a small dataset of prompts as input and the desired output, typically human-written, to refine its performance. Subsequently, a reward model is trained on a larger set of prompts by sampling outputs generated by the fine-tuned model. These outputs are then ranked by human labelers to provide feedback for training the reward model. Reinforcement learning [31] is then used to calculate rewards for each generated output based on the reward model, updating the LLM parameters accordingly. With fine-tuning and alignment with human preferences, LLMs better understand input prompts and instructions, enhancing performance across various tasks [32, 28].

2.2 Software Vulnerability

Software Vulnerabilities (SVs) can expose software systems to risk and leave them open to cyber-attacks, eventually causing huge economic losses and even threatening people’s lives. Therefore, vulnerability databases have been created to document and analyze publicly known security vulnerabilities. For example, Common Vulnerabilities and Exposures (CVE) [33, 34] and SecurityFocus [35] are two well-known vulnerability databases. Besides, Common Weakness Enumeration (CWE) defines the common software weaknesses of individual vulnerabilities, which are often referred to as the vulnerability types of CVEs. To better address these vulnerabilities, researchers have proposed many approaches for understanding the effects of software vulnerabilities, including SV detection [36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46], SV assessment [47, 48, 20, 49, 50], SV localization [51, 52, 53], SV repair [54, 55, 56, 57], as well as SV description [58, 59, 60, 61]. Many novel technologies have been adopted to advance software vulnerability management, including software analysis [62, 63], machine learning [36, 43], and deep learning [47, 52], especially large language models [59, 60].

3 Experimental Design

In this section, we present our studied dataset, our studied LLMs, the prompt engineering, the baseline approaches, the evaluation metrics, and the experiment settings.

3.1 Studied Dataset

We adopt the widely used dataset (named Big-Vul) provided by Fan et al. [21] for the following reasons. The most important one is that it reflects the distinct characteristics of the real world and offers diversity, as suggested by previous works [43, 45]. Big-Vul, to the best of our knowledge, is the largest vulnerability dataset with diverse information about the vulnerabilities, which are collected from practical projects and recorded in the Common Vulnerabilities and Exposures (CVE) database (https://cve.mitre.org/). The second reason is to compare fairly with existing state-of-the-art (SOTA) approaches (e.g., LineVul, Devign, and SVulD).

In total, Big-Vul contains 3,754 code vulnerabilities collected from 348 open-source projects spanning 91 different vulnerability types from 2002 to 2019. It has 188,636 C/C++ functions with a vulnerable ratio of 5.7% (i.e., 10,900 vulnerable functions). The authors linked the code changes with CVEs and their descriptive information to enable a deeper analysis of the vulnerabilities.

In our work, some baselines need the structure information (e.g., control flow graph (CFG), data flow graph (DFG)) of the studied functions. Therefore, we use the Joern toolkit [64] to transform functions. Functions that cannot be transformed by Joern successfully are dropped. We also remove duplicated functions; the statistics of the studied dataset are shown in Table II. Finally, the filtered dataset is used for evaluation. We follow the same strategy as previous work [37, 54, 46] to build the training, validation, and testing data from the original dataset. Specifically, 80% of functions are treated as training data, 10% as validation data, and the remaining 10% as testing data. We also keep the class distribution in the training, validation, and testing data the same as in the original data.

TABLE II: The statistic of studied dataset
Datasets | # Vul. | # Non-Vul. | # Total | Ratio (Vul.:Non-Vul.)
Original Big-Vul | 10,900 | 177,736 | 188,636 | 0.061
Filtered Big-Vul | 5,260 | 96,308 | 101,568 | 0.055
Training | 4,208 | 4,208 | 8,416 | 1
Validating | 526 | 9,631 | 10,157 | 0.055
Testing | 526 | 9,631 | 10,157 | 0.055
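The split sizes in Table II can be reproduced with a per-class 80/10/10 split of the filtered dataset. The sketch below is only an illustration: the function names and the random seed are hypothetical, and the undersampling step is an assumption inferred from the balanced 4,208:4,208 training row rather than a procedure described in the text.

```python
import random


def stratified_split(labels, seed=0):
    """Return (train, valid, test) index lists using an 80/10/10 split per class."""
    rng = random.Random(seed)
    train, valid, test = [], [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        tenth = (len(idx) + 5) // 10  # 10% of the class, rounded to nearest
        valid += idx[:tenth]
        test += idx[tenth:2 * tenth]
        train += idx[2 * tenth:]
    return train, valid, test


def undersample(indices, labels, seed=0):
    """Assumed balancing step: randomly drop majority-class training samples."""
    rng = random.Random(seed)
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = sorted([pos, neg], key=len)
    return minority + rng.sample(majority, len(minority))


# Synthetic labels with the filtered Big-Vul class counts from Table II.
labels = [1] * 5260 + [0] * 96308
train, valid, test = stratified_split(labels)
balanced_train = undersample(train, labels)
print(len(balanced_train), len(valid), len(test))
```

Splitting each class separately keeps the vulnerable-to-non-vulnerable ratio of the validation and testing sets at roughly 0.055, matching the filtered dataset.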

3.2 Studied LLMs

General LLMs are pre-trained on textual data, including natural language and code, and can be used for a variety of tasks. In contrast, code-related LLMs are specifically pre-trained to automate code-related tasks. Due to the empirical nature of this work, we are interested in assessing the effectiveness of both LLM categories in vulnerability tasks. For the code-related LLMs, we select the top four models released recently (in 2023), namely DeepSeek-Coder [9], CodeLlama [11], StarCoder [10], and WizardCoder [65]. For the general LLMs, we select the top two models, namely Mistral [66] and Phi-2 [67]. For the few-shot setting, we select models with no more than 34B parameters from the Hugging Face Open LLM Leaderboard [68]; for the fine-tuning setting, we select models with 7B parameters or less. The constraint on the number of parameters is imposed by our computing resources (i.e., 192GB RAM, 10 × NVIDIA RTX 3090 GPU). Table III summarizes the characteristics of the studied LLMs. We briefly introduce these LLMs below to make our paper self-contained.

TABLE III: Overview of studied LLMs
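Models | DeepSeek-Coder | CodeLlama | StarCoder | WizardCoder | Mistral | Phi-2
Group | Code-related LLM | Code-related LLM | Code-related LLM | Code-related LLM | General LLM | General LLM
Fine-Tuning | 6.7B | 7B | 7B | 7B | 7B | 2.7B
Few-Shot | 6.7B & 33B | 7B & 34B | 7B & 34B | 7B & 15.5B | 7B | 2.7B
Release Date | Nov’23 | Aug’23 | May’23 | June’23 | Sep’23 | Dec’23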

Group 1: Code-related LLMs.

DeepSeek-Coder developed by DeepSeek AI [9] is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. They provide various sizes of the code model, ranging from 1B to 33B versions. Each model is pre-trained on project-level code corpus by employing a window size of 16K and an extra fill-in-the-blank task, to support project-level code completion and infilling. For coding capabilities, DeepSeek-Coder achieves state-of-the-art performance among open-source code models on multiple programming languages and various benchmarks.

CodeLlama proposed by Rozière et al. [11] is a set of large pre-trained language models for code built on Llama 2. They achieve state-of-the-art performance among open models on code tasks, provide infilling capabilities, support large input contexts, and demonstrate zero-shot instruction following for programming problems. CodeLlama is created by further training Llama 2 using increased sampling of code data. As with Llama 2, the authors applied extensive safety mitigations to the fine-tuned CodeLlama versions.

StarCoder proposed by Li et al. [10] is a large pre-trained language model specifically designed for code. It was pre-trained on a large amount of code data to acquire programming knowledge and trained on permissive data from GitHub, including over 80 programming languages, Git commits, GitHub issues, and Jupyter notebooks. StarCoder can perform code editing tasks, understand natural language prompts, and generate code that conforms to APIs. StarCoder represents the advancement of applying large language models in programming.

WizardCoder proposed by Luo et al. [65] is a large pre-trained language model that empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, the authors unveil the exceptional capabilities of their model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, WizardCoder even outperforms the largest closed LLMs, Anthropic’s Claude and Google’s Bard, on HumanEval and HumanEval+.

Group 2: General LLMs.

Mistral is a 7-billion-parameter language model released by Mistral AI [66]. Mistral 7B is a carefully designed language model that provides both efficiency and high performance to enable real-world applications. Due to its efficiency improvements, the model is suitable for real-time applications where quick responses are essential. At the time of its release, Mistral 7B outperformed the best open-source 13B model (Llama 2) in all evaluated benchmarks.

Phi-2, proposed by Microsoft [67], is a language model with 2.7 billion parameters. It is designed to reason in a more human-like way and to do so safely. It is trained on a mix of new language data with careful checks to ensure appropriate behavior. Phi-2 is built for many tasks, such as writing, text summarization, and coding, and it shows better common sense and understanding than its earlier version, Phi-1.5. Phi-2’s evaluation demonstrates its proficiency over larger models in aggregated benchmarks, emphasizing the potential of smaller models to achieve comparable or superior performance to their larger counterparts. This is particularly evident in its comparison with Google Gemini Nano 2, which Phi-2 outperforms despite its smaller size.

3.3 Prompt Engineering

We use prompts similar to those in the artifacts, papers, or technical reports associated with each corresponding model [5, 10, 11]. For the fine-tuning setting and the few-shot setting, we use the same prompt, where each prompt contains three pieces of information: (1) task description, (2) source code, and (3) indicator. Using the software vulnerability detection task as an example, the prompt provided to the LLM consists of three components, as depicted in Fig. 3:

  • Task Description (marked as ①). LLM is provided with the description constructed as ‘‘If this C code snippet has vulnerabilities, output Yes; otherwise, output No’’. The task descriptions used in the SV detection task vary based on the source programming language we employ.

  • Source Code (marked as ②). We provide LLM with the code wrapped in ‘‘// Code Start’’ and ‘‘// Code End’’. Since we illustrate an example in C, we use the C comment format of ‘‘//’’ as a prefix for the description. We also employ different comment prefixes based on the programming language of the code.

  • Indicator (marked as ③). LLM is instructed to think about the results. In this paper, we follow the best practice in previous work [12] and adopt the same prompt named ‘‘// Detection’’.

Depending on the specific software vulnerability tasks, the task descriptions and indicators in the prompts may vary. The task descriptions and indicators for different software vulnerability tasks are presented in Table IV.

TABLE IV: The task descriptions and indicators for different software vulnerability tasks
Dimension | Task Description | Indicator
Vulnerability Detection | If this C code snippet has vulnerabilities, output Yes; otherwise, output No. | // Detection
Vulnerability Assessment | Provide a qualitative severity ratings of CVSS v2.0 for the vulnerable C code snippet. | // Assessment
Vulnerability Location | Provide a vulnerability location result for the vulnerable C code snippet. | // Location
Vulnerability Description | Provide a CVE description for the vulnerable C code snippet. | // Description
Figure 3: The prompt contains three pieces of information: (1) task description, (2) source code, and (3) indicator
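As an illustration of the three-part prompt, the sketch below assembles a prompt from the task descriptions and indicators in Table IV. The exact line layout, the dictionary keys, and the `build_prompt` helper are assumptions made for this example; only the task descriptions, the ‘‘Code Start’’/‘‘Code End’’ markers, and the indicators come from the text.

```python
# Task descriptions and indicators copied verbatim from Table IV.
PROMPTS = {
    "detection": ("If this C code snippet has vulnerabilities, output Yes; "
                  "otherwise, output No.", "Detection"),
    "assessment": ("Provide a qualitative severity ratings of CVSS v2.0 for "
                   "the vulnerable C code snippet.", "Assessment"),
    "location": ("Provide a vulnerability location result for the vulnerable "
                 "C code snippet.", "Location"),
    "description": ("Provide a CVE description for the vulnerable C code "
                    "snippet.", "Description"),
}


def build_prompt(task, code, comment_prefix="//"):
    """Join (1) task description, (2) wrapped source code, (3) indicator."""
    description, indicator = PROMPTS[task]
    return "\n".join([
        f"{comment_prefix} {description}",   # (1) task description
        f"{comment_prefix} Code Start",      # (2) source code, wrapped
        code.strip(),
        f"{comment_prefix} Code End",
        f"{comment_prefix} {indicator}",     # (3) indicator
    ])


snippet = "void f(char *s) { char buf[8]; strcpy(buf, s); }"
prompt = build_prompt("detection", snippet)
print(prompt)
```

The `comment_prefix` parameter reflects the statement that the comment prefix changes with the programming language of the code; for C it is `//`.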

3.4 Baselines

To comprehensively compare the vulnerability detection performance of LLMs with existing state-of-the-art (SOTA) approaches, in this study, we consider the five approaches: Devign [36], ReVeal [45], IVDetect [52], LineVul [37], and SVulD [46]. We briefly introduce them as follows.

Devign proposed by Zhou et al. [36] is a general graph neural network-based model for graph-level classification through learning on a rich set of code semantic representations including AST, CFG, DFG, and code sequences. It uses a novel Conv module to efficiently extract useful features in the learned rich node representations for graph-level classification.

ReVeal proposed by Chakraborty et al. [45] contains two main phases. In the feature extraction phase, it translates code into a graph embedding, and in the training phase, it trains a representation learner on the extracted features to obtain a model that can distinguish the vulnerable functions from non-vulnerable ones.

IVDetect proposed by Li et al. [52] contains the coarse-grained vulnerability detection component and fine-grained interpretation component. In particular, IVDetect represents source code in the form of a program dependence graph (PDG) and treats the vulnerability detection problem as graph-based classification via graph convolution network with feature attention. As for interpretation, IVDetect adopts a GNNExplainer to provide fine-grained interpretations that include the sub-graph in PDG with crucial statements that are relevant to the detected vulnerability.

LineVul proposed by Fu et al. [37] is a Transformer-based line-level vulnerability prediction approach. LineVul leverages BERT architecture with self-attention layers which can capture long-term dependencies within a long sequence. Besides, benefiting from the large-scale pre-trained model, LineVul can intrinsically capture more lexical and logical semantics for the given code input. Moreover, LineVul adopts the attention mechanism of BERT architecture to locate the vulnerable lines for finer-grained detection.

SVulD proposed by Ni et al. [46] is a function-level subtle semantic embedding for vulnerability detection along with heuristic explanations. Particularly, SVulD adopts contrastive learning to train the UniXcoder semantic embedding model for learning distinguishing semantic representation of functions regardless of their lexically similar information.

3.5 Evaluation Metrics

For the considered software vulnerability-related tasks, we perform evaluations using widely adopted performance metrics. More precisely, to evaluate the effectiveness of LLMs on vulnerability detection and vulnerability assessment, we consider the following four metrics: F1-score, Recall, Precision, and Accuracy.
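For reference, the four classification metrics can be computed from the confusion-matrix counts as in the minimal sketch below, which assumes binary labels with 1 denoting a vulnerable function. The function name and the toy labels are illustrative only.

```python
def classification_metrics(y_true, y_pred):
    """Precision, Recall, F1-score, and Accuracy for binary labels (1 = vulnerable)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


metrics = classification_metrics([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(metrics)
```

On an imbalanced test set such as the one used here, Accuracy can remain high even when Precision is low, which is why the F1-score is reported alongside it.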

For the software vulnerability location task, we adopt the Hit@Acc, Precision, and Recall metrics (refer to Section 4.3 for more details). For the software vulnerability description task, we use Rouge-1, Rouge-2, and Rouge-L metrics.
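The Rouge metrics compare a generated description against the reference description by n-gram overlap (Rouge-1, Rouge-2) or longest common subsequence (Rouge-L). The following is a simplified sketch that assumes lowercased whitespace tokenization and the F1 variant of each score; standard Rouge toolkits may additionally apply stemming or other preprocessing, so their values can differ.

```python
from collections import Counter


def _f1(overlap, n_ref, n_cand):
    """Harmonic mean of overlap-based precision and recall."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_cand, overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    """Rouge-N F1 based on clipped n-gram overlap."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    ref_ng, cand_ng = _ngrams(ref, n), _ngrams(cand, n)
    overlap = sum((ref_ng & cand_ng).values())
    return _f1(overlap, sum(ref_ng.values()), sum(cand_ng.values()))


def rouge_l(reference, candidate):
    """Rouge-L F1 based on the longest common subsequence (LCS)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    prev = [0] * (len(cand) + 1)
    for r in ref:  # row-by-row dynamic programming over the LCS table
        cur = [0]
        for j, c in enumerate(cand, 1):
            cur.append(prev[j - 1] + 1 if r == c else max(prev[j], cur[j - 1]))
        prev = cur
    return _f1(prev[-1], len(ref), len(cand))


ref_text = "buffer overflow in the parser"
gen_text = "heap buffer overflow in parser"
scores = (rouge_n(ref_text, gen_text, 1),
          rouge_n(ref_text, gen_text, 2),
          rouge_l(ref_text, gen_text))
```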

3.6 Implementation

We develop the generation pipeline in Python, utilizing PyTorch [69] implementations of DeepSeek Coder, CodeLlama, StarCoder, WizardCoder, Mistral, and Phi-2. We use Hugging Face [70] to load the model weights and generate outputs. We also adhere to the best-practice guide [71] for each prompt. For the fine-tuning setting, we select models with 7B parameters or less, and for the few-shot setting, we use models with no more than 34B parameters. To directly compare the fine-tuning setting with the few-shot setting, we employ models with the same parameter count in both settings (i.e., DeepSeek Coder 6.7B, CodeLlama 7B, StarCoder 7B, WizardCoder 7B, Mistral 7B, and Phi-2 2.7B). The constraint on the number of parameters is imposed by our computing resources. Table III summarizes the characteristics of the studied LLMs. Furthermore, considering the limited context window of LLMs, we manually select three examples from the training set for the few-shot setting. Regarding ReVeal, IVDetect, LineVul, and SVulD, we utilize their publicly available source code and perform fine-tuning with the default parameters provided in their original code. Since Devign’s code is not publicly available, we make every effort to replicate its functionality and achieve similar results on the original paper’s dataset. All these models are implemented using the PyTorch [69] framework. The evaluation is conducted on a 16-core workstation equipped with an Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, 192GB RAM, and 10 × NVIDIA RTX 3090 GPU, running Ubuntu 20.04.1 LTS.
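One plausible way to assemble a few-shot input from three manually selected training examples is sketched below for the detection task. The layout (each demonstration followed by its answer, blocks separated by a blank line) and the helper names are assumptions for illustration; the text specifies only the prompt components and the number of examples.

```python
TASK = "If this C code snippet has vulnerabilities, output Yes; otherwise, output No."


def format_block(code, answer=""):
    """One prompt block; demonstrations carry an answer, the query does not."""
    lines = [f"// {TASK}", "// Code Start", code.strip(),
             "// Code End", "// Detection"]
    if answer:
        lines.append(answer)
    return "\n".join(lines)


def few_shot_prompt(demos, query_code):
    """Concatenate three (code, answer) demonstrations followed by the query."""
    if len(demos) != 3:
        raise ValueError("expected exactly three demonstrations")
    blocks = [format_block(code, answer) for code, answer in demos]
    blocks.append(format_block(query_code))
    return "\n\n".join(blocks)


# Toy demonstrations standing in for functions chosen from the training set.
demos = [
    ("void a(char *s) { char b[4]; strcpy(b, s); }", "Yes"),
    ("int add(int x, int y) { return x + y; }", "No"),
    ("void c(int n) { int *p = malloc(n); p[n] = 0; }", "Yes"),
]
prompt = few_shot_prompt(demos, "int id(int x) { return x; }")
```

Keeping the number of demonstrations small limits the prompt length, which matters because long functions can otherwise exhaust the model's context window.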

4 Experimental Results

This section presents the experimental results by evaluating LLMs’ performance on the widely used comprehensive dataset (i.e., Big-Vul [21]) covering four SV-related tasks.

4.1 RQ-1: Evaluating Vulnerability Detection of LLMs

In this RQ, we first investigate the vulnerability detection of LLMs and make a comparison with the existing state-of-the-art (SOTA) approaches. Then, we conduct a more detailed analysis of the results, comparing the detection performance of LLMs under the Top-10 CWE types.

Experimental Setting. We instruct LLMs with the following task description to tell them to act as a vulnerability detector.

Task Description: If this C code snippet has vulnerabilities, output Yes; otherwise, output No.

We consider five SOTA baselines: Devign [36], ReVeal [45], IVDetect [52], LineVul [37], and SVulD [46]. These approaches can be divided into two groups: graph-based (i.e., Devign, ReVeal, and IVDetect) and transformer-based (i.e., LineVul and SVulD). Besides, in order to comprehensively compare the performance of baselines and LLMs, we consider four widely used performance measures (i.e., Precision, Recall, F1-score, and Accuracy) and conduct experiments on the popular dataset. Since graph-based approaches need the structure information (e.g., control flow graph (CFG), data flow graph (DFG)) of the studied functions, we use the Joern toolkit to transform functions. Functions that cannot be transformed by Joern successfully are dropped. Finally, the filtered dataset (shown in Table II) is used for evaluation. We follow the same strategy as previous work [37, 54] to build the training, validation, and testing data from the original dataset. Specifically, 80% of functions are treated as training data, 10% as validation data, and the remaining 10% as testing data. We also keep the class distribution in the training, validation, and testing data the same as in the original data. Apart from presenting the overall performance comparison, we also give the detailed performance of LLMs on the Top-10 CWE types for a better analysis.

Results. [A] LLMs vs. SOTA approaches. Table V shows the overall performance measures between LLMs and five baselines and the best performances are highlighted in bold. According to the results in Table V, we can obtain the following observations:

(1) Fine-tuned LLMs have poor performance compared with transformer-based approaches when considering F1-score, Precision, and Accuracy. In particular, SVulD obtains 0.336, 0.282, and 0.915 in terms of F1-score, Precision, and Accuracy, which surpass the fine-tuned LLMs by 23.5%-242.9%, 70.9%-432.1%, and 16.9%-162.2% in terms of F1-score, Precision, and Accuracy, respectively. LineVul also outperforms all LLMs in precision and accuracy, with F1-score equal to the best LLM.

(2) The performance of fine-tuned LLMs is comparable to that of graph-based approaches. As for Recall, fine-tuned LLMs generally outperform previous SOTA approaches, except for StarCoder (0.646), which falls slightly behind the top-performing baseline, Devign (0.660). Excluding Mistral, fine-tuned LLMs achieve F1-scores of 0.214-0.272, whereas graph-based approaches achieve 0.200-0.232. As for Precision, these LLMs score between 0.123 and 0.165, while graph-based methods range from 0.118 to 0.172. In terms of Accuracy, these LLMs range from 0.685 to 0.783, whereas graph-based methods range from 0.726 to 0.815.

(3) LLMs under the few-shot setting perform worse than existing approaches. LLMs ranging from 2.7B to 34B parameters fall behind existing approaches in terms of F1-score and Precision. As for Accuracy, however, SVulD (transformer-based) obtains the best performance (0.915), and DeepSeek-Coder 6.7B ranks third (0.823), ahead of the three graph-based approaches.

Finding-1. LLMs can detect software vulnerabilities, but fine-tuned LLMs perform weaker than transformer-based methods, yet comparably to graph-based methods. Moreover, LLMs in the few-shot setting show lower performance than existing methods.
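The percentage gaps quoted in observation (1) and the F1-scores in Table V can be recomputed from the table entries. The sketch below assumes that the gaps are relative improvements, (a - b) / b, and that F1-score is the harmonic mean of Precision and Recall; the helper names are illustrative only.

```python
def relative_gain(ours: float, other: float) -> float:
    """Percentage by which `ours` exceeds `other`."""
    return (ours - other) / other * 100

def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values copied from Table V.
svuld_f1 = 0.336
best_llm_f1, worst_llm_f1 = 0.272, 0.098  # fine-tuned WizardCoder and Mistral

print(f"{relative_gain(svuld_f1, best_llm_f1):.1f}%")   # 23.5%
print(f"{relative_gain(svuld_f1, worst_llm_f1):.1f}%")  # 242.9%
print(f"{f1_score(0.174, 0.620):.3f}")  # LineVul: 0.272
print(f"{f1_score(0.123, 0.830):.3f}")  # fine-tuned DeepSeek-Coder: 0.214
```

Because the table reports rounded Precision and Recall, recomputed F1-scores may differ from the reported ones in the last digit for some rows.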

[B] Fine-Tuning vs. Few-Shot. The experimental results are presented in Table V. Based on these results, we make the following observations: (1) LLMs fine-tuned for vulnerability detection perform better on the task than the same LLMs in the few-shot setting (except Mistral). F1-score and Precision more than double, and Recall also improves. (2) Mistral experiences a significant drop in performance after fine-tuning, except for Recall. This could be due to the absence of high-quality security vulnerability-related data in its pre-training. (3) LLMs with more parameters typically exhibit better performance. For example, in the few-shot setting, CodeLlama 34B improves upon CodeLlama 7B by 19.4%, 34.5%, and 37.0% in terms of F1-score, Precision, and Accuracy, respectively. However, different LLMs may exhibit performance variations due to differences in model design and the quality of pre-training data. (4) Phi-2 achieves performance close to that of the LLMs with 7 billion parameters, even though it has only 2.7 billion parameters. This may be attributed to the higher quality of its pre-training data.

Finding-2. After fine-tuning, the detection capability of LLMs has improved, except for Mistral. Larger models usually perform better, but performance can also be influenced by model design and pre-training data.

[C] The comparison of Top-10 CWE types among LLMs. Table VI shows the detailed comparison on the Top-10 CWE types among fine-tuned LLMs. In this table, we highlight the best performance for each metric in bold. According to the results, we make the following observations:

(1) In most cases, WizardCoder outperforms the other LLMs across all performance metrics. Mistral performs the worst on many CWE types, often several times worse than the other LLMs, which means Mistral has almost no ability to detect vulnerabilities. The other LLMs have advantages on different CWE types and complement each other.

(2) Considering all metrics, WizardCoder achieves the best performance on CWE-119 (“Improper Restriction of Operations within the Bounds of a Memory Buffer”), CWE-125 (“Out-of-bounds Read”), CWE-362 (“Concurrent Execution using Shared Resource with Improper Synchronization (’Race Condition’)”), and CWE-476 (“NULL Pointer Dereference”), which indicates that WizardCoder is particularly good at detecting vulnerabilities related to memory handling and synchronization issues.

Finding-3. WizardCoder has the best vulnerability detection capability, while Mistral is the worst, with the remaining LLMs complementing each other.
TABLE V: The comparison between LLMs and five baselines on software vulnerability detection
Methods F1-score Recall Precision Accuracy
Devign 0.200 0.660 0.118 0.726
ReVeal 0.232 0.354 0.172 0.811
IVDetect 0.231 0.540 0.148 0.815
LineVul 0.272 0.620 0.174 0.828
SVulD 0.336 0.414 0.282 0.915
Fine-Tuning Setting
DeepSeek-Coder 6.7B 0.214 0.830 0.123 0.685
CodeLlama 7B 0.265 0.797 0.159 0.771
StarCoder 7B 0.224 0.646 0.136 0.769
WizardCoder 7B 0.272 0.781 0.165 0.783
Mistral 0.098 0.686 0.053 0.349
Phi-2 0.239 0.754 0.142 0.751
Few-Shot Setting
DeepSeek-Coder 6.7B 0.084 0.156 0.057 0.823
DeepSeek-Coder 33B 0.107 0.688 0.058 0.404
CodeLlama 7B 0.098 0.449 0.055 0.570
CodeLlama 34B 0.117 0.281 0.074 0.781
StarCoder 7B 0.094 0.443 0.053 0.560
StarCoder 15.5B 0.097 0.557 0.053 0.463
WizardCoder 7B 0.086 0.380 0.049 0.583
WizardCoder 34B 0.128 0.559 0.072 0.607
Mistral 0.126 0.401 0.074 0.711
Phi-2 0.099 0.563 0.054 0.471
TABLE VI: The software vulnerability detection comparison on Top-10 CWEs among fine-tuned LLMs
Columns: CWE Type, # Total, # Vul., F1-score (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2), Precision (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2)
CWE-119 1549 128 0.291 0.310 0.311 0.340 0.138 0.319 0.181 0.200 0.201 0.219 0.078 0.205
CWE-20 1082 80 0.261 0.292 0.263 0.282 0.130 0.252 0.155 0.176 0.160 0.170 0.071 0.151
CWE-264 800 64 0.411 0.444 0.396 0.495 0.186 0.425 0.270 0.309 0.276 0.343 0.107 0.290
CWE-399 697 35 0.242 0.259 0.279 0.299 0.080 0.269 0.140 0.154 0.167 0.181 0.042 0.157
CWE-125 582 29 0.218 0.216 0.222 0.268 0.099 0.265 0.126 0.128 0.134 0.160 0.053 0.158
CWE-200 573 27 0.252 0.277 0.244 0.314 0.107 0.244 0.152 0.174 0.152 0.202 0.058 0.152
CWE-189 442 21 0.162 0.229 0.168 0.275 0.067 0.114 0.092 0.135 0.097 0.163 0.035 0.065
CWE-362 413 16 0.063 0.065 0.057 0.063 0.022 0.038 0.033 0.034 0.030 0.033 0.011 0.020
CWE-416 406 12 0.123 0.118 0.084 0.136 0.042 0.125 0.068 0.067 0.048 0.077 0.022 0.069
CWE-476 367 11 0.089 0.091 0.049 0.127 0.026 0.082 0.047 0.049 0.026 0.068 0.013 0.043
Columns: CWE Type, # Total, # Vul., Recall (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2), Accuracy (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2)
CWE-119 1549 128 0.742 0.680 0.680 0.758 0.633 0.719 0.701 0.750 0.751 0.757 0.349 0.747
CWE-20 1082 80 0.828 0.859 0.734 0.813 0.797 0.766 0.723 0.753 0.756 0.755 0.369 0.731
CWE-264 800 64 0.863 0.788 0.700 0.888 0.738 0.800 0.753 0.803 0.786 0.819 0.356 0.784
CWE-399 697 35 0.889 0.815 0.852 0.852 0.741 0.963 0.785 0.819 0.829 0.845 0.340 0.798
CWE-125 582 29 0.828 0.690 0.655 0.828 0.724 0.828 0.704 0.751 0.771 0.775 0.345 0.771
CWE-200 573 27 0.743 0.686 0.629 0.714 0.657 0.629 0.731 0.782 0.763 0.810 0.328 0.763
CWE-189 442 21 0.688 0.750 0.625 0.875 0.625 0.438 0.742 0.817 0.776 0.833 0.367 0.753
CWE-362 413 16 0.800 0.600 0.600 0.600 0.600 0.400 0.714 0.792 0.758 0.782 0.349 0.753
CWE-416 406 12 0.667 0.500 0.333 0.583 0.500 0.667 0.719 0.778 0.786 0.781 0.323 0.724
CWE-476 367 11 0.857 0.714 0.429 1.000 0.429 0.714 0.665 0.728 0.684 0.738 0.381 0.695

4.2 RQ-2: Evaluating Vulnerability Assessment of LLMs

In this RQ, we define two task descriptions for vulnerability assessment: (1) code-based and (2) code-based with additional key information. We compare the performance of LLMs under both task descriptions and conduct a case study to illustrate the effectiveness of incorporating key information.

Experimental Setting. We instruct each LLM with the following task descriptions (i.e., Task Description 1 and Task Description 2) so that it acts as a vulnerability assessor. We first provide the LLM with only the vulnerable code to explore its performance (Task Description 1). We then additionally provide key information, including the CVE description, the project, the commit message, and the name of the file in which the vulnerable code resides, to investigate the performance difference (Task Description 2).

Task Description 1: Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet.

Task Description 2: Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet (with additional information).
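To make the two settings concrete, the sketch below shows one way the labels and prompts could be assembled. The score ranges follow the NVD qualitative ratings for CVSS v2.0 (Low 0.0-3.9, Medium 4.0-6.9, High 7.0-10.0). The prompt layout, the `build_prompt` helper, and its parameter names are assumptions for illustration; only the two task sentences and the four kinds of additional information come from the text.

```python
def cvss_v2_rating(base_score: float) -> str:
    """Map a CVSS v2.0 base score to its qualitative rating (NVD ranges)."""
    if not 0.0 <= base_score <= 10.0:
        raise ValueError("CVSS v2.0 base scores lie in [0.0, 10.0]")
    if base_score < 4.0:
        return "Low"
    if base_score < 7.0:
        return "Medium"
    return "High"

TASK_1 = ("Provide a qualitative severity rating of CVSS v2.0 "
          "for the vulnerable C code snippet.")
TASK_2 = ("Provide a qualitative severity rating of CVSS v2.0 "
          "for the vulnerable C code snippet (with additional information).")

def build_prompt(code, project=None, file_name=None,
                 cve_description=None, commit_message=None):
    """Hypothetical prompt layout: task sentence, optional key information, code."""
    extra = {
        "Project": project,
        "File Name": file_name,
        "CVE Description": cve_description,
        "Commit Message": commit_message,
    }
    provided = [f"{key}: {value}" for key, value in extra.items() if value]
    task = TASK_2 if provided else TASK_1
    return "\n".join([task, *provided, "Code:", code])

print(build_prompt("int f(void) { return 0; }", project="Linux",
                   file_name="net/wireless/nl80211.c"))
```

Switching between the two task descriptions based on whether any key information is present keeps the code-only and code-plus-context prompts otherwise identical, so differences in the response can be attributed to the added context.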
TABLE VII: The comparison of LLMs on software vulnerability assessment
Methods F1-score Recall Precision Accuracy
Fine-Tuning Setting
DeepSeek-Coder 6.7B 0.573 0.547 0.648 0.691
CodeLlama 7B 0.621 0.592 0.684 0.703
StarCoder 7B 0.563 0.539 0.614 0.667
WizardCoder 7B 0.558 0.537 0.619 0.686
Mistral 0.318 0.343 0.329 0.541
Phi-2 0.495 0.481 0.599 0.663
Few-Shot Setting
DeepSeek-Coder 6.7B 0.244 0.375 0.358 0.253
DeepSeek-Coder 33B 0.302 0.342 0.338 0.352
CodeLlama 7B 0.303 0.331 0.319 0.367
CodeLlama 34B 0.274 0.367 0.321 0.301
StarCoder 7B 0.314 0.363 0.358 0.450
StarCoder 15.5B 0.309 0.378 0.340 0.356
WizardCoder 7B 0.238 0.357 0.334 0.265
WizardCoder 34B 0.334 0.388 0.368 0.390
Mistral 0.201 0.349 0.320 0.202
Phi-2 0.266 0.346 0.335 0.312
TABLE VIII: A vulnerable code for CodeLlama to assess with different prompts
Improper Restriction of Operations within the Bounds of a Memory Buffer Vulnerability (CWE-119) in Linux
Task Description 1 Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet.
Input 1 An example of a C code snippet with vulnerabilities (CVE-2011-2517).
Response 1 Severity: Medium
Task Description 2 Provide a qualitative severity rating of CVSS v2.0 for the vulnerable C code snippet (with additional information).
Input 2 Project: Linux
File Name: net/wireless/nl80211.c
CVE Description: Multiple buffer overflows in net/wireless/nl80211.c in the Linux kernel before 2.6.39.2 allow local users to gain privileges by leveraging the CAP_NET_ADMIN capability during scan operations with a long SSID value.
Commit Message: nl80211: fix check for valid SSID size in scan operations. In both trigger_scan and sched_scan operations, we were checking for the SSID length before assigning the value correctly. Since the memory was just kzalloc’ed, the check was always failing and SSID with over 32 characters were allowed to go through. This was causing a buffer overflow when copying the actual SSID to the proper place. This bug has been there since 2.6.29-rc4.
Response 2 Severity: High
Analysis The true Severity is High. After the additional key information is provided, CodeLlama’s output for Severity changes from Medium to High.

Results. Table VII shows the detailed results of LLMs on vulnerability assessment. The results show a significant improvement in the vulnerability assessment capability of LLMs after fine-tuning. Specifically, Accuracy increases from 0.202-0.450 to 0.541-0.703, and Precision improves from 0.319-0.368 to 0.329-0.684. This underscores the necessity of fine-tuning for downstream tasks. Notably, the fine-tuned CodeLlama achieves the best performance across all metrics. For researchers who need to perform vulnerability assessment with an LLM, we recommend CodeLlama as the preferred option. We also find that Mistral shows a relatively smaller improvement after fine-tuning, which aligns with our expectations, as it is a general-purpose LLM. In the few-shot setting, StarCoder stands out by achieving performance comparable to the 34B LLMs with a significantly smaller parameter size (i.e., 7B): it achieves an F1-score of 0.314 and an Accuracy of 0.450.

Finding-4: Larger parameter counts did not improve the vulnerability assessment performance of LLMs; smaller models should be prioritized for a better cost-performance balance.

Case Study. To illustrate the effectiveness of key information, we present a vulnerability (CWE-119) in Big-Vul that is assessed by CodeLlama, as depicted in Table VIII. This example is a vulnerability in the Linux project, categorized under CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer). In the initial assessment without key information, CodeLlama did not fully grasp the severity of this vulnerability and labeled it as “Medium”. With the additional details, however, CodeLlama can evaluate the risk level more accurately. The CVE description highlights multiple buffer overflows in the net/wireless/nl80211.c file of the Linux kernel before version 2.6.39.2. These overflows allow local users to gain elevated privileges by leveraging the CAP_NET_ADMIN capability during scan operations with an excessively long SSID value. In this scenario, the lack of proper validation of the SSID length leads to buffer overflows, enabling attackers to exploit the vulnerability, escalate privileges, and execute malicious code. The commit message states that this bug has existed since version 2.6.29-rc4 of the Linux kernel. Given this information, CodeLlama reassesses the risk level as “High”. This is because the vulnerability allows attackers to escalate privileges and execute malicious code, and because it persisted for a considerable period of time. It is crucial to patch this vulnerability promptly by updating the operating system or kernel.

To compare the vulnerability assessment capabilities of LLMs after providing key information, we present a performance comparison bar chart in Fig. 4. LLMs have limited capacity for assessing vulnerability severity based solely on source code. However, when provided with key information, most LLMs (i.e., DeepSeek-Coder, CodeLlama, WizardCoder, and Mistral) exhibit significantly improved vulnerability assessment capabilities, particularly in terms of Accuracy, which increases from the range of 0.2-0.37 to the range of 0.21-0.43. StarCoder and Phi-2 show a declining trend, which we believe may be attributed to the increase in the number of input tokens caused by the added key information; these LLMs may not handle very long input sequences well. In contrast, DeepSeek-Coder exhibits a significant improvement, possibly due to its proficiency in handling long sequences.

Figure 4: The impact of key important information on LLM Vulnerability Assessment
Finding-5: LLMs have limited capacity for assessing vulnerability severity based on source code only, but in most cases their performance improves considerably when more context information is provided.

4.3 RQ-3: Evaluating Vulnerability Location of LLMs

In this RQ, we first outline how to assess the vulnerability location capabilities of LLMs. Then, we proceed to compare the vulnerability location abilities of LLMs across different settings, both at a general level and in detail, and analyze the reasons behind the observed differences.

Experimental Setting. We select the vulnerable functions in the testing set that carry information on their vulnerable lines, and we instruct each LLM with the following task description to explore its vulnerability location performance.

Task Description: Provide a vulnerability location result for the vulnerable C code snippet.

For a specific vulnerable function, the ground truth may contain one or several vulnerable lines of code ($Lines_{ground}$), and the LLM may also predict one or several potentially vulnerable lines ($Lines_{predict}$). We compare $Lines_{predict}$ with $Lines_{ground}$ to check whether each predicted line index $L_i \in Lines_{predict}$ belongs to $Lines_{ground}$: if it does, the prediction for line $L_i$ is treated as correct; otherwise, it is treated as incorrect. We use $Lines_{both}$ to denote the intersection of $Lines_{predict}$ and $Lines_{ground}$.

To better evaluate the vulnerability location performance of LLM on a specific vulnerable function, we give the following definitions:

  • Hit@Acc measures the effectiveness of the LLM: it equals 1 if the LLM correctly predicts at least one vulnerable line, and 0 otherwise.

  • Precision indicates how many of the LLM’s predicted vulnerability locations are actual vulnerability locations. It is defined as $\mathit{Precision} = \frac{\#Lines_{both}}{\#Lines_{predict}}$.

  • Recall indicates how many of the actual vulnerability locations the LLM correctly finds. It is defined as $\mathit{Recall} = \frac{\#Lines_{both}}{\#Lines_{ground}}$.

For example, suppose a given vulnerable function has six vulnerable lines, “[2, 3, 5, 9, 14, 23]”, and the LLM predicts 10 potentially vulnerable lines, “[1, 3, 5, 11, 15, 16, 17, 21, 22, 23]”. Then $Lines_{ground}$ equals “[2, 3, 5, 9, 14, 23]”, $Lines_{predict}$ equals “[1, 3, 5, 11, 15, 16, 17, 21, 22, 23]”, and $Lines_{both}$ equals “[3, 5, 23]”. From these values, we obtain $\mathit{Precision} = \frac{3}{10}$, $\mathit{Recall} = \frac{3}{6}$, and $\mathit{Hit@Acc} = 1$.
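The three definitions translate directly into set operations. The following sketch reproduces the example above; the function name `location_metrics` is illustrative, and returning 0 for an empty prediction or empty ground truth is an assumption, since the text does not cover that case.

```python
def location_metrics(lines_ground, lines_predict):
    """Per-function Precision, Recall, and Hit@Acc for vulnerability location."""
    ground, predict = set(lines_ground), set(lines_predict)
    both = ground & predict  # Lines_both: correctly predicted lines
    return {
        # Assumption: an empty prediction or ground truth scores 0.
        "precision": len(both) / len(predict) if predict else 0.0,
        "recall": len(both) / len(ground) if ground else 0.0,
        "hit_at_acc": 1 if both else 0,
        "lines_both": sorted(both),
    }

result = location_metrics(
    lines_ground=[2, 3, 5, 9, 14, 23],
    lines_predict=[1, 3, 5, 11, 15, 16, 17, 21, 22, 23],
)
print(result)
# {'precision': 0.3, 'recall': 0.5, 'hit_at_acc': 1, 'lines_both': [3, 5, 23]}
```

Averaging these per-function values over the selected test functions would give dataset-level numbers of the kind reported in the tables, although the exact aggregation used is not stated in this section.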

TABLE IX: The comparison of LLMs on software vulnerability location
Methods Precision Recall Hit@Acc
Fine-Tuning Setting
DeepSeek-Coder 6.7B 0.121 0.186 0.307
CodeLlama 7B 0.097 0.137 0.304
StarCoder 7B 0.113 0.175 0.284
WizardCoder 7B 0.111 0.122 0.255
Mistral 0.133 0.163 0.320
Phi-2 0.081 0.177 0.276
Few-Shot Setting
DeepSeek-Coder 6.7B 0.085 0.314 0.472
DeepSeek-Coder 33B 0.094 0.393 0.508
CodeLlama 7B 0.066 0.228 0.335
CodeLlama 34B 0.082 0.231 0.358
StarCoder 7B 0.093 0.341 0.469
StarCoder 15.5B 0.085 0.499 0.662
WizardCoder 7B 0.078 0.188 0.338
WizardCoder 34B 0.086 0.204 0.327
Mistral 0.068 0.060 0.155
Phi-2 0.102 0.063 0.211
TABLE X: The comparison of software vulnerability location performance for Top-10 CWE Types between LLMs
Columns: CWE Type, # Vul., Precision in the Fine-Tuning Setting (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2), Precision in the Few-Shot Setting (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2)
CWE-119 128 0.121 0.065 0.125 0.103 0.100 0.056 0.084 0.043 0.099 0.052 0.043 0.105
CWE-20 80 0.154 0.141 0.198 0.190 0.178 0.088 0.075 0.093 0.097 0.121 0.073 0.054
CWE-264 64 0.083 0.065 0.112 0.059 0.090 0.081 0.064 0.062 0.072 0.058 0.081 0.091
CWE-399 35 0.108 0.194 0.128 0.116 0.173 0.055 0.066 0.050 0.131 0.058 0.036 0.152
CWE-125 29 0.094 0.127 0.048 0.055 0.090 0.112 0.087 0.077 0.090 0.066 0.041 0.065
CWE-200 27 0.088 0.082 0.041 0.077 0.120 0.062 0.068 0.072 0.078 0.066 0.092 0.127
CWE-189 21 0.048 0.170 0.035 0.024 0.231 0.294 0.125 0.083 0.069 0.043 0.041 0.185
CWE-362 16 0.037 0.069 0.027 0.088 0.138 0.077 0.032 0.034 0.089 0.030 0.033 0.167
CWE-416 12 0.200 0.333 0.031 0.286 0.080 0.034 0.203 0.116 0.134 0.185 0.000 0.190
CWE-476 11 0.020 0.053 0.029 0.111 0.071 0.000 0.120 0.099 0.129 0.256 0.200 0.071
Columns: CWE Type, # Vul., Recall in the Fine-Tuning Setting (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2), Recall in the Few-Shot Setting (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2)
CWE-119 128 0.184 0.128 0.234 0.128 0.163 0.153 0.356 0.163 0.366 0.147 0.038 0.078
CWE-20 80 0.144 0.106 0.116 0.106 0.106 0.069 0.156 0.153 0.275 0.169 0.041 0.025
CWE-264 64 0.138 0.114 0.263 0.114 0.120 0.293 0.341 0.305 0.341 0.198 0.096 0.096
CWE-399 35 0.317 0.256 0.256 0.195 0.366 0.244 0.293 0.220 0.463 0.146 0.024 0.085
CWE-125 29 0.208 0.178 0.089 0.059 0.099 0.297 0.356 0.267 0.327 0.119 0.030 0.030
CWE-200 27 0.262 0.123 0.062 0.092 0.185 0.215 0.400 0.354 0.415 0.323 0.169 0.108
CWE-189 21 0.077 0.205 0.103 0.026 0.615 0.128 0.615 0.359 0.436 0.154 0.103 0.128
CWE-362 16 0.125 0.250 0.125 0.375 1.000 0.500 0.125 0.125 0.625 0.125 0.125 0.125
CWE-416 12 0.356 0.289 0.044 0.311 0.044 0.022 0.622 0.378 0.422 0.378 0.000 0.089
CWE-476 11 0.071 0.286 0.071 0.286 0.143 0.000 0.786 0.786 0.786 0.714 0.286 0.071
Columns: CWE Type, # Vul., Hit@Acc in the Fine-Tuning Setting (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2), Hit@Acc in the Few-Shot Setting (DeepSeek-Coder, CodeLlama, StarCoder, WizardCoder, Mistral, Phi-2)
CWE-119 128 0.326 0.292 0.270 0.191 0.270 0.292 0.416 0.270 0.438 0.303 0.101 0.247
CWE-20 80 0.404 0.426 0.426 0.404 0.532 0.277 0.489 0.362 0.596 0.383 0.149 0.149
CWE-264 64 0.172 0.241 0.414 0.276 0.276 0.259 0.397 0.276 0.379 0.328 0.121 0.224
CWE-399 35 0.476 0.286 0.190 0.238 0.524 0.333 0.571 0.286 0.476 0.238 0.095 0.238
CWE-125 29 0.500 0.364 0.318 0.227 0.273 0.318 0.545 0.545 0.591 0.273 0.136 0.136
CWE-200 27 0.385 0.231 0.154 0.192 0.385 0.269 0.538 0.500 0.577 0.577 0.346 0.269
CWE-189 21 0.250 0.333 0.250 0.083 0.417 0.333 0.667 0.333 0.667 0.500 0.250 0.333
CWE-362 16 0.333 0.667 0.333 0.333 1.000 0.333 0.333 0.333 0.667 0.333 0.333 0.333
CWE-416 12 0.200 0.400 0.200 0.400 0.200 0.100 0.600 0.300 0.500 0.300 0.000 0.400
CWE-476 11 0.143 0.429 0.143 0.429 0.286 0.000 0.571 0.571 0.571 0.571 0.571 0.143

Results. Table IX presents the average vulnerability location performance of LLMs, while Table X compares the vulnerability location performance on the Top-10 CWE types across different LLMs. Both tables contrast the fine-tuning and few-shot settings. Based on these tables, we make the following observations:

(1) The few-shot setting reveals limitations in LLMs, and after fine-tuning, LLMs become more cautious. Judging by Recall and Hit@Acc, the vulnerability location capability of LLMs appears to weaken after fine-tuning. For example, as shown in Table IX, for the model sizes evaluated in both settings, the Recall range changes from 0.060-0.341 to 0.122-0.186, and the Hit@Acc range changes from 0.155-0.472 to 0.255-0.320. Table X further shows that, across CWE types, the best Recall and Hit@Acc values often occur in the few-shot setting. This contradicts our expectation, since it suggests that LLMs perform worse after learning. To investigate, we analyzed in detail the outputs of the LLMs under the fine-tuning and few-shot settings. We found that, in the few-shot setting, LLMs tend to output more lines as vulnerable, even if these lines do not contain vulnerabilities. Take StarCoder as an example. Fig. 5 depicts a vulnerable code snippet from the Big-Vul dataset in which the vulnerable behavior occurs in lines 3 and 4. In the few-shot setting, StarCoder outputs many lines, such as “[1, 2, 3, 4, 5, 6, 7, 8, 9]”, whereas after fine-tuning, it becomes more cautious and outputs only “[4]”. We observed a similar pattern in other LLMs, and we believe this is the reason for the higher Recall and Hit@Acc values in the few-shot setting: the apparent advantage comes from over-predicting potentially vulnerable lines rather than from more accurate localization.

Figure 5: An example to demonstrate the limitations of StarCoder in vulnerability location
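Applying the Precision, Recall, and Hit@Acc definitions from the experimental setting to the StarCoder example makes the trade-off concrete. The worked computation below assumes $Lines_{ground} = \{3, 4\}$, as stated for Fig. 5, and uses the two StarCoder outputs quoted in observation (1):

```latex
\begin{align*}
\text{Few-shot: } & Lines_{predict} = \{1, \dots, 9\}, \quad Lines_{both} = \{3, 4\}, \\
& \mathit{Precision} = \tfrac{2}{9} \approx 0.22, \quad
  \mathit{Recall} = \tfrac{2}{2} = 1, \quad
  \mathit{Hit@Acc} = 1 \\
\text{Fine-tuned: } & Lines_{predict} = \{4\}, \quad Lines_{both} = \{4\}, \\
& \mathit{Precision} = \tfrac{1}{1} = 1, \quad
  \mathit{Recall} = \tfrac{1}{2} = 0.5, \quad
  \mathit{Hit@Acc} = 1
\end{align*}
```

In this single example, over-prediction yields perfect Recall at the cost of Precision, whereas the cautious fine-tuned output has perfect Precision but lower Recall.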

(2) Mistral is a promising LLM for the vulnerability location task. Overall, most LLMs (all except Phi-2) show improved Precision after fine-tuning, indicating that they can learn to perform better on vulnerability location. Surprisingly, Mistral, a general-purpose text LLM, improves on all metrics after fine-tuning and reaches the best Precision (i.e., 0.133), surpassing all other LLMs. This suggests that Mistral has significant potential to learn how to locate vulnerabilities effectively. If Mistral were trained with pre-training tasks similar to those used for code LLMs, it might become an exceptionally powerful model for software vulnerability location.

Finding-6. Few-shot settings expose LLM limitations, but fine-tuning enhances caution.

Finding-7. Mistral’s significant improvement after fine-tuning showcases its potential.

4.4 RQ-4: Evaluating Vulnerability Description of LLMs

In this RQ, we employ the ROUGE metric to evaluate the LLMs’ vulnerability description capabilities. We conduct a detailed statistical analysis of LLMs’ abilities and also perform a case study to provide a comprehensive assessment of their performance in describing vulnerabilities.

Experimental Setting. We instruct LLMs with a designated task description, guiding them to act as vulnerability descriptors. Table XII illustrates an example of our approach to evaluating LLMs’ proficiency in describing vulnerabilities.

Task Description: Provide a CVE description for the vulnerable C code snippet.

To evaluate the quality of the generated CVE descriptions, we adopt the widely used ROUGE metric [72], a set of metrics for evaluating automatic summarization and machine translation in natural language processing. The metrics compare an automatically produced summary or translation against one or more human-produced reference summaries or translations. Here, we consider three variants: ROUGE-1, ROUGE-2, and ROUGE-L.
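For intuition, ROUGE-1 and ROUGE-2 count overlapping unigrams and bigrams between the generated and reference text, and ROUGE-L is based on their longest common subsequence. The sketch below is a simplified illustration, not the implementation used in the paper: it assumes lowercase whitespace tokenization, no stemming, and F1-style scores, and the two short strings are shortened variants of the descriptions in Table XII.

```python
from collections import Counter

def _f1(overlap, n_candidate, n_reference):
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)

def _ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """F1 over clipped n-gram matches (simplified ROUGE-N)."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))

def rouge_l(candidate, reference):
    """F1 over the longest common subsequence (simplified ROUGE-L)."""
    a, b = candidate.lower().split(), reference.lower().split()
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous candidate token
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return _f1(prev[-1], len(a), len(b))

generated = "A remote denial of service vulnerability in libhevc"
reference = "A remote code execution vulnerability in libhevc"
print(round(rouge_n(generated, reference, 1), 3))  # 0.667
print(round(rouge_n(generated, reference, 2), 3))  # 0.462
print(round(rouge_l(generated, reference), 3))     # 0.667
```

ROUGE-2 is the strictest of the three here because a single differing word breaks two bigrams, which is why it tends to be the lowest score.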

TABLE XI: The comparison of LLMs on software vulnerability description
Methods ROUGE-1 ROUGE-2 ROUGE-L
Fine-Tuning Setting
DeepSeek-Coder 6.7B 0.296 0.125 0.281
CodeLlama 7B 0.505 0.360 0.493
StarCoder 7B 0.532 0.397 0.522
WizardCoder 7B 0.514 0.373 0.502
Mistral 0.536 0.409 0.528
Phi-2 0.308 0.110 0.288
Few-Shot Setting
DeepSeek-Coder 6.7B 0.237 0.078 0.218
DeepSeek-Coder 33B 0.232 0.066 0.216
CodeLlama 7B 0.232 0.088 0.219
CodeLlama 34B 0.247 0.072 0.227
StarCoder 7B 0.229 0.081 0.215
StarCoder 15.5B 0.238 0.080 0.220
WizardCoder 7B 0.218 0.061 0.202
WizardCoder 34B 0.254 0.098 0.240
Mistral 0.264 0.100 0.249
Phi-2 0.253 0.104 0.241
TABLE XII: An example of CodeLlama generating vulnerability description for vulnerable code
Describe vulnerability code (CWE-119, Improper Restriction of Operations within the Bounds of a Memory Buffer)
Task Description Provide a CVE description for the vulnerable C code snippet.
Source Code An example of a C code snippet with vulnerabilities. CVE ID: CVE-2017-0637.
Response From CodeLlama A remote denial of service vulnerability in libhevc in Mediaserver could enable an attacker to use a specially crafted file to cause a device hang or reboot. This issue is rated as High severity due to the possibility of remote denial of service. Product: Android. Versions: 5.0.2, 5.1.1, 6.0, 6.0.1, 7.0, 7.1.1, 7.1.2. Android ID: A-34819017. Android ID: A-34819017.
CVE Description A remote code execution vulnerability in libhevc in Mediaserver could enable an attacker using a specially crafted file to cause memory corruption during media file and data processing. This issue is rated as Critical due to the possibility of remote code execution within the context of the Mediaserver process.Product: Android. Versions: 5.0.2, 5.1.1, 6.0, 6.0.1, 7.0, 7.1.1, 7.1.2. Android ID: A-34064500.

Results. Table XI presents the vulnerability description capabilities of LLMs under different settings. According to the results, we make the following observations: (1) Fine-tuning can significantly enhance the performance of LLMs in vulnerability description, especially for CodeLlama, StarCoder, WizardCoder, and Mistral. After fine-tuning, their ROUGE-1, ROUGE-2, and ROUGE-L scores improve several-fold. This suggests that these LLMs possess strong learning capabilities and can extract more gains from historical data. (2) The low ROUGE-2 scores indicate that DeepSeek-Coder and Phi-2 have limited ability to generate accurate and relevant bigrams (pairs of consecutive words) in vulnerability descriptions, pointing to potential issues in capturing specific and detailed information.

Case Study. To demonstrate the effectiveness of LLMs in generating vulnerability descriptions, we present an example of a vulnerability (CWE-119) described by CodeLlama, as shown in Table XII. This example is a vulnerability in the Android project (libhevc in Mediaserver), categorized as CWE-119 (Improper Restriction of Operations within the Bounds of a Memory Buffer). It is noteworthy that, even when provided with only the vulnerable code, CodeLlama produces text highly similar to the CVE description (highlighted in orange), indicating that LLMs pre-trained on extensive text and code are capable of comprehending the essence and crucial features of vulnerabilities and expressing this information in natural language.

Finding-8: CodeLlama, StarCoder, WizardCoder, and Mistral possess strong learning capabilities and can extract more gains from historical data.

5 Threats to Validity

Threats to Internal Validity are mainly two-fold. The first is the design of the prompt that instructs LLMs to produce responses. We design our prompt according to practical advice [71] that has been verified by many users online and can elicit good responses from LLMs. Furthermore, LLMs generate responses with some randomness even when given the same prompt. Therefore, we set “temperature” to 0, which minimizes the randomness, and we collected all results within two days to avoid the models being upgraded. The second is potential mistakes in the implementation of the studied baselines. To minimize this threat, we directly use the original source code shared by the respective authors.
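The effect of the “temperature” setting can be illustrated with a small sketch. It assumes the common convention in which logits are divided by the temperature before the softmax and a temperature of 0 is treated as greedy (argmax) decoding; how the inference tooling used in this study handles the value is not stated here, and the function name is hypothetical.

```python
import math

def next_token_distribution(logits, temperature):
    """Softmax over logits / temperature; temperature 0 is treated as argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=logits.__getitem__)
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # toy scores for three candidate tokens
for t in (1.0, 0.1, 0):
    print(t, [round(p, 3) for p in next_token_distribution(logits, t)])
```

As the temperature decreases, probability mass concentrates on the highest-scoring token, so sampling becomes closer to deterministic.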

Threats to External Validity may correspond to the generalization of the studied dataset. To mitigate this threat, we adopt the largest vulnerability dataset with diverse information about the vulnerabilities, which are collected from practical projects and recorded in the Common Vulnerabilities and Exposures (CVE). However, we do not consider recently discovered vulnerabilities. In addition, we do not adopt another large-scale vulnerability dataset, SARD, since it is built manually and cannot reflect the distinct characteristics of real-world vulnerabilities [43, 45].

Threats to Construct Validity mainly correspond to the performance metrics in our evaluations. To minimize such threats, we consider several widely used performance metrics (e.g., Accuracy, Precision, Recall, and ROUGE) to evaluate the performance of LLMs on different types of tasks.

6 Conclusion

This paper aims to comprehensively investigate the capabilities of LLMs for software vulnerability tasks as well as their impacts. To achieve that, we adopt a large-scale vulnerability dataset (named Big-Vul) and then conduct several experiments focusing on four dimensions: (1) Vulnerability Detection, (2) Vulnerability Assessment, (3) Vulnerability Localization, and (4) Vulnerability Description. Overall, although LLMs show some ability in certain areas, they still need further improvement to be competent in software vulnerability related tasks. Our research conducts a comprehensive survey of LLMs’ capabilities and provides a reference for enhancing their understanding of software vulnerabilities in the future.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 62202419 and No. 62172214), the Ningbo Natural Science Foundation (No. 2022J184), and the Key Research and Development Program of Zhejiang Province (No. 2021C01105).

References

  • [1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [2] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever et al., “Improving language understanding by generative pre-training,” 2018.
  • [3] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing,” ACM Computing Surveys, vol. 55, no. 9, pp. 1–35, 2023.
  • [4] R. Tang, Y.-N. Chuang, and X. Hu, “The science of detecting llm-generated texts,” arXiv preprint arXiv:2303.07205, 2023.
  • [5] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong, “Codegen: An open large language model for code with multi-turn program synthesis,” arXiv preprint arXiv:2203.13474, 2022.
  • [6] Q. Zheng, X. Xia, X. Zou, Y. Dong, S. Wang, Y. Xue, Z. Wang, L. Shen, A. Wang, Y. Li et al., “Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x,” arXiv preprint arXiv:2303.17568, 2023.
  • [7] Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv preprint arXiv:2109.00859, 2021.
  • [8] Y. Wang, H. Le, A. D. Gotmare, N. D. Bui, J. Li, and S. C. Hoi, “Codet5+: Open code large language models for code understanding and generation,” arXiv preprint arXiv:2305.07922, 2023.
  • [9] D. AI, “Deepseek coder: Let the code write itself,” https://github.com/deepseek-ai/DeepSeek-Coder, 2023.
  • [10] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source be with you!” arXiv preprint arXiv:2305.06161, 2023.
  • [11] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
  • [12] C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in Proceedings of the 45th International Conference on Software Engineering (ICSE 2023). Association for Computing Machinery, 2023.
  • [13] C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,” arXiv preprint arXiv:2304.00385, 2023.
  • [14] ——, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” in Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 959–971.
  • [15] R. Pan, A. R. Ibrahimzada, R. Krishna, D. Sankar, L. P. Wassi, M. Merler, B. Sobolev, R. Pavuluri, S. Sinha, and R. Jabbarvand, “Understanding the effectiveness of large language models in code translation,” arXiv preprint arXiv:2308.03109, 2023.
  • [16] S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring llm-based general bug reproduction,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE).   IEEE, 2023, pp. 2312–2323.
  • [17] D. Zan, B. Chen, F. Zhang, D. Lu, B. Wu, B. Guan, W. Yongji, and J.-G. Lou, “Large language models meet nl2code: A survey,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2023, pp. 7443–7464.
  • [18] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in International conference on software engineering (ICSE), 2023.
  • [19] S. Khan and S. Parkinson, “Review into state of the art of vulnerability assessment using artificial intelligence,” Guide to Vulnerability Analysis for Computer Networks and Systems, pp. 3–32, 2018.
  • [20] T. H. Le, H. Chen, and M. A. Babar, “A survey on data-driven software vulnerability assessment and prioritization,” ACM Computing Surveys (CSUR), 2021.
  • [21] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, “A c/c++ code vulnerability dataset with code changes and cve summaries,” in Proceedings of the 17th International Conference on Mining Software Repositories, 2020, pp. 508–512.
  • [22] “Replication,” 2024. [Online]. Available: https://github.com/vinci-grape/VulEmpirical
  • [23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [24] Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155, 2020.
  • [25] D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu et al., “Graphcodebert: Pre-training code representations with data flow,” arXiv preprint arXiv:2009.08366, 2020.
  • [26] D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “Unixcoder: Unified cross-modal pre-training for code representation,” arXiv preprint arXiv:2203.03850, 2022.
  • [27] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” arXiv preprint arXiv:2103.06333, 2021.
  • [28] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
  • [29] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in neural information processing systems, vol. 30, 2017.
  • [30] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” arXiv preprint arXiv:1909.08593, 2019.
  • [31] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [32] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, H. Lovenia, Z. Ji, T. Yu, W. Chung et al., “A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity,” arXiv preprint arXiv:2302.04023, 2023.
  • [33] C. MITRE, “Common vulnerabilities and exposures (cve),” 2023. [Online]. Available: https://cve.mitre.org/
  • [34] G. Bhandari, A. Naseer, and L. Moonen, “Cvefixes: automated collection of vulnerabilities and their fixes from open-source software,” in Proceedings of the 17th International Conference on Predictive Models and Data Analytics in Software Engineering, 2021, pp. 30–39.
  • [35] Symantec, “securityfocus,” 2023. [Online]. Available: https://www.securityfocus.com/
  • [36] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, 2019, pp. 10197–10207.
  • [37] M. Fu and C. Tantithamthavorn, “Linevul: A transformer-based line-level vulnerability prediction,” 2022.
  • [38] S. Cao, X. Sun, L. Bo, R. Wu, B. Li, and C. Tao, “Mvd: Memory-related vulnerability detection based on flow-sensitive graph neural networks,” arXiv preprint arXiv:2203.02660, 2022.
  • [39] Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu, and Z. Chen, “Sysevr: A framework for using deep learning to detect software vulnerabilities,” IEEE Transactions on Dependable and Secure Computing, 2021.
  • [40] X. Cheng, G. Zhang, H. Wang, and Y. Sui, “Path-sensitive code embedding via contrastive learning for software vulnerability detection,” in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 519–531.
  • [41] Y. Wu, D. Zou, S. Dou, W. Yang, D. Xu, and H. Jin, “Vulcnn: An image-inspired scalable vulnerability detection system,” 2022.
  • [42] Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng, and Y. Zhong, “Vuldeepecker: A deep learning-based system for vulnerability detection,” in Proceedings of the 25th Annual Network and Distributed System Security Symposium, 2018.
  • [43] D. Hin, A. Kan, H. Chen, and M. A. Babar, “Linevd: Statement-level vulnerability detection using graph neural networks,” arXiv preprint arXiv:2203.05181, 2022.
  • [44] X. Zhan, L. Fan, S. Chen, F. We, T. Liu, X. Luo, and Y. Liu, “Atvhunter: Reliable version detection of third-party libraries for vulnerability identification in android applications,” in 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE).   IEEE, 2021, pp. 1695–1707.
  • [45] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vulnerability detection: Are we there yet,” IEEE Transactions on Software Engineering, 2021.
  • [46] C. Ni, X. Yin, K. Yang, D. Zhao, Z. Xing, and X. Xia, “Distinguishing look-alike innocent and vulnerable code by subtle semantic representation learning and explanation,” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1611–1622.
  • [47] C. Ni, L. Shen, W. Wang, X. Chen, X. Yin, and L. Zhang, “Fva: Assessing function-level vulnerability by integrating flow-sensitive structure and code statement semantic,” in 2023 IEEE/ACM 31st International Conference on Program Comprehension (ICPC).   IEEE, 2023, pp. 339–350.
  • [48] A. Feutrill, D. Ranathunga, Y. Yarom, and M. Roughan, “The effect of common vulnerability scoring system metrics on vulnerability exploit delay,” in 2018 Sixth International Symposium on Computing and Networking (CANDAR).   IEEE, 2018, pp. 1–10.
  • [49] G. Spanos and L. Angelis, “A multi-target approach to estimate software vulnerability characteristics and severity scores,” Journal of Systems and Software, vol. 146, pp. 152–166, 2018.
  • [50] T. H. M. Le, D. Hin, R. Croft, and M. A. Babar, “Deepcva: Automated commit-level vulnerability assessment with deep multi-task learning,” in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE).   IEEE, 2021, pp. 717–729.
  • [51] Z. Li, D. Zou, S. Xu, Z. Chen, Y. Zhu, and H. Jin, “Vuldeelocator: a deep learning-based fine-grained vulnerability detector,” IEEE Transactions on Dependable and Secure Computing, 2021.
  • [52] Y. Li, S. Wang, and T. N. Nguyen, “Vulnerability detection with fine-grained interpretations,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 292–303.
  • [53] C. Ni, W. Wang, K. Yang, X. Xia, K. Liu, and D. Lo, “The best of both worlds: Integrating semantic features with expert features for defect prediction and localization,” in Proceedings of the 2022 30th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 2022, pp. 672–683.
  • [54] C. Ni, K. Yang, X. Xia, D. Lo, X. Chen, and X. Yang, “Defect identification, categorization, and repair: Better together,” arXiv preprint arXiv:2204.04856, 2022.
  • [55] Q. Zhang, Y. Zhao, W. Sun, C. Fang, Z. Wang, and L. Zhang, “Program repair: Automated vs. manual,” arXiv preprint arXiv:2203.05166, 2022.
  • [56] Z. Chen, S. J. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, and M. Monperrus, “Sequencer: Sequence-to-sequence learning for end-to-end program repair,” IEEE Transactions on Software Engineering, 2019.
  • [57] Q. Zhu, Z. Sun, Y.-a. Xiao, W. Zhang, K. Yuan, Y. Xiong, and L. Zhang, “A syntax-guided edit decoder for neural program repair,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 341–353.
  • [58] J. Sun, Z. Xing, H. Guo, D. Ye, X. Li, X. Xu, and L. Zhu, “Generating informative cve description from exploitdb posts by extractive summarization,” ACM Transactions on Software Engineering and Methodology (TOSEM), 2022.
  • [59] H. Guo, S. Chen, Z. Xing, X. Li, Y. Bai, and J. Sun, “Detecting and augmenting missing key aspects in vulnerability descriptions,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 3, pp. 1–27, 2022.
  • [60] H. Guo, Z. Xing, S. Chen, X. Li, Y. Bai, and H. Zhang, “Key aspects augmentation of vulnerability description based on multiple security databases,” in 2021 IEEE 45th Annual Computers, Software, and Applications Conference (COMPSAC).   IEEE, 2021, pp. 1020–1025.
  • [61] H. Guo, Z. Xing, and X. Li, “Predicting missing information of key aspects in vulnerability reports,” arXiv preprint arXiv:2008.02456, 2020.
  • [62] G. Fan, R. Wu, Q. Shi, X. Xiao, J. Zhou, and C. Zhang, “Smoke: scalable path-sensitive memory leak detection for millions of lines of code,” in 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE).   IEEE, 2019, pp. 72–82.
  • [63] W. Li, H. Cai, Y. Sui, and D. Manz, “Pca: memory leak detection using partial call-path analysis,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1621–1625.
  • [64] “Joern,” 2023. [Online]. Available: https://github.com/joernio/joern
  • [65] Z. Luo, C. Xu, P. Zhao, Q. Sun, X. Geng, W. Hu, C. Tao, J. Ma, Q. Lin, and D. Jiang, “Wizardcoder: Empowering code large language models with evol-instruct,” arXiv preprint arXiv:2306.08568, 2023.
  • [66] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al., “Mistral 7b,” arXiv preprint arXiv:2310.06825, 2023.
  • [67] “Phi-2: The surprising power of small language models,” 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
  • [68] “Hugging face open llm leaderboard,” 2023. [Online]. Available: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
  • [69] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
  • [70] “Hugging face,” 2023. [Online]. Available: https://huggingface.co
  • [71] J. Shieh, “Best practices for prompt engineering with openai api,” OpenAI, February 2023. [Online]. Available: https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
  • [72] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text summarization branches out, 2004, pp. 74–81.