A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification

Norbert Tihanyi, Ridhi Jain, Mohamed Amine Ferrag (Technology Innovation Institute (TII), UAE)
Yiannis Charalambous, Youcheng Sun, Lucas C. Cordeiro (University of Manchester, UK)

Abstract

This paper introduces an innovative approach that combines Large Language Models (LLMs) with Formal Verification strategies for automatic software vulnerability repair. Initially, we employ Bounded Model Checking (BMC) to identify vulnerabilities and extract counterexamples. These counterexamples are supported by mathematical proofs and the stack trace of the vulnerabilities. Using a specially designed prompt, we combine the original source code with the identified vulnerability, including its stack trace and counterexample that specifies the line number and error type. This combined information is then fed into an LLM, which is instructed to attempt to fix the code. The new code is subsequently verified again using BMC to ensure the fix succeeded. We present the ESBMC-AI framework as a proof of concept, leveraging the well-recognized and industry-adopted Efficient SMT-based Context-Bounded Model Checker (ESBMC) and a pre-trained transformer model to detect and fix errors in C programs, particularly in critical software components. We evaluated our approach on $50,000$ C programs randomly selected from the FormAI dataset with their respective vulnerability classifications. Our results demonstrate ESBMC-AI’s capability to automate the detection and repair of issues such as buffer overflow, arithmetic overflow, and pointer dereference failures with high accuracy. ESBMC-AI is a pioneering initiative, integrating LLMs with BMC techniques, offering potential integration into the continuous integration and deployment (CI/CD) process within the software development lifecycle.

Index Terms:

Large Language Models, Generative Pre-trained Transformers, Formal Verification, Fault Localization, and Program Repair.

I Introduction

Implementation bugs can impact the software quality by causing crashes, data loss, poor performance, or incorrect results [1, 2]. These bugs often lead to vulnerabilities, emphasizing the need for early identification and resolution [3]. Consequently, automated software testing [4, 5, 6], fault localization [7], and repair [8] have been active research areas over the past few decades. While classic static analysis aids early bug detection, it introduces false positives impacting developer productivity [9, 10]. Recent deep learning (DL) advancements have drawn the attention of the software engineering community, offering potential solutions to longstanding issues [11, 12, 13]. For example, DLFix [14] and DeepRepair [15] treat source code as text; however, as opposed to natural language, source code has a stronger syntax and semantics [16]; further, as these approaches rely on previously seen data, they may produce incorrect results. Often, this comprises small snippets of buggy code [14, 17, 18]; thus, the model may not have the details of the bug, its origin, and how it interacts with the rest of the program. Contrarily, CURE [16] employs a programming language model to parse, analyze, and model the source code. DEAR [19] combines spectrum-based fault localization with DL to learn the appropriate code-context.

Recent advances in Large Language Models (LLMs) such as OpenAI’s Codex [20], a GPT-like LLM tailored for code program repair [21, 22], have demonstrated significant promise in addressing software engineering and testing challenges. For instance, InferFix [23] applies LLMs to fix issues such as Null Pointer Dereference (NPD), Resource Leak (RL), and Thread Safety Violation (TSV). Xia et al. [24] show that applying state-of-the-art LLMs directly can outperform existing automated program repair techniques. Indeed, leveraging LLMs holds potential in vulnerability detection and Automatic Code Repair (ACR) [25, 26, 27]. However, deploying LLMs in software verification has limitations. Notably, state-of-the-art LLMs struggled to respond accurately when verifying software containing arithmetic expressions involving non-deterministic variables. In ACR, addressing a specific bug requires bit-precise calculations, appropriate Satisfiability Modulo Theory (SMT) solving skills, accurate parsing of the Abstract Syntax Tree (AST) [28], and precise data flow analysis [29]. These tasks demand precise and strict answers, where the non-deterministic behavior of LLMs can be problematic. Here, a precise external guide is needed for LLMs to pinpoint the exact location of vulnerabilities in the code. To address the unreliability of LLMs when used as stand-alone vulnerability detection tools, we propose integrating an LLM with the Efficient SMT-based Context-Bounded Model Checker (ESBMC) [30] a well-recognized and industry-adopted Formal Verification (FV) tool, which produces very low false negative and false positive findings, thereby enhancing the method’s efficiency. Figure 1 illustrates our counterexample-guided ACR methodology, combining BMC and LLM. The process involves the following steps: ➀ Initial Verification: The BMC module takes the source code provided by the user and verifies or falsifies a property specification. ➁ Failure Handling: If the verification fails, the BMC engine refutes the safety/security property. The original code and the counterexample for the property violation generated by BMC are then passed to the LLM module. ➂ Iterative Correction: The LLM engine receives customized queries to produce potentially corrected code, which is then fed back to the BMC module to verify whether the corrected version meets the initial safety specification.

Refer to caption — Figure 1: An overview of the ESBMC-AI framework. Initially, a C source code is verified with ESBMC. If the verification fails, the property violation output from ESBMC, along with the original C code, is fed to the LLM to obtain the potentially fixed code. This process is repeated for the generated C code until it can be successfully verified by ESBMC.

In this paper, we aim to address the following research questions:

This research aims to study and identify the impact of formal verification tool-based feedback on LLMs’ ability to repair faulty C code. The main original contributions of this work are as follows:

1.

We propose a novel software verification and repair approach, ESBMC-AI, which leverages the Efficient SMT-based Context-Bounded Model Checker (ESBMC) to provide stack traces and counterexamples of given vulnerabilities to an LLM for code repair;
2.

A comprehensive experimental assessment on over $1,000$ C programs, randomly selected from the FormAI dataset [31], to examine the effectiveness of ESBMC-AI in repairing program codes;
3.

Overall, formal program verification is undecidable [32, 33], making it impossible to create a computational method that can determine whether any given program is completely error-free. It is also ambitious to expect proof that a fixed patch does not disrupt the original program’s functionality. To address this issue, we have calculated the cyclomatic complexity of each generated patch to identify any significant deviations from the original program. Additionally, each patch has been verified by at least one human expert to ensure that the accuracies reported in this study are as precise as possible.
4.

With ESBMC-AI, we achieved a code repair accuracy of $90.40$ % for buffer overflow on scanf, $86.47$ % for division by zero, $70.27$ % for arithmetic overflow on add, and $69.66$ % for array bounds violation errors.
5.

We release ESBMC-AI 0.5.1¹¹1https://pypi.org/project/esbmc-ai/ as a PyPI module for industrial partners and the research community to use for ACR. Further details and results are available on the project website: https://github.com/esbmc/esbmc-ai

This paper is organized as follows: Section II provides motivating examples for this counterexample-guided code repair framework. We discuss prior related work in Section III, while the background is discussed in Section IV. We present our experimental results in Section V, which includes the experimental setup details and findings. Lastly, we conclude our work with future research directions in Section VII.

II Motivation Example

Detecting software vulnerabilities with LLMs is challenging, given their tendency to generate multiple alternative solutions for the same problem without explicitly identifying the root cause. While this adaptability is advantageous in language processing and text generation, it introduces complexities when proposing solutions for even trivial software security vulnerabilities [26, 34, 35]. Although Bounded Model Checking (BMC) excels in identifying vulnerabilities through mathematical proofs, rectifying code based on BMC output demands a deep understanding of the tools and a background in formal verification. Automating code repair using BMC holds great potential for streamlining secure software development but has complexities. Despite their current limitations in software vulnerability identification, LLMs can significantly enhance their effectiveness by integrating the output from a BMC tool. This integration could lead to a surprisingly effective vulnerability detection and code repair approach. For example, consider the simple C code fragment in Listing II, where the initialization of variable $z$ triggers an arithmetic overflow.

{listing}

[H] An arithmetic overflow example C program

Upon querying GPT-3.5 about potential issues with the code fragment in Listing II, it responded: ”There is no problem with this code. The code correctly calculates the value of $z$ as the square of $y$ and then calculates the value of $r$ as $z$ divided by 1000. Even after rephrasing the question as “Is there any overflow in the program?”, the model continued to respond inaccurately, specifically: ”The value of $y$ is calculated as the cube of $x$ , which is $77\times 77\times 77=456533$ , and the value of $z$ is calculated as the square of $y$ , which is $456533\times 456533=208598234889$ . This value can be represented within the $32$ -bit int variable range, so there is no integer overflow.”

According to Imani et al. [36], GPT models still struggle with evaluating arithmetic operations in programming languages. This is evident in the example where the model correctly calculates $456,533\times 456,533=208,598,234,889$ but incorrectly claims that this value can be represented by a $32$ -bit int variable, overlooking the maximum value a 32-bit integer can fit, i.e., $2^{32}=4,294,967,296$ .

While the solution might be correct for this specific case when using more advanced LLMs such as the latest GPT-4o [37] or Falcon-180B [38], the same issue could arise when handling more complex arithmetic operations during program repair, especially when the BMC module generates a very complex SMT formula to be solved. Four claims must be resolved during verification, even for a small program, listed in Listing II. The SMT formula generated involves 64-bit BitVec operations, requiring a robust SMT solver such as Z3 [39], Boolector [40], or CVC5 [41] to solve the underlying equation.

Initial experiments suggest that solving such straightforward examples with high confidence remains challenging. However, augmenting the capabilities of an LLM with an external tool can greatly improve its code repair abilities. Using BMC, we can easily detect the arithmetic overflow in our simple program: PROPERTY VIOLATION: line 4 function main arithmetic overflow on mul !overflow("*", y, y). When we ask the LLM to correct the code based on the specified property violation, supported by the stack trace and exact counterexample, we receive an accurate answer: “The issue is an arithmetic overflow on the line int $z=y*y;$ . The int data type in C has a limited range of values that it can represent, typically $-2,147,483,648$ to $2,147,483,647$ . In the line int $z=y*y$ ; there is a risk of integer overflow if the value of y is large. This large value exceeds an int data type’s range, resulting in an arithmetic overflow.” We advocate that assistance from an external BMC tool can guide the LLM towards the correct solution. The LLM utilizes this to generate code that is safeguarded against arithmetic issues by substituting the $int$ data type with $long$ $long$ $int$ , as illustrated in Listing II. {listing}[H] Corrected code for the code shown in Listing II

Upon running the BMC tool against the updated code, we received a ‘‘VERIFICATION SUCCESSFUL’’ output, indicating no integer boundary violations or overflows in the modified code. This small example provides convincing evidence that this approach is feasible and highly useful for ACR in industries requiring formal verification for critical software components. This motivates us to further investigate and explore this promising research direction in greater detail.

III Related Work

III-A Traditional Vulnerability Detection

Traditional vulnerability detection methods often rely on static [42, 43, 44] and dynamic [45, 46, 47] analysis techniques to identify security weaknesses in software. Although static approaches, including static code analysis [48, 49, 50], abstract syntax tree (AST) parsing [51, 28], and data flow analysis [29, 52] enable early detection, they have high false positive rates [53]. In contrast, dynamic analysis techniques, such as penetration testing [54, 55, 56], fuzz testing [46, 47], and runtime monitoring [57, 58, 59], provide a more realistic assessment by evaluating software behavior during execution. However, these approaches are often input-dependent, provide only partial code coverage, and are expensive. Hybrid approaches [60, 61, 62] combine static and dynamic analysis to balance their strengths and weaknesses. Bhayat et al. [61] propose a comprehensive strategy integrating pre- and post-deployment techniques. Pre-deployment involves identifying vulnerabilities through static analysis using bounded model checking and symbolic execution. Post-deployment focuses on mitigating these vulnerabilities through hardware measures and software runtime protection. The hybrid approach underscores the effectiveness of integrated protection over individual components. Aljaafari et al. [62] proposed Ensembles of Bounded Model Checking with Fuzzing (EBF) that combine BMC with Gray-Box Fuzzing (GBF) in OpenGBF to detect software vulnerabilities in concurrent programs.

Alternately, BMC provides reliable results with reduced costs as they limit the exploration depth for the test program. Song et al. [63] introduce ESBMC-Solidity, a Solidity frontend for ESBMC designed to verify the security of smart contracts on Ethereum’s blockchain network. Alshmrany et al. [64] present an upgraded version of FuSeBMC, a tool that uses BMC and Evolutionary Fuzzing engines for improved code coverage and bug detection. However, these approaches do not scale well even with the restricted depth exploration.

III-B Deep Learning-based Vulnerability Detection

DeepFix [13] is a multi-layer sequence-to-sequence neural network that can fix compile-time errors. SEQUENCER [17] employs a similar technique to fix logical bugs by suggesting single-line patches, requiring a larger vocabulary. VRepair generates multiline patches using transfer learning [65]. GetaFix [66] learns to generate patches by analyzing past human commits. Similarly, DEAR [19] uses AST-differencing to learn fine-grained changes and implements fault localization to identify problematic statements and produce relevant patches. DEAR and several other studies [67, 18] model ACR as a Neural Machine Translation (NMT) [68] problem. DeepRepair [15] uses DL code similarity to generate and validate patches. Huang et al. [69] leverage Large Language Models of Code (LLMCs) for ACR by fine-tuning these models under the NMT paradigm.

Latest advancements in DL, transformers, and LLMs have revolutionized natural language processing, enabling machines to understand and generate human-like language [70, 71]. These models can process vast amounts of textual data and extract meaningful information, making them useful tools for applications such as language translation, text summarization, sentiment analysis, and question-answering systems. LLMs’ ability to generate code [72, 73, 74] has made them a popular candidate for ACR [75, 23, 74, 76].

Many studies on LLM for ACR evaluate their approaches [77, 26] on QuixBugs [78], containing only Java and Python test programs. Researchers have also investigated the potency of GPT in identifying and repairing software bugs [79, 80, 77, 25, 26, 27]. Self-Edit [81] employs a generate-and-edit approach using test execution results from LLM-generated code to fix and improve code quality. RepairAgent [82] is an LLM-based agent for program repair, enabling dynamic bug-fixing through interaction with bug information, repair tools, and validation mechanisms. SecRepair [83], leveraging CodeGen2 and reinforcement learning, identifies and fixes vulnerabilities with descriptive code comments. MOREPAIR [84] introduces a fine-tuning approach for LLMs in ACR, emphasizing syntactic adaptation and logical reasoning behind code changes.

GPT models, with billions of parameters, produce accurate and contextually aware language models that are customizable through fine-tuning for specific tasks. Nonetheless, studies show that the codes and patches synthesized by GPT models may be incorrect and untrustworthy [34, 85, 25, 86, 87, 88]. New research proposes a prompt-based approach to verify the generated programs [89, 90, 91]. The quality of fixes generated depends on the feedback. For instance, COMPCODER [90] uses the compiler feedback to repair code but misses run-time errors. D4C [92] aligns LLM output with their training objective for effective whole-program refinement without prior fault localization. LLM-CompDroid [93] enhances Android app reliability by integrating LLMs with traditional tools to detect and repair XML configuration compatibility bugs. RING [94] is a multilingual repair engine for correcting last-mile coding errors across multiple languages. ChatRepair [95] uses a conversation-driven approach with prior test failure information to generate patches. Similarly, Conversational ACR [89] validates generated patches against a test suite, though test suite-based testing lacks completeness and may be inconsistently available.

TABLE I: Comparison of related software bug detection and repair approaches.

Framework details					Repair
Name	Year	Open Source	Dataset	Language	Granularity	Compiles	Method
Bhayat et al. [61]	2021	✗	SV-COMP [96]	C/C++	N/A	N/A	N/A
OpenGBF [62]	2022	✓	SV-COMP [96]	C/C++	N/A	N/A	N/A
ESBMC-Solidity [63]	2022	✓	Own²²2https://github.com/esbmc/esbmc/tree/master/regression/esbmc-solidity	Solidity	N/A	N/A	N/A
FuseBMC [64]	2022	✓	Test-Comp [97]	C/C++	N/A	N/A	N/A
COMPCODER [90]	2022	✗	AdVTest [98], CodeSearchNet [99]	Python	Program	✓	Compiler Feedback based code completion
Jigsaw [76]	2022	✗	PandasEval1, PandasEval2 [76]³³3https://github.com/microsoft/JigsawDataset/tree/main/datasets	Python	Snippets	✗	Program Synthesis
Conversational ACR [89]	2023	✗	QuixBugs [78]	Java, Python	Function	✗	Prompt-based repair
ChatRepair [95]	2023	✗	Defects4J [100], QuixBugs [78]	Java, Python	Patch	✗	Learns from previously failed tests
Pearce et al. [25]	2023	✓	ExtractFix [101]	C, Python	Program	✓	Security tests-based
RING [94]	2023	✗	BIFI [102], Bavishi et al. [103], TFix [104]	Excel, C, PowerFx, PS, Python, JS	Program	✓	Compiler message
Huang et al. [69]	2023	✓	Defects4J [100], CPatMiner [19]	Java, C/C++, Python	Patch	✗	Model trained on buggy code - fix pair
FuzzGPT [105]	2024	✗	Own [105] (unavailable)	Python	-	✗	LLM-based Fuzzing
RepairAgent [82]	2024	✗	Defects4J [100]	Java	Program	✓	Invoking suitable tools
SecRepair [83]	2024	✗	InstructVul [83] (unavailable)	C/C++	Program	✓	Fine-tuned instruction training
Self-Edit [81]	2024	✓	APPS [106], HumanEval [20]	Python	Program	✓	Compile/Runtime with tests
LLM-CompDroid [81]	2024	✗	ConfFix [107]	XML	Configuration	✗	Prompt-based
ContrastRepair [108]	2024	✗	Defects4J [100], HumanEval [20], QuixBugs [78]	Java, Python	Program	✓	Contrastive test-pair
CigaR [91]	2024	✓	Defects4J [100], HumanEval [20]	Java	Patches	✗	Prompt optimization
ESBMC-AI	2024	✓	FormAI [31]	C/C++	Program	✓	Formal verification based feedback

Our work uses automated theorem provers to explore the uninvestigated combination of LLMs with FV techniques, particularly symbolic model checking. Table I gives a quick view of how we position our ESBMC-AI framework concerning existing work. A desirable balance between two disparate concepts, symbolic verification and DL, can enhance the quality and speed of program repair. Relevant feedback that can be obtained from state-of-the-art software model checkers, such ESBMC [30], can show massive improvements in the patches suggested by GPTs.

IV Background: Formal Verification Meets Large Language Models

BMC and LLMs are complementary techniques used in software engineering and artificial intelligence, respectively, and they are not directly connected. Given the current knowledge of automated reasoning and software verification, both methods have yet to be used to solve similar problems, such as software bug detection and debugging. Here, we use BMC to verify programs and provide diagnostic counterexamples via text to LLM. In contrast, LLM is used to understand the textual trace that leads to the program bug and thus tentatively produce code to fix the identified vulnerability.

IV-A Bounded Model Checking (BMC)

BMC is a primary component of our proposed counterexample-guided repair framework. State-of-the-art BMC engines support various industrial programming languages [109, 110, 111, 112]. BMC represents the program as a state transition system extracted from the control-flow graph (CFG) [113], which is built during the translation from program text to Static Single Assignment (SSA) form. SSA is the “language” that the state-of-the-art SAT/SMT solvers understand, i.e., SSA expressions are converted to an SMT formula [109]. A node in the CFG represents either a (non-) deterministic assignment, while an edge in the CFG represents a possible change in the program’s control location.

We define a state transition system, denoted by $M$ , as a triple $\left(S,R,s_{1}\right)$ where $S$ represents the set of states, $R\subseteq S\times S$ represents the set of transitions and $s_{1}\subseteq S$ represents the set of initial states. A state $s\in S$ consists of the value of the program counter pc and the values of all program variables. An initial state $s_{1}$ assigns the initial program location of the CFG to pc. We identify each transition $T=(s_{i},s_{i+1})\in R$ between two states $s_{i}$ and $s_{i+1}$ with a logical formula $T(s_{i},s_{i+1})$ . This captures the constraints on the corresponding values of the program counter and the program variables.

We also define properties under verification in BMC: $\phi(s)$ is the logical formula encoding states satisfying a safety/security property, and $\psi(s)$ is the logical formula encoding states satisfying the completeness threshold, i.e., states corresponding to the program terminating. $\psi(s)$ will contain unwindings no deeper than the maximum number of loop iterations in the program. Note that, in our notation, termination, and error are mutually exclusive: $\phi(s)\wedge\psi(s)$ is by construction unsatisfiable; $s$ is a deadlock state if $T(s_{i},s_{i+1})\vee\phi(s)$ is unsatisfiable. The associated BMC problem is formulated by constructing the following logical formula:

\text{BMC(k)}=I(s_{1})\wedge\bigwedge^{k-1}_{i=1}T(s_{i},s_{i+1})\wedge\bigvee% ^{k}_{i=1}\neg\phi(s_{i}).

(1)

Here, $I$ the set of initial states of $M$ and $T(s_{n},s_{n+1})$ is the transition relation of $M$ between time steps $i$ and $i+1$ . Hence, $I(s_{1})\wedge\bigwedge^{k-1}_{i=1}T(s_{i},s_{i+1})$ represents the executions of $M$ of length $k$ and $BMC(k)$ can be satisfied if and only if for some $i\leq k$ there exists a reachable state at time step $i$ in which $\phi$ is violated. Suppose $BMC(k)$ is satisfiable. In that case, $\phi$ is violated, and the SMT solver provides a satisfying assignment from which we can extract the values of the program variables to construct a counterexample.

We define a counterexample (or trace) for a violated property $\phi$ as a finite sequence of states $s_{1},\ldots,s_{k}$ with $s_{1},\ldots,s_{k}\in S$ , and $T(s_{i},s_{i+1})$ for $0\leq i<k$ . This sequence informs our LLM engine on reproducing the software vulnerability since it tells how to go from the program entry point to the property violation. Suppose that equation (1) is unsatisfiable. We could conclude that no error state is reachable in $k$ steps or less. In this case, we use this information to conclude that no software vulnerability exists in the program up to the bound $k$ .

In our method, counterexamples are pivotal in guiding the LLM model. They enable the LLM to propose code corrections, taking inspiration from these counterexamples. Each counterexample specifies the exact trace, line number, and variable name, effectively highlighting the issue within the code. Without these counterexamples, even a simple code, as observed in the motivation section, could pose challenges for the LLM in suggesting a suitable fix. Further, it is essential to note that these counterexamples are based on rigorous mathematical proofs of whether a property holds. Consequently, the likelihood of introducing false positive findings is reduced to a very minimal level (though implementation errors may still exist), unlike results from simple static analysis tools.

IV-B Large Language Models (LLMs)

LLMs are DL systems based on the transformer architecture. They can understand, process, and generate human-like natural language. The input to an LLM consists of a sequence of tokens representing words, subwords, or characters transformed into a high-dimensional vector space using an embedding technique. These embedded tokens pass through multiple network layers, each applying non-linear transformations governed by learnable parameters. The output is often a probability distribution over possible next tokens, with the model selecting the highest probability token. While LLMs are less efficient than state-of-the-art BMC tools for exact arithmetic operations and bounded model checking tasks, they excel in various natural language processing tasks, such as translation, question answering, and text generation. Transforming violated properties into human-like sentences enhances the LLM’s understanding of code issues, allowing BMC counterexamples to correct erroneous code effectively.

Tom et al. [114] introduced GPT-3, the third iteration of the Generative Pretrained Transformer model developed by OpenAI. This paper’s primary focus is on the few-shot learning capability of language models. The authors demonstrate that language models start exhibiting remarkable few-shot performance when scaled up, essentially learning from a limited number of examples. Lampinen et al. [115] investigated how AI systems interpret, understand, and apply knowledge from explanations provided in various contexts. Specifically, this is an important contribution to AI, particularly in language understanding and knowledge acquisition by machine learning models. Training or fine-tuning a transformer-based LLM, such as GPT-4 [116], BERT [117], T5 [118], typically involves providing the model with a substantial volume of data in the form of input-output pairs.

In this task, our inputs are the preprocessed counterexamples from BMC, and the outputs are human-readable interpretations of those counterexamples. When training an LLM, the model uses the “Scaled Dot-Product Attention” and “Multi-Head Attention” [119]. The attention mechanism allows the model to focus on different parts of the input sequence when producing the output sequence, which is especially useful for translating between complex BMC outputs and human language. Mathematically, the scaled dot-product attention is calculated as:

\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,

(2)

where $Q$ , $K$ , and $V$ are queries, keys, and values, respectively, and $d_{k}$ is the dimension of the queries and keys. This attention function is used in parallel or in “heads”, enabling the model to focus on different features in the input. While scaled dot-product attention is frequently employed during training, it also proves to be highly valuable in the inference phase, showcasing a proficient understanding of BMC counterexamples.

Counterexamples provided by the BMC module often contain important but disconnected information, making them difficult for previous solutions, especially pre-transformer models, to interpret. The transformer’s attention mechanism, specifically the scaled dot-product attention, enhances understanding complex inputs. For instance, consider the counterexample "overflow line 7 function main, ERROR: argv[0]=32768, *mul(y,y)". Interpreting this requires the language model to understand multiple aspects. This counterexample shows an overflow in the variable $y$ during multiplication at line number 7.

The Scaled Dot-Product Attention can focus on different input parts based on their relevance to the current context. In this case, it could identify the link between the overflow error, the mul(y,y) function, and the specific line number mentioned. In other words, it can “attend” to the related information about the overflow error and the associated line of code when recommending an appropriate code fix.

This ability to dynamically allocate attention based on the input’s content is one of the main reasons why transformer-based models such as GPT have succeeded across various tasks, including code debugging and automatic repair. They can understand the context of a given input, including intricate relations between separated segments, enabling them to suggest more accurate and relevant solutions or recommendations.

V Methodoology

ESBMC-AI is an AI-powered platform designed to expedite the detection and repair of critical software components. It employs a BMC tool in the background to identify vulnerabilities using formal verification methods such as abstract interpretation, constraint programming, and symbolic model checking, after which the generated counterexample is provided to the LLM with a specially crafted prompt.

V-A Why ESBMC?

Our ACR methodology can be implemented with various BMC tools. We chose the Efficient SMT-based Context-Bounded Model Checker (ESBMC) [30] to implement our approach towards building self-healing software via LLMs and formal verification methods, illustrated in Figure 1. In particular, we chose ESBMC since it is an efficient software verifier that can solve the highest amount of reachability-safety verification tasks within $10$ seconds time-limit according to SV-COMP 2024 [120]. We note that the selection of a 10-second time limit is not arbitrary. While increasing the time limit could yield improved results, longer processing times are unsuitable for code fixing in a live Integrated Development Environment (IDE) and continuous integration (CI) pipeline. By adhering to this limit, one can apply the proposed approach to such an existing framework and provide nearly real-time feedback to the programmer.

V-B User Chat Mode (UCM) and Fix Code Mode (FCM)

ESBMC-AI currently operates in two distinct modes: the User Chat Mode (UCM) and the Fix Code Mode (FCM). The primary purpose of the UCM mode is to utilize the capabilities of LLMs to simplify and clarify the complex counterexamples of ESBMC for the user. The huge amount of training data that LLMs have been trained with, along with their architecture, allows LLMs to use in-context learning of the counterexample; this allows the LLMs to generate high-quality explanations of a wide variety of problems and counterexamples. In UCM, users can pose questions to ESBMC-AI, such as “How can I correct this code?” or “What is the problematic line of code?”, among others. Based on the output from ESBMC, the LLM engine can offer valuable advice to the user, which may be implemented into the software according to the user’s choice. In this mode, there is no automated code repair. Yet, the suggestions are grounded in the ESBMC output, which tends to be more precise than identifying specific bugs without formal verification methods.

Our primary focus is on FCM, aiming an advanced environment for identifying bugs and performing automatic code repair while ensuring the code remains compilable and retains its original behavior. In this mode, we utilize the well-recognized and industry-adopted ESBMC tool to detect vulnerabilities and leverage LLMs to fix the code. This presents challenges: we require a large and reliable dataset to evaluate our methodology, and human experts must carefully evaluate the applied patches to assess the success of the LLM in code rectification. Specialized prompts for each vulnerability are required to “interpret” the ESBMC counterexamples for an LLM. Human experts with a formal verification and software security background craft these prompts. For example, distinct prompts are required to address dereference failure versus buffer overflow in scanf(). Utilizing a general prompt such as “fix the code based on this counterexample” will significantly reduce accuracy in ACR.

V-C The ESBMC-AI Evaluation Dataset

We need a sufficient number of vulnerable code samples to evaluate the effectiveness of the ESBMC-AI methodology. To showcase the strength of our methodology fully, we must note that not all datasets are suitable for our needs. The samples must be compilable, and the dataset should be labeled with the appropriate vulnerability class. Most available datasets [121, 122, 123, 124] do not cater to at least one of these requirements.

The FormAI [31] dataset comprises $112,000$ AI-generated C programs, with $51.24$ % containing at least one vulnerability. The dataset covers diverse tasks, including complex ones like network management, encryption, table games, and simpler tasks like string manipulation. All C codes are compilable, and every C program in the dataset is labeled using a bounded model checking methodology with a $k=1$ bound parameter. Overall, we created $50,000$ samples for our evaluation. We then classified each program sample using ESBMC 7.6.1 and saved the results. To enhance vulnerability detection in each sample, we transitioned from bounded to unbounded model checking with unlimited k-steps and a $500$ -second timeout. This method strengthens the original approach applied in the FormAI dataset, where classification is based on bounded model checking with a $30$ -second timeframe [31]. This process is very time-consuming and resource-intensive, even for small C programs. We used an Amazon AWS r7i.48xlarge instance with an AMD EPYC 9R14 CPU family featuring 192 vCPUs and 1.5TB of DDR5 RAM to handle this. Once the dataset was prepared and we identified which C samples were vulnerable and which were not, we applied our ESBMC-AI ACR methodology to attempt to fix the vulnerabilities. We randomly selected samples from eight popular vulnerability categories from the FormAI dataset (see Table III) for manual inspection and to verify the correctness of our approach.

VI Experimental Results

This section presents the outcomes of integrating LLMs and BMC in ESBMC-AI, addressing the three research questions proposed at the beginning.

We aim to answer these questions through our experiments and provide an in-depth statistical analysis of the results, offering comprehensive insights into the effectiveness of the ESBMC-AI approach and potential future improvements.

Let us denote all the $50000$ C samples by $\Sigma$ , such that $\Sigma=\{c_{1},c_{2},\ldots,c_{50000}\}$ , where each $c_{i}$ represents an individual sample. The samples can be divided into three primary categories: Verification Successful ( $\mathcal{VS}$ ), Verification Failed ( $\mathcal{VF}$ ), and Verification Unknown ( $\mathcal{VU}$ ). These categories are mutually exclusive, meaning a single sample cannot belong to more than one category. Our main focus is the $\mathcal{VF}$ category, which includes $31801$ samples, indicating that $63.60$ % of the code is vulnerable. The vulnerable samples can also be divided into three main subcategories: dereference failures ( $\mathcal{DF}$ ), arithmetic overflow issues ( $\mathcal{AO}$ ), and buffer overflow issues. The precise distribution of vulnerabilities in our dataset is shown in Table II.

TABLE II: Top 32 Vulnerabilities in the 50000 dataset

Cat	Violation Type	Count (%)
Vulnerability distribution
$\mathcal{DF}$	Dereference failure: NULL pointer	14,700 (23.49%)
$\mathcal{BO}$	Buffer overflow on scanf	13,518 (21.60%)
$\mathcal{DF}$	Dereference failure: forgotten memory	7,681 (12.27%)
$\mathcal{DF}$	Dereference failure: invalid pointer	5,487 (8.77%)
$\mathcal{DF}$	Dereference failure: array bounds violated	4,020 (6.42%)
$\mathcal{AO}$	Arithmetic overflow on add	2,761 (4.41%)
$\mathcal{AO}$	Arithmetic overflow on sub	2,349 (3.75%)
$\mathcal{DF}$	Array bounds violated: upper bound	1,893 (3.02%)
$\mathcal{DF}$	Array bounds violated: lower bound	1,521 (2.43%)
$\mathcal{AO}$	Arithmetic overflow on mul	1,145 (1.83%)
$\mathcal{DF}$	DF: invalidated dynamic object	977 (1.56%)
$\mathcal{BO}$	Buffer overflow on fscanf	961 (1.54%)
$\mathcal{AO}$	Arithmetic overflow on FP ieee_mul	943 (1.51%)
$\mathcal{DF}$	Division by zero	631 (1.01%)
$\mathcal{AO}$	Arithmetic overflow on FP ieee_div	591 (0.94%)
$\mathcal{DF}$	VLA size overflows address space	507 (0.81%)
$\mathcal{BO}$	Buffer overflow on sscanf	498 (0.80%)
$\mathcal{AO}$	Arithmetic overflow on FP ieee_add	497 (0.79%)
$\mathcal{DF}$	DF: Access to object OOB	453 (0.72%)
$\mathcal{AO}$	Arithmetic overflow on FP ieee_sub	297 (0.47%)
$\mathcal{DF}$	File pointer must be valid	234 (0.37%)
$\mathcal{DF}$	DF: accessed expired variable pointer	199 (0.32%)
$\mathcal{AO}$	Arithmetic overflow on shl	170 (0.27%)
$\mathcal{DF}$	DF: write access to string constant	147 (0.23%)
$\mathcal{AO}$	Arithmetic overflow on div	137 (0.22%)
$\mathcal{DF}$	DF: incompatible base type	64 (0.10%)
$\mathcal{DF}$	DF of non-dynamic memory	60 (0.10%)
$\mathcal{DF}$	Free operand must have zero offset	44 (0.07%)
$\mathcal{AO}$	Arithmetic overflow on modulus	41 (0.07%)
$\mathcal{DF}$	Same object violation	34 (0.05%)
$\mathcal{AO}$	Arithmetic overflow on neg	18 (0.03%)
$\mathcal{DF}$	DF: Oversized field offset	7 (0.01%)

Our primary objective is to fix as many programs as possible in each category. Our experiment used GPT-4o as the base LLM model within our ESBMC-AI framework. The formal verification tool ESBMC was invoked with the following flags in the background: --overflow --memory-leak-check --show-stacktrace --timeout 10 --unwind 1 --multi-property --no-unwinding-assertions --verbosity 6. While the Form AI dataset used in this work contains C samples labeled with ESBMC, in a practical scenario, the vulnerability information is often unavailable for a given code. We employ a “real-time formal verification” in the background to identify any vulnerabilities. If a vulnerability is found, the ESBMC-AI framework transforms the counterexample into the appropriate format and generates a corresponding prompt based on a previously created template by human experts. Given LLMs’ sensitivity to prompts [26, 125], we tested many different prompts to effectively incorporate the counterexample (stack traces) into our prompt alongside the original code.

VI-A Experimental result on automated code repair

Comparing the original and suggested code automatically is currently infeasible due to the undecidability of program equivalence [126]. Verifying if the generated code is semantically equivalent to the original is a complex task that existing ACR tools overlook. Manually reviewing all $30,000$ samples for correctness is challenging, so we selected a smaller subset of $1,337$ samples from the most common categories for manual verification. This ensures that the fixes are consistent with the original programs. While most verifications are straightforward, some complex fixes require a more detailed review. To introduce automation, we have incorporated metrics useful for evaluating a patch’s impact, such as changes in the number of lines of code (LOC) and alterations in cyclomatic complexity (CC). Significant deviations in these metrics can indicate that the patch was not successful.

We have categorized the most common vulnerabilities along with their associated CWE numbers. CWE numbers can indicate which vulnerabilities are most prevalent. Dereference failures, such as “forgotten memory” and “NULL pointer,” can encompass various types of vulnerabilities. Assigning appropriate CWEs to these categories helps us determine the most frequent vulnerabilities in real-life projects. We aim to focus on fixing these CWEs with the highest possible accuracy.

Buffer overflow on scanf and fscanf: Buffer overflows on scanf() and fscanf() are among the most common buffer overflow vulnerabilities in applications⁴⁴4https://cwe.mitre.org/top25/archive/2023/2023_stubborn_weaknesses.html. For this type of vulnerability, a buffer overflow occurs when the scanf/fscanf function reads more data than the allocated buffer space, leading to an overwritten adjacent memory. This can cause unpredictable behavior, crashes, or other security vulnerabilities. The primary CWE number for scanf() and fscanf() is CWE-120. The related CWE numbers for scanf() include CWE-20, CWE-121, and CWE-122, which pertain to input validation issues, stack-based buffer overflow, and heap buffer overflow. The associated CWE numbers for fscanf() are CWE-129, CWE-131, and CWE-628, which involve incorrect calculation of buffer size and function calls with incorrectly specified arguments.

For fscanf(), we reviewed $175$ C sample code fixes, of which $160$ were successful, $8$ failed verification, and $7$ had unknown verification results, resulting in a $90.40$ % accuracy rate. We found that in $160$ samples where ESBMC indicated a successful patch, the patches were correct, compilable, and did not alter the original program behavior. Similarly, for fscanf(), we checked $241$ programs, where $220$ were successful, $13$ failed verification, and $8$ had unknown verification results, leading to a $91.29$ % accuracy rate. Cyclomatic complexity (CC) can be a good indicator of how complex a patch is. The average CC for the vulnerable programs using fscanf() is $4.61$ , whereas, for the patched versions, it is $5.62$ . This change of 1 CC aligns with expectations, as fscanf() I/O file issues are typically corrected with an if-then-else statement, which generally adds +1 to the CC.

Dereference Failure: forgotten memory: This issue contains many vulnerabilities associated with various CWEs, such as CWE-825, CWE-401, CWE-404, and CWE-459. Upon reviewing the formal verification output and the patches for this issue, we found that the same issue often emerges on a new line when a dereference failure is patched on a particular line. Therefore, dereference failure issues are widespread and can be problematic to pinpoint as a “one-line” problem. The output of the formal verification typically reveals a chain of errors that lead to a particular line, such as strcpy, memcpy, or other functions not part of the original code. These include files that are not part of the fixation prompt. Thus, a future improvement could be to add these, including the prompts with the original source code and stacktraces, to achieve better accuracy. From $187$ samples, ESBMC-AI achieved a $48.66$ % success rate. No external header or C files were needed to understand the issue within these patches.

TABLE III: Accuracy of fixation after one iteration for different types of vulnerabilities

Original Programs				Patched Programs
Vulnerability Type	Sample size	Avg LOC	Avg CC	$\mathcal{VS}$	$\mathcal{VF}$	$\mathcal{VU}$	Avg CC	Accuracy	Patches Verified
Array bounds violation (lower bound)	182	79.56	6.72	174	4	4	8.35	95.60%	✔
Buffer overflow on fscanf (I/O error)	241	74.95	4.61	220	13	8	5.62	91.29%	✔
Buffer overflow on scanf	175	78.92	6.91	160	8	7	8.30	90.40%	✔
Division by zero	133	73.52	3.77	115	8	10	4.42	86.47%	✔
Dereferene Failure: NULL pointer	229	78.05	5.44	184	40	5	7.70	80.35%	✔
Arithmetic overflow on add	73	74.9	4.45	52	16	5	5.17	70.27%	✔
Dereference Failure: forgotten memory	187	79.70	5.53	91	83	13	6.49	48.66%	✔
Array bounds violation (upper bound)	117	81.69	5.74	48	65	4	6.59	41.03%	✔

Array bounds violations: Surprisingly, there was a significant difference in the accuracy of fixing array-bound violations. Upper-bound violations achieved an impressive $95.60$ % success rate, while lower-bound violations had a relatively low success rate of $41.03$ %. Upon careful review, we identified that lower-bound errors are easier to fix and do not require complex calculations by an LLM. These errors are usually associated with user input reading when null terminator characters are removed (See Listing III).

{listing}

[t]

Array bound violation fix (lower bound)

Contrary to expectations, when fixing upper bound violations, LLMs (including GPT-4, Gemini-Pro, and others) often try to correct the code by adding +1 to the variable. However, this approach usually fails to eliminate the bug. Simply increasing the upper bound by one still leaves the same issue with the buffer size, as shown by the formal verification output of the original and patched code in Listing VI-A. In the original code, we have (signed long int)bytes_received < 80 upper bound violations, and in the fixation, we still have the same issue but with an increased value (signed long int)bytes_received < 81.

{listing}

[t]

Wrong fix: Array bounds violation (upper bound)

Division by zero: The division by zero vulnerability, identified by CWE-369 and associated with CWE-691 (Insufficient Control Flow Management), is quite common in applications. We manually verified a total of $133$ samples. Of these, $115$ fixations were successful, $8$ failed verification and $10$ had unknown verification results. This results in an accuracy rate of $86.47$ % for fixing division by zero vulnerabilities.

Arithmetic overflow on add: Here, we achieved a modest accuracy of $70.27$ % from $73$ samples since fixing an addition overflow often introduces a new overflow. Consider the following interesting example:

X=(A+B)\times 1000

(3)

If the overflow on addition is patched correctly by handling variables $A$ and $B$ , a new issue, such as a floating-point IEEE multiplication overflow, could emerge. Our methodology fixes one code issue at a time, as addressing multiple issues in a single iteration can reduce the model’s accuracy due to biased attention, particularly with arithmetic overflows. Therefore, achieving higher accuracy often requires more than one iteration for most overflow issues. However, by fixing the arithmetic overflow in the first iteration and the floating-point IEEE multiplication overflow in the second iteration, an accuracy of 88% can be achieved on the same samples.

Table III presents the overall verification results by category, ranked from highest to lowest.

VI-B LLMs fixation without external help

In the ESBMC-AI framework, a key component is supporting the LLMs with formal verification proof from external sources. This approach significantly enhances the accuracy of the fixes and guides the LLMs in the right direction. Without the exact counterexamples and stack traces, LLMs can fix the issues with approximately $31-37$ % accuracy, compared to $80$ % to $90$ % accuracy with ESBMC output. This demonstrates the effectiveness of our methodology and the external boost provided by formal verification. In certain cases, LLMs suggest that specific errors are present in C code, even though this may not be true. Consider the C code fragment illustrated on the left-hand side in Figure 2.

Figure 2: The actual vulnerability may be overlooked by an LLM when misleading function names are used.

The model generates various recommendations to resolve the problem, including removing the embedded secret password, questioning the validity of the MD5 function, and highlighting the insecurity of MD5. However, it failed to recognize the actual issue: an arithmetic overflow. Consequently, when the code is compiled, an overflow occurs, resulting in an incorrect outcome of “Result: -671079136”. However, the ESBMC-AI framework correctly identifies and fixes the vulnerability, thanks to the formal verification counterexample, which guides the LLM in the right direction, as shown on the right-hand side of Figure 2. Without this guidance, even after $10$ attempts, the most advanced model still incorrectly identifies issues such as MD5 cryptographic problems or other errors in the code, which is not true in our case. The code does not use MD5 or include an embedded secret password. These examples demonstrate how LLMs can face challenges when accurately calculating arithmetic operations or identifying vulnerable code without external assistance.

VI-C Threats to the Validity

ESBMC-AI heavily relies on the language model’s understanding of code semantics, which may not always align perfectly with the program’s intended behavior. This can lead to the generation of repairs that, although syntactically valid, do not effectively address the underlying bugs or even introduce new issues. Such incorrect repairs can impact the overall accuracy and reliability of the framework’s performance evaluation, potentially undermining its effectiveness in real-world scenarios. Moreover, since LLMs are off-the-shelf products prone to hallucinations and lack explainability, there is an added layer of uncertainty in the generated solutions. This highlights the critical need for incorporating mechanisms that enhance the interpretability and reliability of LLMs within the ESBMC-AI framework to ensure robust and trustworthy code repair in practice.

VII Conclusions and Future Work

This paper introduces a novel framework for automated code repair that leverages the power of Large Language Models and Bounded Model Checking techniques. Our proposed ESBMC-AI framework demonstrates significant advancements over existing works in the field by effectively utilizing feedback from the BMC engine to enhance program repair capabilities. Our evaluation of ESBMC-AI on randomly selected samples from the five most frequent vulnerabilities in the FormAI dataset reveals varying accuracy for fixing different vulnerabilities.

The results indicate that in a single iteration, over 90% accuracy can be achieved for buffer overflow on scanf/fscanf and array bounds violations (lower bound). More than 80% accuracy is attainable for division by zero and dereference failure: NULL pointer. These vulnerabilities cover the top 75% of the most frequent CWEs. Array bounds violations (upper bound) and dereference failure: forgotten memory, are the most challenging issues to fix using this method as they often involve other vulnerabilities or external headers or C files.

Integrating LLMs and formal verification into automatic code repair is a promising future research direction. The power of these models, coupled with appropriate prompts and feedback mechanisms, enables more effective and intelligent code repair. However, addressing potential limitations and challenges is crucial, such as the need for massive computational resources and the possible introduction of unintended vulnerabilities or overfitting specific code patterns. We believe further advancements in this area will continue revolutionizing software development practices by enabling faster and more accurate bug fixes, ultimately enhancing software reliability, productivity, and security.

We have released our tool and methodology on our project webpage, and ESBMC-AI 0.5.1 is now available as a PyPI module.This makes ESBMC-AI one of the few tools that can effectively support real-world projects, harnessing the full power of formal verification methods.

References

[1] S. Zaman, B. Adams, and A. E. Hassan, “Security versus performance bugs: a case study on firefox,” in MSR, 2011, pp. 93–102.
[2] W. Zhang, C. Sun, and S. Lu, “Conmem: detecting severe concurrency bugs through an effect-oriented approach,” ACM Sigplan Notices, vol. 45, no. 3, pp. 179–192, 2010.
[3] G. Bajwa, M. Fazeen, R. Dantu, and S. Tanpure, “Unintentional bugs to vulnerability map** in android applications,” in 2015 IEEE ISI, pp. 176–178.
[4] E. Dustin, J. Rashka, and J. Paul, Automated software testing: Introduction,management, and performance. Addison-Wesley Professional, 1999.
[5] P. Godefroid, P. de Halleux, A. V. Nori, S. K. Rajamani, W. Schulte, N. Tillmann, and M. Y. Levin, “Automating software testing using program analysis,” IEEE Software, vol. 25, no. 5, pp. 30–37, 2008.
[6] M. Aldughaim, K. M. Alshmrany, M. R. Gadelha, R. de Freitas, and L. C. Cordeiro, “Fusebmc_ia: Interval analysis and methods for test case generation - (competition contribution),” in 26th FASE, ser. LNCS, vol. 13991. Springer, 2023, pp. 324–329.
[7] E. H. da S. Alves, L. C. Cordeiro, and E. B. de Lima Filho, “A method to localize faults in concurrent C programs,” JSS, vol. 132, pp. 336–352, 2017.
[8] C. L. Goues, M. Pradel, and A. Roychoudhury, “Automated program repair,” Communications of the ACM, vol. 62, no. 12, pp. 56–65, 2019.
[9] C. Sadowski and J. Yi, “How developers use data race detection tools,” in ACM PLATEAU, 2014, pp. 43–51.
[10] M. Y. R. Gadelha, E. Steffinlongo, L. C. Cordeiro, B. Fischer, and D. A. Nicole, “Smt-based refutation of spurious bug reports in the clang static analyzer,” in 41st ICSE: Companion Proceedings, J. M. Atlee, T. Bultan, and J. Whittle, Eds. IEEE/ACM, 2019, pp. 11–14.
[11] M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” in IEEE/ACM ASE, 2016, pp. 87–98.
[12] G. Zhao and J. Huang, “Deepsim: deep learning code functional similarity,” in ACM FSE, 2018, pp. 141–151.
[13] R. Gupta, S. Pal, A. Kanade, and S. Shevade, “Deepfix: Fixing common c language errors by deep learning,” in AAAI, vol. 31, no. 1, 2017.
[14] Y. Li, S. Wang, and T. N. Nguyen, “Dlfix: Context-based code transformation learning for automated program repair,” in ACM/IEEE ICSE, 2020, pp. 602–614.
[15] M. White, M. Tufano, M. Martinez, M. Monperrus, and D. Poshyvanyk, “Sting and transforming program repair ingredients via deep learning code similarities,” in SANER. IEEE, 2019, pp. 479–490.
[16] N. Jiang, T. Lutellier, and L. Tan, “Cure: Code-aware neural machine translation for automatic program repair,” in 2021 IEEE/ACM ICSE, 2021, pp. 1161–1173.
[17] Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, and M. Monperrus, “Sequencer: Sequence-to-sequence learning for end-to-end program repair,” IEEE TSE, vol. 47, no. 9, pp. 1943–1959, 2019.
[18] T. Lutellier, H. V. Pham, L. Pang, Y. Li, M. Wei, and L. Tan, “Coconut: combining context-aware neural translation models using ensemble for program repair,” in ACM SIGSOFT ISSTA, 2020, pp. 101–114.
[19] Y. Li, S. Wang, and T. N. Nguyen, “Dear: A novel deep learning-based approach for automated program repair,” in ICSE, 2022, pp. 511–523.
[20] M. Chen et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374, 2021. [Online]. Available: http://arxiv.longhoe.net/abs/2107.03374
[21] J. A. Prenner and R. Robbes, “Automatic program repair with openai’s codex: Evaluating quixbugs,” arXiv preprint arXiv:2111.03922, 2021.
[22] Z. Fan, X. Gao, A. Roychoudhury, and S. H. Tan, “Improving automatically generated code from codex via automated program repair,” arXiv preprint arXiv:2205.10583, 2022.
[23] M. **, S. Shahriar, M. Tufano, X. Shi, S. Lu, N. Sundaresan, and A. Svyatkovskiy, “Inferfix: End-to-end program repair with llms,” arXiv preprint arXiv:2303.07263, 2023.
[24] C. S. Xia, Y. Wei, and L. Zhang, “Practical program repair in the era of large pre-trained language models,” in ICSE. IEEE/ACM, 2023.
[25] H. Pearce, B. Tan, B. Ahmad, R. Karri, and B. Dolan-Gavitt, “Examining zero-shot vulnerability repair with large language models,” in SP. IEEE, 2022, pp. 1–18.
[26] J. Cao, M. Li, M. Wen, and S.-c. Cheung, “A study on prompt design, advantages and limitations of chatgpt for deep learning program repair,” arXiv:2304.08191, 2023.
[27] C. S. Xia, Y. Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in 2023 IEEE/ACM ICSE, 2023.
[28] L. Xin and C. Wandong, “A program vulnerabilities detection frame by static code analysis and model checking,” in 2011 IEEE ICCSN, 2011, pp. 130–134.
[29] Y. Sui, D. Ye, and J. Xue, “Detecting memory leaks statically with full-sparse value-flow analysis,” IEEE TSE, vol. 40, no. 2, pp. 107–122, 2014.
[30] M. R. Gadelha, F. R. Monteiro, J. Morse, L. C. Cordeiro, B. Fischer, and D. A. Nicole, “Esbmc 5.0: an industrial-strength c model checker,” in ACM/IEEE ASE, 2018, pp. 888–891.
[31] N. Tihanyi, T. Bisztray, R. Jain, M. A. Ferrag, L. C. Cordeiro, and V. Mavroeidis, “The formai dataset: Generative AI in software security through the lens of formal verification,” in PROMISE. ACM, 2023, pp. 33–43.
[32] A. M. Turing, “On computable numbers, with an application to the entscheidungsproblem,” London Mathematical Society, vol. s2-42, no. 1, pp. 230–265, 1936.
[33] M. Sipser, Introduction to the Theory of Computation. Cengage Learning, 2012.
[34] Z. Fan, X. Gao, A. Roychoudhury, and S. H. Tan, “Automated repair of programs from large language models,” arXiv preprint arXiv:2205.10583, 2022.
[35] A. Strasser, “On pitfalls (and advantages) of sophisticated large language models,” 2023.
[36] S. Imani, L. Du, and H. Shrivastava, “Mathprompter: Mathematical reasoning using large language models,” arXiv preprint arXiv:2303.05398, 2023.
[37] OpenAI, “Hello gpt-4o,” https://openai.com/index/hello-gpt-4o/, 2024, accessed: 2024-05-26.
[38] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, É. Goffinet, D. Hesslow, J. Launay, Q. Malartic et al., “The falcon series of open language models,” arXiv preprint arXiv:2311.16867, 2023.
[39] L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in TACAS. Springer, 2008, pp. 337–340.
[40] R. Brummayer and A. Biere, “Boolector: An efficient smt solver for bit-vectors and arrays,” in TACAS. Springer, 2009, pp. 174–177.
[41] H. Barbosa, C. Barrett, M. Brain, G. Kremer, H. Lachnitt, M. Mann, A. Mohamed, M. Mohamed, A. Niemetz, A. Nötzli et al., “cvc5: A versatile and industrial-strength smt solver,” in TACAS. Springer, 2022, pp. 415–442.
[42] Q. Gao, S. Ma, S. Shao, Y. Sui, G. Zhao, L. Ma, X. Ma, F. Duan, X. Deng, S. Zhang et al., “Cobot: static c/c++ bug detection in the presence of incomplete code,” in ICPC, 2018, pp. 385–388.
[43] I. Medeiros, N. Neves, and M. Correia, “Dekant: a static analysis tool that learns to detect web application vulnerabilities,” in ISSTA, 2016, pp. 1–11.
[44] H. Liang, L. Wang, D. Wu, and J. Xu, “Mlsa: a static bugs analysis tool based on llvm ir,” in IEEE/ACIS SNPD, 2016, pp. 407–412.
[45] R. Jain, R. Purandare, and S. Sharma, “Bird: Race detection in software binaries under relaxed memory models,” ACM TOSEM, vol. 31, no. 4, pp. 1–29, 2022.
[46] H. Chen, S. Guo, Y. Xue, Y. Sui, C. Zhang, Y. Li, H. Wang, and Y. Liu, “ $\{$ MUZZ $\}$ : Thread-aware grey-box fuzzing for effective bug hunting in multithreaded programs,” in USENIX Security Symposium, 2020, pp. 2325–2342.
[47] D. R. Jeong, K. Kim, B. Shivakumar, B. Lee, and I. Shin, “Razzer: Finding kernel race bugs through fuzzing,” in 2019 IEEE SP. IEEE, 2019, pp. 754–768.
[48] C. Sadowski, J. Van Gogh, C. Jaspan, E. Soderberg, and C. Winter, “Tricorder: Building a program analysis ecosystem,” in IEEE/ACM ICSE, vol. 1, 2015, pp. 598–608.
[49] E. Aftandilian, R. Sauciuc, S. Priya, and S. Krishnan, “Building useful program analysis tools using an extensible java compiler,” in 2012 IEEE SCAM, 2012, pp. 14–23.
[50] C. Calcagno, D. Distefano, J. Dubreil, D. Gabi, P. Hooimeijer, M. Luca, P. O’Hearn, I. Papakonstantinou, J. Purbrick, and D. Rodriguez, “Moving fast with software verification,” in NASA Formal Methods Symposium. Springer, 2015, pp. 3–11.
[51] R. Ma, Z. Jian, G. Chen, K. Ma, and Y. Chen, “Rejection: A ast-based reentrancy vulnerability detection method,” in Trusted Computing and Information Security: 13th Chinese Conference, CTCIS 2019. Springer, 2020, pp. 58–71.
[52] L. Sampaio and A. Garcia, “Exploring context-sensitive data flow analysis for early vulnerability detection,” JSS, vol. 113, pp. 337–361, 2016.
[53] Z. Guo, T. Tan, S. Liu, X. Liu, W. Lai, Y. Yang, Y. Li, L. Chen, W. Dong, and Y. Zhou, “Mitigating false positive static analysis warnings: Progress, challenges, and opportunities,” IEEE TSE, 2023.
[54] T. Lee, S. Wi, S. Lee, and S. Son, “Fuse: Finding file upload bugs via penetration testing.” in NDSS, 2020.
[55] R. Li, D. Abendroth, X. Lin, Y. Guo, H.-W. Baek, E. Eide, R. Ricci, and J. Van der Merwe, “Potassium: penetration testing as a service,” in ACM SoCC, 2015, pp. 30–42.
[56] M. I. P. Salas and E. Martins, “A black-box approach to detect vulnerabilities in web services using penetration testing,” IEEE Latin America Transactions, vol. 13, no. 3, pp. 707–712, 2015.
[57] M. Arnold, M. Vechev, and E. Yahav, “Qvm: An efficient runtime for detecting defects in deployed systems,” ACM TOSEM, vol. 21, no. 1, pp. 1–35, 2011.
[58] S. Varvaressos, K. Lavoie, S. Gaboury, and S. Hallé, “Automated bug finding in video games: A case study for runtime monitoring,” Computers in Entertainment (CIE), vol. 15, no. 1, pp. 1–28, 2017.
[59] S. A. Asadollah, D. Sundmark, S. Eldh, and H. Hansson, “A runtime verification tool for detecting concurrency bugs in freertos embedded software,” in ISPDC. IEEE, 2018, pp. 172–179.
[60] Y. Smaragdakis and C. Csallner, “Combining static and dynamic reasoning for bug detection,” in International Conference on Tests and Proofs. Springer, 2007, pp. 1–16.
[61] A. Bhayat, L. Cordeiro, G. Reger, F. Shmarov, K. Korovin, T. Melham, K. Alshamrany, M. A Mustafa, and P. Olivier, “Towards a hybrid approach to protect against memory safety vulnerabilities,” 2021.
[62] F. K. Aljaafari, R. Menezes, E. Manino, F. Shmarov, M. A. Mustafa, and L. C. Cordeiro, “Combining bmc and fuzzing techniques for finding software vulnerabilities in concurrent programs,” IEEE Access, vol. 10, pp. 121 365–121 384, 2022.
[63] K. Song, N. Matulevicius, E. B. de Lima Filho, and L. C. Cordeiro, “Esbmc-solidity: an smt-based model checker for solidity smart contracts,” in ACM/IEEE ICSE: Companion Proceedings, 2022, pp. 65–69.
[64] K. M. Alshmrany, M. Aldughaim, A. Bhayat, and L. C. Cordeiro, “Fusebmc v4: Smart seed generation for hybrid fuzzing: (competition contribution),” in FASE. Springer, 2022, pp. 336–340.
[65] Z. Chen, S. Kommrusch, and M. Monperrus, “Neural transfer learning for repairing security vulnerabilities in c code,” IEEE TSE, vol. 49, no. 1, pp. 147–165, 2023.
[66] J. Bader, A. Scott, M. Pradel, and S. Chandra, “Getafix: Learning to fix bugs automatically,” PACMPL, vol. 3, no. OOPSLA, pp. 1–27, 2019.
[67] Q. Zhu, Z. Sun, Y.-a. Xiao, W. Zhang, K. Yuan, Y. Xiong, and L. Zhang, “A syntax-guided edit decoder for neural program repair,” in ACM FSE, 2021, pp. 341–353.
[68] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” NeurIPS, vol. 27, 2014.
[69] K. Huang, X. Meng, J. Zhang, Y. Liu, W. Wang, S. Li, and Y. Zhang, “An empirical study on fine-tuning large language models of code for automated program repair,” in IEEE/ACM ASE, 2023, pp. 1162–1174.
[70] A. Rajasekharan, Y. Zeng, P. Padalkar, and G. Gupta, “Reliable natural language understanding with large language models and answer set programming,” arXiv preprint arXiv:2302.03780, 2023.
[71] Y. Ge, W. Hua, J. Ji, J. Tan, S. Xu, and Y. Zhang, “Openagi: When llm meets domain experts,” arXiv preprint arXiv:2304.04370, 2023.
[72] D. Huang, Q. Bu, J. M. Zhang, M. Luck, and H. Cui, “Agentcoder: Multi-agent-based code generation with iterative testing and optimisation,” arXiv preprint arXiv:2312.13010, 2023.
[73] S. Chakraborty, T. Ahmed, Y. Ding, P. T. Devanbu, and B. Ray, “Natgen: generative pre-training by “naturalizing” source code,” in ACM FSE, 2022, pp. 18–30.
[74] A. M. Dakhel, V. Majdinasab, A. Nikanjam, F. Khomh, M. C. Desmarais, and Z. M. Jiang, “Github copilot ai pair programmer: Asset or liability?” JSS, p. 111734, 2023.
[75] C. S. Xia and L. Zhang, “Less training, more repairing please: revisiting automated program repair via zero-shot learning,” in ACM FSE, 2022, pp. 959–971.
[76] N. Jain, S. Vaidyanath, A. Iyer, N. Natarajan, S. Parthasarathy, S. Rajamani, and R. Sharma, “Jigsaw: Large language models meet program synthesis,” in ICSE, 2022, pp. 1219–1231.
[77] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of the automatic bug fixing performance of chatgpt,” arXiv preprint arXiv:2301.08653, 2023.
[78] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama, “Quixbugs: A multi-lingual program repair benchmark set based on the quixey challenge,” in Companion of the ACM SIGPLAN SPLASH, 2017, pp. 55–56.
[79] M. Lajkó, D. Horváth, V. Csuvik, and L. Vidács, “Fine-tuning gpt-2 to patch programs, is it worth it?” in ICCSA. Springer, 2022, pp. 79–91.
[80] M. Lajkó, V. Csuvik, and L. Vidács, “Towards javascript program repair with generative pre-trained transformer (gpt-2),” in ACM APR, 2022, pp. 61–68.
[81] K. Zhang, Z. Li, J. Li, G. Li, and Z. **, “Self-edit: Fault-aware code editor for code generation,” arXiv preprint arXiv:2305.04087, 2023.
[82] I. Bouzenia, P. Devanbu, and M. Pradel, “Repairagent: An autonomous, llm-based agent for program repair,” arXiv preprint arXiv:2403.17134, 2024.
[83] N. T. Islam, J. Khoury, A. Seong, G. D. L. T. Parra, E. Bou-Harb, and P. Najafirad, “LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward,” arXiv preprint arXiv:2401.03374, 2024.
[84] B. Yang and et al., “Multi-objective fine-tuning for enhanced program repair with llms,” arXiv preprint arXiv:2404.12636, 2024.
[85] J. Liu, C. S. Xia, Y. Wang, and L. Zhang, “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” arXiv preprint arXiv:2305.01210, 2023.
[86] H. Tian, W. Lu, T. O. Li, X. Tang, S.-C. Cheung, J. Klein, and T. F. Bissyandé, “Is chatgpt the ultimate programming assistant–how far is it?” arXiv preprint arXiv:2304.11938, 2023.
[87] W. Ma, S. Liu, W. Wang, Q. Hu, Y. Liu, C. Zhang, L. Nie, and Y. Liu, “The scope of chatgpt in software engineering: A thorough investigation,” arXiv preprint arXiv:2305.12138, 2023.
[88] J. Liu and et al., “Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation,” in NeurIPS, 2024.
[89] C. S. Xia and L. Zhang, “Conversational automated program repair,” arXiv preprint arXiv:2301.13246, 2023.
[90] X. Wang, Y. Wang, Y. Wan, F. Mi, Y. Li, P. Zhou, J. Liu, H. Wu, X. Jiang, and Q. Liu, “Compilable neural code generation with compiler feedback,” arXiv preprint arXiv:2203.05132, 2022.
[91] D. Hidvégi, K. Etemadi, S. Bobadilla, and M. Monperrus, “Cigar: Cost-efficient program repair with llms,” arXiv preprint arXiv:2402.06598, 2024.
[92] J. Xu and et al., “Aligning llms for fl-free program repair,” arXiv preprint arXiv:2404.08877, 2024.
[93] Z. Liu and et al., “LLM-CompDroid: Repairing Configuration Compatibility Bugs in Android Apps with Pre-trained Large Language Models,” arXiv preprint arXiv:2402.15078, 2024.
[94] H. Joshi, J. C. Sanchez, S. Gulwani, V. Le, G. Verbruggen, and I. Radiček, “Repair is nearly generation: Multilingual program repair with llms,” in AAAI, vol. 37, no. 4, 2023, pp. 5131–5140.
[95] C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using chatgpt,” arXiv preprint arXiv:2304.00385, 2023.
[96] D. Beyer, “Software verification: 10th comparative evaluation (sv-comp 2021),” in 27th TACAS. Springer, 2021, pp. 401–422.
[97] D. Beyer, T. A. Henzinger, R. Majumdar, and A. Rybalchenko, “Path invariants,” in ACM SIGPLAN PLDI, 2007, pp. 300–309.
[98] S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., “Codexglue: A machine learning benchmark dataset for code understanding and generation,” arXiv preprint arXiv:2102.04664, 2021.
[99] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “Codesearchnet challenge: Evaluating the state of semantic code search,” arXiv preprint arXiv:1909.09436, 2019.
[100] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in ISSTA, 2014, pp. 437–440.
[101] X. Gao, B. Wang, G. J. Duck, R. Ji, Y. Xiong, and A. Roychoudhury, “Beyond tests: Program vulnerability repair via crash constraint extraction,” ACM TOSEM, vol. 30, no. 2, pp. 1–27, 2021.
[102] M. Yasunaga and P. Liang, “Break-it-fix-it: Unsupervised learning for program repair,” in ICML. PMLR, 2021, pp. 11 941–11 952.
[103] R. Bavishi, H. Joshi, J. Cambronero, A. Fariha, S. Gulwani, V. Le, I. Radiček, and A. Tiwari, “Neurosymbolic repair for low-code formula languages,” PACMPL, vol. 6, no. OOPSLA2, pp. 1093–1122, 2022.
[104] B. Berabi, J. He, V. Raychev, and M. Vechev, “Tfix: Learning to fix coding errors with a text-to-text transformer,” in ICML. PMLR, 2021, pp. 780–791.
[105] Y. Deng, C. S. Xia, C. Yang, S. D. Zhang, S. Yang, and L. Zhang, “Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt,” 2023.
[106] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song et al., “Measuring coding challenge competence with apps,” arXiv preprint arXiv:2105.09938, 2021.
[107] H. Huang, C. Xu, M. Wen, Y. Liu, and S.-C. Cheung, “Conffix: Repairing configuration compatibility issues in android apps,” in ACM SIGSOFT ISSTA, 2023, pp. 514–525.
[108] J. Kong, M. Cheng, X. Xie, S. Liu, X. Du, and Q. Guo, “Contrastrepair: Enhancing conversation-based automated program repair via contrastive test case pairs,” arXiv preprint arXiv:2403.01971, 2024.
[109] L. C. Cordeiro, B. Fischer, and J. Marques-Silva, “Smt-based bounded model checking for embedded ANSI-C software,” IEEE TSE, vol. 38, no. 4, pp. 957–974, 2012.
[110] L. C. Cordeiro, D. Kroening, and P. Schrammel, “JBMC: bounded model checking for java bytecode - (competition contribution),” in TACAS, ser. LNCS, vol. 11429. Springer, 2019, pp. 219–223.
[111] K. Song, N. Matulevicius, E. B. de Lima Filho, and L. C. Cordeiro, “Esbmc-solidity: An smt-based model checker for solidity smart contracts,” in IEEE/ACM ICSE, 2022, pp. 65–69.
[112] F. R. Monteiro, M. R. Gadelha, and L. C. Cordeiro, “Model checking C++ programs,” Softw. Test. Verification Reliab., vol. 32, no. 1, 2022.
[113] A. V. Aho, M. S. Lam, R. Sethi, and J. D. Ullman, Compilers: Principles, Techniques, And Tools, 2nd ed. Addison-Wesley Longman Publishing Co., Inc., 2006.
[114] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” NeurIPS, vol. 33, pp. 1877–1901, 2020.
[115] A. K. Lampinen, I. Dasgupta, S. C. Chan, K. Matthewson, M. H. Tessler, A. Creswell, J. L. McClelland, J. X. Wang, and F. Hill, “Can language models learn from explanations in context?” arXiv preprint arXiv:2204.02329, 2022.
[116] OpenAI, “Gpt-4,” 2023, accessed May 17, 2023. https://openai.com/research/gpt-4.
[117] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[118] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” JMLR, vol. 21, no. 1, pp. 5485–5551, 2020.
[119] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” NeurIPS, vol. 30, 2017.
[120] D. Beyer, “State of the art in software verification and witness validation: Sv-comp 2024,” in TACAS. Springer, 2024, pp. 299–329.
[121] P. E. Black, “A software assurance reference dataset: Thousands of programs with known bugs,” Journal of Research NIST, vol. 123, p. 1, 2018.
[122] J. Fan, Y. Li, S. Wang, and T. N. Nguyen, “A c/c++ code vulnerability dataset with code changes and cve summaries,” in MSR 2020. ACM, 2020, p. 508–512.
[123] Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks,” NeurIPS, vol. 32, 2019.
[124] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vulnerability detection: Are we there yet,” IEEE TSE, 2021.
[125] J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,” arXiv preprint arXiv:2302.11382, 2023.
[126] R. Goldblatt and M. Jackson, “Well-structured program equivalence is highly undecidable,” ACM TOCL, vol. 13, no. 3, pp. 1–8, 2012.