-
Code Agents are State of the Art Software Testers
Authors:
Niels Mündler,
Mark Niklas Müller,
**gxuan He,
Martin Vechev
Abstract:
Rigorous software testing is crucial for develo** and maintaining high-quality code, making automated test generation a promising avenue for both improving software quality and boosting the effectiveness of code generation methods. However, while code generation with Large Language Models (LLMs) is an extraordinarily active research area, test generation remains relatively unexplored. We address…
▽ More
Rigorous software testing is crucial for develo** and maintaining high-quality code, making automated test generation a promising avenue for both improving software quality and boosting the effectiveness of code generation methods. However, while code generation with Large Language Models (LLMs) is an extraordinarily active research area, test generation remains relatively unexplored. We address this gap and investigate the capability of LLM-based Code Agents for formalizing user issues into test cases. To this end, we propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth patches, and golden tests. We find that LLMs generally perform surprisingly well at generating relevant test cases with Code Agents designed for code repair exceeding the performance of systems designed specifically for test generation. Further, as test generation is a similar but more structured task than code generation, it allows for a more fine-grained analysis using fail-to-pass rate and coverage metrics, providing a dual metric for analyzing systems designed for code repair. Finally, we find that generated tests are an effective filter for proposed code fixes, doubling the precision of SWE-Agent.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation
Authors:
Niels Mündler,
**gxuan He,
Slobodan Jenko,
Martin Vechev
Abstract:
Large language models (large LMs) are susceptible to producing text that contains hallucinated content. An important instance of this problem is self-contradiction, where the LM generates two contradictory sentences within the same context. In this work, we present a comprehensive investigation into self-contradiction for various instruction-tuned LMs, covering evaluation, detection, and mitigatio…
▽ More
Large language models (large LMs) are susceptible to producing text that contains hallucinated content. An important instance of this problem is self-contradiction, where the LM generates two contradictory sentences within the same context. In this work, we present a comprehensive investigation into self-contradiction for various instruction-tuned LMs, covering evaluation, detection, and mitigation. Our primary evaluation task is open-domain text generation, but we also demonstrate the applicability of our approach to shorter question answering. Our analysis reveals the prevalence of self-contradictions, e.g., in 17.7% of all sentences produced by ChatGPT. We then propose a novel prompting-based framework designed to effectively detect and mitigate self-contradictions. Our detector achieves high accuracy, e.g., around 80% F1 score when prompting ChatGPT. The mitigation algorithm iteratively refines the generated text to remove contradictory information while preserving text fluency and informativeness. Importantly, our entire framework is applicable to black-box LMs and does not require retrieval of external knowledge. Rather, our method complements retrieval-based methods, as a large portion of self-contradictions (e.g., 35.2% for ChatGPT) cannot be verified using online text. Our approach is practically effective and has been released as a push-button tool to benefit the public at https://chatprotect.ai/.
△ Less
Submitted 15 March, 2024; v1 submitted 25 May, 2023;
originally announced May 2023.
-
A Verified Implementation of B+-Trees in Isabelle/HOL
Authors:
Niels Mündler,
Tobias Nipkow
Abstract:
In this paper we present the verification of an imperative implementation of the ubiquitous B+-tree data structure in the interactive theorem prover Isabelle/HOL. The implementation supports membership test, insertion and range queries with efficient binary search for intra-node navigation. The imperative implementation is verified in two steps: an abstract set interface is refined to an executabl…
▽ More
In this paper we present the verification of an imperative implementation of the ubiquitous B+-tree data structure in the interactive theorem prover Isabelle/HOL. The implementation supports membership test, insertion and range queries with efficient binary search for intra-node navigation. The imperative implementation is verified in two steps: an abstract set interface is refined to an executable but inefficient purely functional implementation which is further refined to the efficient imperative implementation.
△ Less
Submitted 18 August, 2022;
originally announced August 2022.
-
Association rule mining and itemset-correlation based variants
Authors:
Niels Mündler
Abstract:
Association rules express implication formed relations among attributes in databases of itemsets. The apriori algorithm is presented, the basis for most association rule mining algorithms. It works by pruning away rules that need not be evaluated based on the user specified minimum support confidence. Additionally, variations of the algorithm are presented that enable it to handle quantitative att…
▽ More
Association rules express implication formed relations among attributes in databases of itemsets. The apriori algorithm is presented, the basis for most association rule mining algorithms. It works by pruning away rules that need not be evaluated based on the user specified minimum support confidence. Additionally, variations of the algorithm are presented that enable it to handle quantitative attributes and to extract rules about generalizations of items, but preserve the downward closure property that enables pruning. Intertransformation of the extensions is proposed for special cases.
△ Less
Submitted 22 July, 2019;
originally announced July 2019.