-
CrashJS: A NodeJS Benchmark for Automated Crash Reproduction
Authors:
Philip Oliver,
Jens Dietrich,
Craig Anslow,
Michael Homer
Abstract:
Software bugs often lead to software crashes, which cost US companies upwards of $2.08 trillion annually. Automated Crash Reproduction (ACR) aims to generate unit tests that successfully reproduce a crash. The goal of ACR is to aid developers with debugging, providing them with another tool to locate where a bug is in a program. The main approach ACR currently takes is to replicate a stack trace from an error thrown within a program. Currently, ACR has been developed for C, Java, and Python, but there are no tools targeting JavaScript programs. To aid the development of JavaScript ACR tools, we propose CrashJS: a benchmark dataset of 453 Node.js crashes from several sources. CrashJS includes a mix of real-world and synthesised tests, multiple projects, and different levels of complexity for both crashes and target programs.
Submitted 9 May, 2024;
originally announced May 2024.
-
On the Security Blind Spots of Software Composition Analysis
Authors:
Jens Dietrich,
Shawn Rasheed,
Alexander Jordan,
Tim White
Abstract:
Modern software heavily relies on the use of components. Those components are usually published in central repositories, and managed by build systems via dependencies. Due to issues around vulnerabilities, licenses and the propagation of bugs, the study of those dependencies is of utmost importance, and numerous software composition analysis tools have emerged for this purpose. A particular challenge is posed by hidden dependencies that result from cloning or shading, where code from a component is "inlined" and, in the case of shading, moved to different namespaces.
We present a novel approach to detect vulnerable clones in the Maven repository. Our approach is lightweight in that it does not require the creation and maintenance of a custom index. Starting with 29 vulnerabilities with assigned CVEs and proof-of-vulnerability projects, we retrieve over 53k potential vulnerable clones from Maven Central. After running our analysis on this set, we detect 727 confirmed vulnerable clones (86 if versions are aggregated) and synthesize a testable proof-of-vulnerability project for each of those. We demonstrate that existing SCA tools often miss those exposures. At the time of submission those results have led to changes to the entries for six CVEs in the GitHub Security Advisory Database (GHSA) via accepted pull requests, with more pending.
Submitted 9 October, 2023; v1 submitted 8 June, 2023;
originally announced June 2023.
-
On the Effect of Instrumentation on Test Flakiness
Authors:
Shawn Rasheed,
Jens Dietrich,
Amjed Tahir
Abstract:
Test flakiness is a problem that affects testing and processes that rely on it. Several factors cause or influence the flakiness of test outcomes. Test execution order, randomness and concurrency are some of the more common and well-studied causes. Some studies mention code instrumentation as a factor that causes or affects test flakiness. However, evidence for this issue is scarce. In this study, we attempt to systematically collect evidence for the effects of instrumentation on test flakiness. We experiment with common types of instrumentation for Java programs - namely, application performance monitoring, coverage and profiling instrumentation. We then study the effects of instrumentation on a set of nine programs obtained from an existing dataset used to study test flakiness, consisting of popular GitHub projects written in Java. We observe cases where real-world instrumentation causes flakiness in a program. However, this effect is rare. We also discuss a related issue - how instrumentation may interfere with flakiness detection and prevention.
Submitted 16 March, 2023;
originally announced March 2023.
-
Towards Understanding Provenance in Industry
Authors:
Matthias Galster,
Jens Dietrich
Abstract:
Context: Trustworthiness of software has become a first-class concern of users (e.g., to understand software-made decisions). Also, there is increasing demand to demonstrate regulatory compliance of software and end users want to understand how software-intensive systems make decisions that affect them. Objective: We aim to provide a step towards understanding provenance needs of the software industry to support trustworthy software. Provenance is information about entities, activities, and people involved in producing data, software, or output of software, and used to assess software quality, reliability and trustworthiness of digital products and services. Method: Based on data from in-person and questionnaire-based interviews with professionals in leadership roles we develop an "influence map" to analyze who drives provenance, when provenance is relevant, what is impacted by provenance and how provenance can be managed. Results: The influence map helps decision makers navigate concerns related to provenance. It can also act as a checklist for initial provenance analyses of systems. It is empirically grounded and designed bottom-up (based on perceptions of practitioners) rather than top-down (from regulations or policies). Conclusion: We present an imperfect first step towards understanding provenance based on current perceptions and offer a preliminary view ahead.
Submitted 12 February, 2023;
originally announced February 2023.
-
Test Flakiness' Causes, Detection, Impact and Responses: A Multivocal Review
Authors:
Shawn Rasheed,
Amjed Tahir,
Jens Dietrich,
Negar Hashemi,
Lu Zhang
Abstract:
Flaky tests (tests with non-deterministic outcomes) pose a major challenge for software testing. They are known to cause significant issues such as reducing the effectiveness and efficiency of testing and delaying software releases. In recent years, there has been an increased interest in flaky tests, with research focusing on different aspects of flakiness, such as identifying causes, detection methods and mitigation strategies. Test flakiness has also become a key discussion point for practitioners (in blog posts, technical magazines, etc.) as the impact of flaky tests is felt across the industry. This paper presents a multivocal review that investigates how flaky tests, as a topic, have been addressed in both research and practice. We cover a total of 651 articles (560 academic articles and 91 grey literature articles/posts), and structure the body of relevant research and knowledge using four different dimensions: causes, detection, impact and responses. For each of those dimensions we provide a categorisation, and classify existing research, discussions, methods and tools. With this, we provide a comprehensive and current snapshot of existing thinking on test flakiness, covering both academic views and industrial practices, and identify limitations and opportunities for future research.
Submitted 1 December, 2022;
originally announced December 2022.
-
Efficient Utility Function Learning for Multi-Objective Parameter Optimization with Prior Knowledge
Authors:
Farha A. Khan,
Jörg P. Dietrich,
Christian Wirth
Abstract:
The current state-of-the-art in multi-objective optimization assumes either a given utility function, learns a utility function interactively or tries to determine the complete Pareto front, requiring a post elicitation of the preferred result. However, result elicitation in real world problems is often based on implicit and explicit expert knowledge, making it difficult to define a utility function, whereas interactive learning or post elicitation requires repeated and expensive expert involvement. To mitigate this, we learn a utility function offline, using expert knowledge by means of preference learning. In contrast to other works, we do not only use (pairwise) result preferences, but also coarse information about the utility function space. This enables us to improve the utility function estimate, especially when using very few results. Additionally, we model the occurring uncertainties in the utility function learning task and propagate them through the whole optimization chain. Our method to learn a utility function eliminates the need of repeated expert involvement while still leading to high-quality results. We show the sample efficiency and quality gains of the proposed method in 4 domains, especially in cases where the surrogate utility function is not able to exactly capture the true expert utility function. We also show that to obtain good results, it is important to consider the induced uncertainties and analyze the effect of biased samples, which is a common problem in real world domains.
Submitted 25 April, 2023; v1 submitted 22 August, 2022;
originally announced August 2022.
-
Flaky Test Sanitisation via On-the-Fly Assumption Inference for Tests with Network Dependencies
Authors:
Jens Dietrich,
Shawn Rasheed,
Amjed Tahir
Abstract:
Flaky tests cause significant problems as they can interrupt automated build processes that rely on all tests succeeding and undermine the trustworthiness of tests. Numerous causes of test flakiness have been identified, and program analyses exist to detect such tests. Typically, these methods produce advice to developers on how to refactor tests in order to make test outcomes deterministic. We argue that one source of flakiness is the lack of assumptions that precisely describe under which circumstances a test is meaningful. We devise a sanitisation technique that can isolate flaky tests quickly by inferring such assumptions on-the-fly, allowing automated builds to proceed as flaky tests are ignored. We demonstrate this approach for Java and Groovy programs by implementing it as extensions for three popular testing frameworks (JUnit4, JUnit5 and Spock) that can transparently inject the inferred assumptions. If JUnit5 is used, those extensions can be deployed without refactoring project source code. We demonstrate and evaluate the utility of our approach using a set of six popular real-world programs, addressing known test flakiness issues in these programs caused by dependencies of tests on network availability. We find that our method effectively sanitises failures induced by network connectivity problems with high precision and recall.
Submitted 1 August, 2022;
originally announced August 2022.
-
A Study of Single Statement Bugs Involving Dynamic Language Features
Authors:
Li Sui,
Shawn Rasheed,
Amjed Tahir,
Jens Dietrich
Abstract:
Dynamic language features are widely available in programming languages to implement functionality that can adapt to multiple usage contexts, enabling reuse. Functionality such as data binding, object-relational mapping and user interface builders can be heavily dependent on these features. However, their use has risks and downsides as they affect the soundness of static analyses and techniques that rely on such analyses (such as bug detection and automated program repair). They can also make software more error-prone due to potential difficulties in understanding reflective code, loss of compile-time safety and incorrect API usage. In this paper, we set out to quantify some of the effects of using dynamic language features in Java programs, that is, the error-proneness of using those features with respect to a particular type of bug known as single statement bugs. By mining 2,024 GitHub projects, we found 139 single statement bug instances (falling under 10 different bug patterns), with the highest number of bugs belonging to three specific patterns: Wrong Function Name, Same Function More Args and Change Identifier Used. These results can help practitioners to quantify the risk of using dynamic techniques over alternatives (such as code generation). We hope this classification draws attention to the choice of dynamic APIs that are likely to be error-prone, and gives developers a better understanding when designing bug detection tools for such features.
Submitted 2 April, 2022;
originally announced April 2022.
-
A Partial Reproduction of A Guided Genetic Algorithm for Automated Crash Reproduction
Authors:
Philip Oliver,
Michael Homer,
Jens Dietrich,
Craig Anslow
Abstract:
This paper is a partial reproduction of work by Soltani et al. which presented EvoCrash, a tool for replicating software failures in Java by reproducing stack traces. EvoCrash uses a guided genetic algorithm to generate JUnit test cases capable of reproducing failures more reliably than existing coverage-based solutions. In this paper, we present the findings of our reproduction of the initial study exploring the effectiveness of EvoCrash and comparison to three existing solutions: STAR, JCHARMING, and MuCrash. We further explored the capabilities of EvoCrash on different programs to check for selection bias. We found that we can reproduce the crashes covered by EvoCrash in the original study while reproducing two additional crashes not reported as reproduced. We also find that EvoCrash was unsuccessful in reproducing several crashes from the JCHARMING paper, which were excluded from the original study. Both EvoCrash and JCHARMING could reproduce 73% of the crashes from the JCHARMING paper. We found that there was potentially some selection bias in the dataset for EvoCrash. We also found that some crashes had been reported as non-reproducible even when EvoCrash could reproduce them. We suggest this may be due to EvoCrash becoming stuck in a local optimum.
Submitted 1 August, 2021; v1 submitted 25 July, 2021;
originally announced July 2021.
-
Putting the Semantics into Semantic Versioning
Authors:
Patrick Lam,
Jens Dietrich,
David J. Pearce
Abstract:
The long-standing aspiration for software reuse has made astonishing strides in the past few years. Many modern software development ecosystems now come with rich sets of publicly-available components contributed by the community. Downstream developers can leverage these upstream components, boosting their productivity.
However, components evolve at their own pace. This imposes obligations on and yields benefits for downstream developers, especially since changes can be breaking, requiring additional downstream work to adapt to. Upgrading too late leaves downstream vulnerable to security issues and missing out on useful improvements; upgrading too early results in excess work. Semantic versioning has been proposed as an elegant mechanism to communicate levels of compatibility, enabling downstream developers to automate dependency upgrades.
While it is questionable whether a version number can adequately characterize version compatibility in general, we argue that developers would greatly benefit from tools such as semantic version calculators to help them upgrade safely. The time is now for the research community to develop such tools: large component ecosystems exist and are accessible, component interactions have become observable through automated builds, and recent advances in program analysis make the development of relevant tools feasible. In particular, contracts (both traditional and lightweight) are a promising input to semantic versioning calculators, which can suggest whether an upgrade is likely to be safe.
Submitted 16 August, 2020;
originally announced August 2020.
-
Generating Mock Skeletons for Lightweight Web-Service Testing
Authors:
Thilini Bhagya,
Jens Dietrich,
Hans Guesgen
Abstract:
Modern application development allows applications to be composed using lightweight HTTP services. Testing such an application requires the availability of services that the application makes requests to. However, access to dependent services during testing may be restricted. Simulating the behaviour of such services is, therefore, useful to address their absence and proceed with application testing. This paper examines the appropriateness of Symbolic Machine Learning algorithms to automatically synthesise HTTP services' mock skeletons from network traffic recordings. These skeletons can then be customised to create mocks that can generate service responses suitable for testing. The mock skeletons have human-readable logic for key aspects of service responses, such as headers and status codes, and are highly accurate.
Submitted 21 October, 2019;
originally announced October 2019.
-
Visualizing Design Erosion: How Big Balls of Mud are Made
Authors:
David Baum,
Jens Dietrich,
Craig Anslow,
Richard Müller
Abstract:
Software systems are not static: they have to undergo frequent changes to stay fit for purpose, and in the process of doing so, their complexity increases. It has been observed that this process often leads to the erosion of the system's design and architecture and, with it, the decline of many desirable quality attributes, such as maintainability. This process can be captured in terms of antipatterns, which are atomic violations of widely accepted design principles. We present a visualisation that exposes the design of evolving Java programs, highlighting instances of selected antipatterns including their emergence and cancerous growth. This visualisation assists software engineers and architects in assessing, tracing and therefore combating design erosion. We evaluated the effectiveness of the visualisation in four case studies with ten participants.
Submitted 16 July, 2018;
originally announced July 2018.
-
GHTraffic: A Dataset for Reproducible Research in Service-Oriented Computing
Authors:
Thilini Bhagya,
Jens Dietrich,
Hans Guesgen,
Steve Versteeg
Abstract:
We present GHTraffic, a dataset of significant size comprising HTTP transactions extracted from GitHub data and augmented with synthetic transaction data. The dataset facilitates reproducible research on many aspects of service-oriented computing. This paper discusses use cases for such a dataset and extracts a set of requirements from these use cases. We then discuss the design of GHTraffic, and the methods and tool used to construct it. We conclude our contribution with some selective metrics that characterise GHTraffic.
Submitted 9 June, 2018;
originally announced June 2018.
-
What Java Developers Know About Compatibility, And Why This Matters
Authors:
Jens Dietrich,
Kamil Jezek,
Premek Brada
Abstract:
Real-world programs are neither monolithic nor static -- they are constructed using platform and third party libraries, and both programs and libraries continuously evolve in response to change pressure. In case of the Java language, rules defined in the Java Language and Java Virtual Machine Specifications define when library evolution is safe. These rules distinguish between three types of compatibility - binary, source and behavioural. We claim that some of these rules are counter intuitive and not well-understood by many developers. We present the results of a survey where we quizzed developers about their understanding of the various types of compatibility. 414 developers responded to our survey. We find that while most programmers are familiar with the rules of source compatibility, they generally lack knowledge about the rules of binary and behavioural compatibility. This can be problematic when organisations switch from integration builds to technologies that require dynamic linking, such as OSGi. We have assessed the gravity of the problem by studying how often linkage-related problems are referenced in issue tracking systems, and find that they are common.
Submitted 11 August, 2014;
originally announced August 2014.
-
On the Detection of High-Impact Refactoring Opportunities in Programs
Authors:
Jens Dietrich,
Catherine McCartin,
Ewan Tempero,
Syed M. Ali Shah
Abstract:
We present a novel approach to detect refactoring opportunities by measuring the participation of references between types in instances of patterns representing design flaws. This technique is validated using an experiment where we analyse a set of 95 open-source Java programs for instances of four patterns representing modularisation problems. It turns out that our algorithm can detect high-impact refactoring opportunities: a small number of references such that the removal of those references removes the majority of patterns from the program.
Submitted 15 March, 2011; v1 submitted 9 June, 2010;
originally announced June 2010.
-
Disentangling Visibility and Self-Promotion Bias in the arXiv:astro-ph Positional Citation Effect
Authors:
J. P. Dietrich
Abstract:
We established in an earlier study that articles listed at or near the top of the daily arXiv:astro-ph mailings receive on average significantly more citations than articles further down the list. In our earlier work we were not able to decide whether this positional citation effect was due to author self-promotion of intrinsically more citable papers or whether papers are cited more often simply because they are at the top of the astro-ph listing. Using new data we can now disentangle both effects. Based on their submission times we separate articles into a self-promoted sample and a sample of articles that achieved a high rank on astro-ph by chance and compare their citation distributions with those of articles on lower astro-ph positions. We find that the positional citation effect is a superposition of self-promotion and visibility bias.
Submitted 25 June, 2008; v1 submitted 2 May, 2008;
originally announced May 2008.
-
The Importance of Being First: Position Dependent Citation Rates on arXiv:astro-ph
Authors:
J. P. Dietrich
Abstract:
We study the dependence of citation counts of e-prints published on the arXiv:astro-ph server on their position in the daily astro-ph listing. Using the SPIRES literature database we reconstruct the astro-ph listings from July 2002 to December 2005 and determine citation counts for e-prints from their ADS entry. We use Zipf plots to analyze the citation distributions for each astro-ph position. We find that e-prints appearing at or near the top of the astro-ph mailings receive significantly more citations than those further down the list. This difference is significant at the 7 sigma level and on average amounts to two times more citations for papers at the top than those further down the listing. We propose three possible non-exclusive explanations for this positional citation effect and try to test them. We conclude that self-promotion by authors plays a role in the observed effect but cannot exclude that increased visibility at the top of the daily listings contributes to higher citation counts as well. We can rule out that the positional dependence of citations is caused by the coincidence of the submission deadline with the working hours of a geographically constrained set of intrinsically higher cited authors. We discuss several ways of mitigating the observed effect, including splitting astro-ph into several subject classes, randomizing the order of e-prints, and a novel approach to sorting entries by relevance to individual readers.
Submitted 6 December, 2007;
originally announced December 2007.