Search | arXiv e-print repository

BayesFLo: Bayesian fault localization of complex software systems

Authors: Yi Ji, Simon Mak, Ryan Lekivetz, Joseph Morgan

Abstract: Software testing is essential for the reliable development of complex software systems. A key step in software testing is fault localization, which uses test data to pinpoint failure-inducing combinations for further diagnosis. Existing fault localization methods, however, are largely deterministic, and thus do not provide a principled approach for assessing probabilistic risk of potential root causes, or for integrating domain and/or structural knowledge from test engineers. To address this, we propose a novel Bayesian fault localization framework called BayesFLo, which leverages a flexible Bayesian model on potential root cause combinations. A key feature of BayesFLo is its integration of the principles of combination hierarchy and heredity, which capture the structured nature of failure-inducing combinations. A critical challenge, however, is the sheer number of potential root cause scenarios to consider, which renders the computation of posterior root cause probabilities infeasible even for small software systems. We thus develop new algorithms for efficient computation of such probabilities, leveraging recent tools from integer programming and graph representations. We then demonstrate the effectiveness of BayesFLo over state-of-the-art fault localization methods, in a suite of numerical experiments and in two motivating case studies on the JMP XGBoost interface.

Submitted 12 March, 2024; originally announced March 2024.

arXiv:2201.06465 [pdf, other]

doi 10.1109/SWC50871.2021.00098

Process Visualization of Manufacturing Execution System (MES) Data

Authors: Meadhbh O'Neill, Jeff Morgan, Kevin Burke

Abstract: Process visualizations of data from manufacturing execution systems (MESs) provide the ability to generate valuable insights for improved decision-making. Industry 4.0 is awakening a digital transformation where advanced analytics and visualizations are critical. Exploiting MESs with data-driven strategies can have a major impact on business outcomes. The advantages of employing process visualizations are demonstrated through an application to real-world data. Visualizations, such as dashboards, enable the user to examine the performance of a production line at a high level. Furthermore, the addition of interactivity lets the user customize the data they want to observe. Evidence of process variability between shifts and days of the week can be investigated with the goal of optimizing production.

Submitted 17 January, 2022; originally announced January 2022.

MSC Class: 62P30

Journal ref: 2021 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computing, Scalable Computing & Communications, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/IOP/SCI) (2021) 659-664

arXiv:1805.05002 [pdf, other]

How can the score test be consistent?

Authors: N. Karavarsamis, G. Guillera-Arroita, RM Huggins, B J T Morgan

Abstract: The score test statistic using the observed information is easy to compute numerically. Its large sample distribution under the null hypothesis is well known and is equivalent to that of the score test based on the expected information, the likelihood-ratio test and the Wald test. However, several authors have noted that under the alternative this no longer holds and in particular the statistic can take negative values. Here we examine the score test using the observed information in the context of comparing two binomial proportions under imperfect detection, a common problem in ecology when studying occurrence of species. We demonstrate through a combination of simulations and theoretical analysis that the new modified rule we propose, which rejects the null hypothesis when the observed score statistic is larger than the usual chi-square cut-off or is negative, has power that is mostly greater than that of any other test. In addition, consistency is largely restored. Our new test is easy to use and inference is always possible.

Submitted 9 August, 2018; v1 submitted 13 May, 2018; originally announced May 2018.
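
The rejection rule described in this abstract can be sketched in a few lines. This is only an illustration under stated assumptions: the function name and the default cut-off (the 95% quantile of a chi-square distribution with 1 degree of freedom, about 3.841) are hypothetical choices, not taken from the paper, and the computation of the observed-information score statistic itself is not shown.

```python
# Assumed default: 0.95 quantile of chi-square with 1 degree of freedom.
# The abstract only says "the usual chi-square cut-off"; this value is an assumption.
CHI2_1DF_95 = 3.841458820694124


def modified_score_test_rejects(score_stat: float, cutoff: float = CHI2_1DF_95) -> bool:
    """Return True if the null hypothesis is rejected under the modified rule.

    The rule rejects when the observed-information score statistic exceeds
    the chi-square cut-off OR when the statistic is negative.
    """
    return score_stat < 0 or score_stat > cutoff


if __name__ == "__main__":
    for stat in (-0.5, 1.0, 5.0):
        print(stat, modified_score_test_rejects(stat))
```

Treating a negative statistic as grounds for rejection is what distinguishes this rule from the usual one-sided comparison against the cut-off.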

arXiv:1802.04350 [pdf, other]

Cost-Aware Learning for Improved Identifiability with Multiple Experiments

Authors: Longyun Guo, Jean Honorio, John Morgan

Abstract: We analyze the sample complexity of learning from multiple experiments where the experimenter has a total budget for obtaining samples. In this problem, the learner should choose a hypothesis that performs well with respect to multiple experiments, and their related data distributions. Each collected sample is associated with a cost which depends on the particular experiments. In our setup, a learner performs $m$ experiments, while incurring a total cost $C$. We first show that learning from multiple experiments makes it possible to improve identifiability. Additionally, by using a Rademacher complexity approach, we show that the gap between the training and generalization error is $O(C^{-1/2})$. We also provide some examples for linear prediction, two-layer neural networks and kernel methods.

Submitted 13 July, 2019; v1 submitted 12 February, 2018; originally announced February 2018.

Comments: 17 pages, 4 figures

Journal ref: IEEE International Symposium on Information Theory (ISIT) 2019

arXiv:1709.02046 [pdf, ps, other]

doi 10.1039/C7CP03346J

Properties of Kinetic Transition Networks for Atomic Clusters and Glassy Solids

Authors: John W R Morgan, Dhagash Mehta, David J Wales

Abstract: A database of minima and transition states corresponds to a network where the minima represent nodes and the transition states correspond to edges between the pairs of minima they connect via steepest-descent paths. Here we construct networks for small clusters bound by the Morse potential for a selection of physically relevant parameters, in two and three dimensions. The properties of these unweighted and undirected networks are analysed to examine two features: whether they are small-world, where the shortest path between nodes involves only a small number of edges; and whether they are scale-free, having a degree distribution that follows a power law. Small-world character is present, but statistical tests show that a power law is not a good fit, so the networks are not scale-free. These results for clusters are compared with the corresponding properties for the molecular and atomic structural glass formers ortho-terphenyl and binary Lennard-Jones. These glassy systems do not show small-world properties, suggesting that such behaviour is linked to the structure-seeking landscapes of the Morse clusters.

Submitted 6 September, 2017; originally announced September 2017.

Comments: 23 pages, 19 figures. Accepted for publication in Physical Chemistry Chemical Physics

arXiv:1611.09829 [pdf]

doi 10.1088/0967-3334/36/1/107

A Statistical Index for Early Diagnosis of Ventricular Arrhythmia from the Trend Analysis of ECG Phase-portraits

Authors: Grazia Cappiello, Saptarshi Das, Evangelos B. Mazomenos, Koushik Maharatna, George Koulaouzidis, John Morgan, Paolo Emilio Puddu

Abstract: In this paper, we propose a novel statistical index for the early diagnosis of ventricular arrhythmia (VA) using the time delay phase-space reconstruction (PSR) technique, from the electrocardiogram (ECG) signal. Patients with two classes of fatal VA, with preceding ventricular premature beats (VPBs) and with no VPBs, have been analysed using extensive simulations. Three subclasses of VA with VPBs viz. ventricular tachycardia (VT), ventricular fibrillation (VF) and VT followed by VF are analyzed using the proposed technique. Measures of descriptive statistics like mean (μ), standard deviation (σ), coefficient of variation (CV = σ/μ), skewness (γ) and kurtosis (β) in phase-space diagrams are studied for a sliding window of 10 beats of ECG signal using the box-counting technique. Subsequently, a hybrid prediction index which is composed of a weighted sum of CV and kurtosis has been proposed for predicting the impending arrhythmia before its actual occurrence. The early diagnosis involves crossing the upper bound of a hybrid index which is capable of predicting an impending arrhythmia 356 ECG beats, on average (with 192 beats standard deviation) before its onset when tested with 32 VA patients (both with and without VPBs). The early diagnosis result is also verified using a leave-one-out cross-validation (LOOCV) scheme with 96.88% sensitivity, 100% specificity and 98.44% accuracy.

Submitted 29 November, 2016; originally announced November 2016.

Comments: 25 pages, 16 figures, 2 tables

Journal ref: Physiological Measurement, vol. 36, no. 1, pp. 107-131, January 2015
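
The descriptive statistics named in this abstract (μ, σ, CV = σ/μ, skewness, kurtosis) and a weighted sum of CV and kurtosis over a sliding window can be sketched as follows. This is a minimal illustration, not the authors' method: the function names, the equal weights, and the use of population moments are assumptions, and the input is taken to be a generic numeric sequence rather than the box-counting measurements on phase-space diagrams that the paper uses.

```python
from statistics import fmean, pstdev


def descriptive_stats(values):
    """Population mean, std, CV = std/mean, skewness and kurtosis of a sequence."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    m3 = fmean([(v - mu) ** 3 for v in values])  # third central moment
    m4 = fmean([(v - mu) ** 4 for v in values])  # fourth central moment
    return {
        "mean": mu,
        "std": sigma,
        "cv": sigma / mu,
        "skewness": m3 / sigma ** 3,
        "kurtosis": m4 / sigma ** 4,
    }


def sliding_hybrid_index(values, window=10, w_cv=0.5, w_kurt=0.5):
    """Weighted sum of CV and kurtosis over each sliding window.

    The weights w_cv and w_kurt are hypothetical placeholders; the abstract
    does not state the weights used in the paper.
    """
    index = []
    for start in range(len(values) - window + 1):
        stats = descriptive_stats(values[start:start + window])
        index.append(w_cv * stats["cv"] + w_kurt * stats["kurtosis"])
    return index


if __name__ == "__main__":
    print(descriptive_stats([1, 2, 3, 4, 5]))
    print(sliding_hybrid_index(list(range(1, 13))))
```

The sketch assumes each window has a nonzero mean and nonzero spread; a constant window would make the CV or the standardized moments undefined.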

arXiv:1512.05170 [pdf, other]

Bayesian analysis of Jolly-Seber type models; incorporating heterogeneity in arrival and departure

Authors: E. Matechou, G. Nicholls, B. J. T. Morgan, J. A. Collazo, J. E. Lyons

Abstract: We propose the use of finite mixtures of continuous distributions in modelling the process by which new individuals, that arrive in groups, become part of a wildlife population. We demonstrate this approach using a data set of migrating semipalmated sandpipers (Calidris pusilla) for which we extend existing stopover models to allow for individuals to have different behaviour in terms of their stopover duration at the site. We demonstrate the use of reversible jump MCMC methods to derive posterior distributions for the model parameters and the models, simultaneously. The algorithm moves between models with different numbers of arrival groups as well as between models with different numbers of behavioural groups. The approach is shown to provide new ecological insights about the stopover behaviour of semipalmated sandpipers but is generally applicable to any population in which animals arrive in groups and potentially exhibit heterogeneity in terms of one or more other processes.

Submitted 16 December, 2015; originally announced December 2015.

Showing 1–7 of 7 results for author: Morgan, J