The Untold Impact of Learning Approaches on Software Fault-Proneness Predictions

Authors: Mohammad Jamil Ahmad, Katerina Goseva-Popstojanova, Robyn R. Lutz

Abstract: Software fault-proneness prediction is an active research area, with many factors affecting prediction performance extensively studied. However, the impact of the learning approach (i.e., the specifics of the data used for training and the target variable being predicted) on the prediction performance has not been studied, except for one initial work. This paper explores the effects of two learning approaches, useAllPredictAll and usePrePredictPost, on the performance of software fault-proneness prediction, both within-release and across-releases. The empirical results are based on data extracted from 64 releases of twelve open-source projects. Results show that the learning approach has a substantial, and typically unacknowledged, impact on the classification performance. Specifically, using useAllPredictAll leads to significantly better performance than using the usePrePredictPost learning approach, both within-release and across-releases. Furthermore, this paper uncovers that, for within-release predictions, this difference in classification performance is due to different levels of class imbalance in the two learning approaches. When class imbalance is addressed, the performance difference between the learning approaches is eliminated. Our findings imply that the learning approach should always be explicitly identified and its impact on software fault-proneness prediction considered. The paper concludes with a discussion of potential consequences of our results for both research and practice.

Submitted 12 July, 2022; originally announced July 2022.

arXiv:1810.03190 [pdf, ps, other]

doi 10.1145/3225058.3225101

Scalable Solutions for Automated Single Pulse Identification and Classification in Radio Astronomy

Authors: Thomas Devine, Katerina Goseva-Popstojanova, Di Pang

Abstract: Data collection for scientific applications is increasing exponentially and is forecasted to soon reach peta- and exabyte scales. Applications which process and analyze scientific data must be scalable and focus on execution performance to keep pace. In the field of radio astronomy, in addition to increasingly large datasets, tasks such as the identification of transient radio signals from extrasolar sources are computationally expensive. We present a scalable approach to radio pulsar detection written in Scala that parallelizes candidate identification to take advantage of in-memory task processing using Apache Spark on a YARN distributed system. Furthermore, we introduce a novel automated multiclass supervised machine learning technique that we combine with feature selection to reduce the time required for candidate classification. Experimental testing on a Beowulf cluster with 15 data nodes shows that the parallel implementation of the identification algorithm offers a speedup of up to 5X that of a similar multithreaded implementation. Further, we show that the combination of automated multiclass classification and feature selection speeds up the execution performance of the RandomForest machine learning algorithm by an average of 54% with less than a 2% average reduction in the algorithm's ability to correctly classify pulsars. The generalizability of these results is demonstrated by using two real-world radio astronomy data sets.

Submitted 7 October, 2018; originally announced October 2018.

Comments: In Proceedings of the 47th International Conference on Parallel Processing (ICPP 2018). ACM, New York, NY, USA, Article 11, 11 pages

arXiv:1807.07164 [pdf, other]

doi 10.1093/mnras/sty1992

A novel single-pulse search approach to detection of dispersed radio pulses using clustering and supervised machine learning

Authors: Di Pang, Katerina Goseva-Popstojanova, Thomas Devine, Maura McLaughlin

Abstract: We present a novel two-stage approach which combines unsupervised and supervised machine learning to automatically identify and classify single pulses in radio pulsar search data. In the first stage, we identify astrophysical pulse candidates in the data, which were derived from the Pulsar Arecibo L-Band Feed Array (PALFA) survey and contain 47,042 independent beams, as trial single-pulse event groups (SPEGs) by clustering single-pulse events and merging clusters that fall within the expected DM and time span of astrophysical pulses. We also present a new peak scoring algorithm, to identify astrophysical peaks in S/N versus DM curves. Furthermore, we group SPEGs detected at a consistent DM for they were likely emitted by the same source. In the second stage, we create a fully labelled benchmark data set by selecting a subset of data with SPEGs identified (using stage 1 procedures), their features extracted and individual SPEGs manually labelled, and then train classifiers using supervised machine learning. Next, using the best trained classifier, we automatically classify unlabelled SPEGs identified in the full data set. To aid the examination of dim SPEGs, we develop an algorithm that searches for an underlying periodicity among grouped SPEGs. The results showed that RandomForest with SMOTE treatment was the best learner, with a recall of 95.6% and a false positive rate of 2.0%. In total, besides all 60 known pulsars from the benchmark data set, the model found 32 additional (i.e., not included in the benchmark data set) known pulsars, and several potential discoveries.

Submitted 5 August, 2018; v1 submitted 18 July, 2018; originally announced July 2018.

Comments: 22 pages, accepted for publication in MNRAS, ref. MN-17-3830-MJ.R2

arXiv:1805.06541 [pdf, other]

Towards Malware Detection via CPU Power Consumption: Data Collection Design and Analytics (Extended Version)

Authors: Robert Bridges, Jarilyn Hernandez Jimenez, Jeffrey Nichols, Katerina Goseva-Popstojanova, Stacy Prowell

Abstract: This paper presents an experimental design and data analytics approach aimed at power-based malware detection on general-purpose computers. Leveraging the fact that malware executions must consume power, we explore the postulate that malware can be accurately detected via power data analytics. Our experimental design and implementation allow for programmatic collection of CPU power profiles for fixed tasks during uninfected and infected states using five different rootkits. To characterize the power consumption profiles, we use both simple statistical and novel, sophisticated features. We test a one-class anomaly detection ensemble (that baselines non-infected power profiles) and several kernel-based SVM classifiers (that train on both uninfected and infected profiles) in detecting previously unseen malware and clean profiles. The anomaly detection system exhibits perfect detection when using all features and tasks, with smaller false detection rate than the supervised classifiers. The primary contribution is the proof of concept that baselining power of fixed tasks can provide accurate detection of rootkits. Moreover, our treatment presents engineering hurdles needed for experimentation and allows analysis of each statistical feature individually. This work appears to be the first step towards a viable power-based detection capability for general-purpose computers, and presents next steps toward this goal.

Submitted 16 May, 2018; originally announced May 2018.

Comments: Published version appearing in IEEE TrustCom-18. This version contains more details on mathematics and data collection

arXiv:1705.01977 [pdf, other]

Malware Detection on General-Purpose Computers Using Power Consumption Monitoring: A Proof of Concept and Case Study

Authors: Jarilyn M. Hernández Jiménez, Jeffrey A. Nichols, Katerina Goseva-Popstojanova, Stacy Prowell, Robert A. Bridges

Abstract: Malware detection is challenging when faced with automatically generated and polymorphic malware, as well as with rootkits, which are exceptionally hard to detect. In an attempt to contribute towards addressing these challenges, we conducted a proof of concept study that explored the use of power consumption for detection of malware presence in a general-purpose computer. The results of our experiments indicate that malware indeed leaves a signal on the power consumption of a general-purpose computer. Specifically, for the case study based on two different rootkits, the data collected at the +12V rails on the motherboard showed the most noticeable increment of the power consumption after the computer was infected. Our future work includes experimenting with more malware examples and workloads, and developing a data analytics approach for automatic malware detection based on power consumption.

Submitted 4 May, 2017; originally announced May 2017.

arXiv:1603.09461 [pdf, ps, other]

doi 10.1093/mnras/stw655

Detection of Dispersed Radio Pulses: A machine learning approach to candidate identification and classification

Authors: Thomas Devine, Katerina Goseva-Popstojanova, Maura McLaughlin

Abstract: Searching for extraterrestrial, transient signals in astronomical data sets is an active area of current research. However, machine learning techniques are lacking in the literature concerning single-pulse detection. This paper presents a new, two-stage approach for identifying and classifying dispersed pulse groups (DPGs) in single-pulse search output. The first stage identified DPGs and extracted features to characterize them using a new peak identification algorithm which tracks sloping tendencies around local maxima in plots of signal-to-noise ratio vs. dispersion measure. The second stage used supervised machine learning to classify DPGs. We created four benchmark data sets: one unbalanced and three balanced versions using three different imbalance treatments. We empirically evaluated 48 classifiers by training and testing binary and multiclass versions of six machine learning algorithms on each of the four benchmark versions. While each classifier had advantages and disadvantages, all classifiers with imbalance treatments had higher recall values than those with unbalanced data, regardless of the machine learning algorithm used. Based on the benchmarking results, we selected a subset of classifiers to classify the full, unlabelled data set of over 1.5 million DPGs identified in 42,405 observations made by the Green Bank Telescope. Overall, the classifiers using a multiclass ensemble tree learner in combination with two oversampling imbalance treatments were the most efficient; they identified additional known pulsars not in the benchmark data set and provided six potential discoveries, with significantly fewer false positives than the other classifiers.

Submitted 31 March, 2016; originally announced March 2016.

Comments: 13 pages, accepted for publication in MNRAS, ref. MN-15-1713-MJ.R3

Showing 1–6 of 6 results for author: Goseva-Popstojanova, K