-
On Leakage in Machine Learning Pipelines
Authors:
Leonard Sasse,
Eliana Nicolaisen-Sobesky,
Juergen Dukart,
Simon B. Eickhoff,
Michael Götz,
Sami Hamdan,
Vera Komeyer,
Abhijit Kulkarni,
Juha Lahnakoski,
Bradley C. Love,
Federico Raimondo,
Kaustubh R. Patil
Abstract:
Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding associated with causes leading to leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.
Submitted 5 March, 2024; v1 submitted 7 November, 2023;
originally announced November 2023.
-
Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models
Authors:
Sami Hamdan,
Shammi More,
Leonard Sasse,
Vera Komeyer,
Kaustubh R. Patil,
Federico Raimondo
Abstract:
The fast-paced development of machine learning (ML) methods, coupled with their increasing adoption in research, poses challenges for researchers without extensive training in ML. In neuroscience, for example, ML can help understand brain-behavior relationships, diagnose diseases, and develop biomarkers using various data sources like magnetic resonance imaging and electroencephalography. The primary objective of ML is to build models that can make accurate predictions on unseen data. Researchers aim to prove the existence of such generalizable models by evaluating performance using techniques such as cross-validation (CV), which uses systematic subsampling to estimate the generalization performance. Choosing a CV scheme and evaluating an ML pipeline can be challenging and, if done improperly, can lead to overestimated results and incorrect interpretations.
We created julearn, an open-source Python library that allows researchers to design and evaluate complex ML pipelines without encountering common pitfalls. In this manuscript, we present the rationale behind julearn's design and its core features, and showcase three examples of previously published research projects that can be easily implemented using this novel library. Julearn aims to simplify the entry into the ML world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls. With its design, unique features, and simple interface, it is a useful Python-based library for research projects.
Submitted 19 October, 2023;
originally announced October 2023.
-
Resource-Interaction Graph: Efficient Graph Representation for Anomaly Detection
Authors:
James Pope,
**yuan Liang,
Vijay Kumar,
Francesco Raimondo,
Xinyi Sun,
Ryan McConville,
Thomas Pasquier,
Rob Piechocki,
George Oikonomou,
Bo Luo,
Dan Howarth,
Ioannis Mavromatis,
Adrian Sanchez Mompo,
Pietro Carnelli,
Theodoros Spyridopoulos,
Aftab Khan
Abstract:
Security research has concentrated on converting operating system audit logs into suitable graphs, such as provenance graphs, for analysis. However, provenance graphs can grow very large, requiring significant computational resources beyond what is necessary for many security tasks, and are not feasible for resource-constrained environments, such as edge devices. To address this problem, we present the resource-interaction graph that is built directly from the audit log. We show that the resource-interaction graph's storage requirements are significantly lower than provenance graphs using an open-source data set with two container escape attacks captured from an edge device. We use a graph autoencoder and graph clustering technique to evaluate the representation for an anomaly detection task. Both approaches are unsupervised and are thus suitable for detecting zero-day attacks. The approaches can achieve F1 scores typically over 80% and in some cases over 90% for the selected data set and attacks.
Submitted 16 December, 2022;
originally announced December 2022.
-
LE3D: A Lightweight Ensemble Framework of Data Drift Detectors for Resource-Constrained Devices
Authors:
Ioannis Mavromatis,
Adrian Sanchez-Mompo,
Francesco Raimondo,
James Pope,
Marcello Bullo,
Ingram Weeks,
Vijay Kumar,
Pietro Carnelli,
George Oikonomou,
Theodoros Spyridopoulos,
Aftab Khan
Abstract:
Data integrity becomes paramount as the number of Internet of Things (IoT) sensor deployments increases. Sensor data can be altered by benign causes or malicious actions. Mechanisms that detect drifts and irregularities can prevent disruptions and data bias in the state of an IoT application. This paper presents LE3D, an ensemble framework of data drift estimators capable of detecting abnormal sensor behaviours. Working collaboratively with surrounding IoT devices, the type of drift (natural/abnormal) can also be identified and reported to the end-user. The proposed framework is a lightweight and unsupervised implementation able to run on resource-constrained IoT devices. Our framework is also generalisable, adapting to new sensor streams and environments with minimal online reconfiguration. We compare our method against state-of-the-art ensemble data drift detection frameworks, evaluating both the real-world detection accuracy as well as the resource utilisation of the implementation. Experimenting with real-world data and emulated drifts, we show the effectiveness of our method, which achieves up to 97% of detection accuracy while requiring minimal resources to run.
Submitted 18 November, 2022; v1 submitted 3 November, 2022;
originally announced November 2022.