-
On Leakage in Machine Learning Pipelines
Authors:
Leonard Sasse,
Eliana Nicolaisen-Sobesky,
Juergen Dukart,
Simon B. Eickhoff,
Michael Götz,
Sami Hamdan,
Vera Komeyer,
Abhijit Kulkarni,
Juha Lahnakoski,
Bradley C. Love,
Federico Raimondo,
Kaustubh R. Patil
Abstract:
Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction with applications across a variety of fields from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage, typically resulting in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to expand understanding of the causes of leakage when designing, implementing, and evaluating ML pipelines. Illustrated by concrete examples, we provide a comprehensive overview and discussion of various types of leakage that may arise in ML pipelines.
Submitted 5 March, 2024; v1 submitted 7 November, 2023;
originally announced November 2023.
-
Julearn: an easy-to-use library for leakage-free evaluation and inspection of ML models
Authors:
Sami Hamdan,
Shammi More,
Leonard Sasse,
Vera Komeyer,
Kaustubh R. Patil,
Federico Raimondo
Abstract:
The fast-paced development of machine learning (ML) methods, coupled with their increasing adoption in research, poses challenges for researchers without extensive training in ML. In neuroscience, for example, ML can help understand brain-behavior relationships, diagnose diseases, and develop biomarkers using various data sources like magnetic resonance imaging and electroencephalography. The primary objective of ML is to build models that can make accurate predictions on unseen data. Researchers aim to prove the existence of such generalizable models by evaluating performance using techniques such as cross-validation (CV), which uses systematic subsampling to estimate the generalization performance. Choosing a CV scheme and evaluating an ML pipeline can be challenging and, if done improperly, can lead to overestimated results and incorrect interpretations.
We created julearn, an open-source Python library that allows researchers to design and evaluate complex ML pipelines without encountering common pitfalls. In this manuscript, we present the rationale behind julearn's design and its core features, and showcase three examples of previously published research projects that can be easily implemented using this novel library. Julearn aims to simplify the entry into the ML world by providing an easy-to-use environment with built-in guards against some of the most common ML pitfalls. With its design, unique features and simple interface, it is a useful Python-based library for research projects.
Submitted 19 October, 2023;
originally announced October 2023.
-
Resource-Interaction Graph: Efficient Graph Representation for Anomaly Detection
Authors:
James Pope,
**yuan Liang,
Vijay Kumar,
Francesco Raimondo,
Xinyi Sun,
Ryan McConville,
Thomas Pasquier,
Rob Piechocki,
George Oikonomou,
Bo Luo,
Dan Howarth,
Ioannis Mavromatis,
Adrian Sanchez Mompo,
Pietro Carnelli,
Theodoros Spyridopoulos,
Aftab Khan
Abstract:
Security research has concentrated on converting operating system audit logs into suitable graphs, such as provenance graphs, for analysis. However, provenance graphs can grow very large requiring significant computational resources beyond what is necessary for many security tasks and are not feasible for resource constrained environments, such as edge devices. To address this problem, we present the resource-interaction graph that is built directly from the audit log. We show that the resource-interaction graph's storage requirements are significantly lower than provenance graphs using an open-source data set with two container escape attacks captured from an edge device. We use a graph autoencoder and graph clustering technique to evaluate the representation for an anomaly detection task. Both approaches are unsupervised and are thus suitable for detecting zero-day attacks. The approaches can achieve F1 scores typically over 80% and in some cases over 90% for the selected data set and attacks.
Submitted 16 December, 2022;
originally announced December 2022.
-
LE3D: A Lightweight Ensemble Framework of Data Drift Detectors for Resource-Constrained Devices
Authors:
Ioannis Mavromatis,
Adrian Sanchez-Mompo,
Francesco Raimondo,
James Pope,
Marcello Bullo,
Ingram Weeks,
Vijay Kumar,
Pietro Carnelli,
George Oikonomou,
Theodoros Spyridopoulos,
Aftab Khan
Abstract:
Data integrity becomes paramount as the number of Internet of Things (IoT) sensor deployments increases. Sensor data can be altered by benign causes or malicious actions. Mechanisms that detect drifts and irregularities can prevent disruptions and data bias in the state of an IoT application. This paper presents LE3D, an ensemble framework of data drift estimators capable of detecting abnormal sensor behaviours. Working collaboratively with surrounding IoT devices, the type of drift (natural/abnormal) can also be identified and reported to the end-user. The proposed framework is a lightweight and unsupervised implementation able to run on resource-constrained IoT devices. Our framework is also generalisable, adapting to new sensor streams and environments with minimal online reconfiguration. We compare our method against state-of-the-art ensemble data drift detection frameworks, evaluating both the real-world detection accuracy and the resource utilisation of the implementation. Experimenting with real-world data and emulated drifts, we show the effectiveness of our method, which achieves up to 97% detection accuracy while requiring minimal resources to run.
Submitted 18 November, 2022; v1 submitted 3 November, 2022;
originally announced November 2022.
-
Autoreject: Automated artifact rejection for MEG and EEG data
Authors:
Mainak Jas,
Denis A. Engemann,
Yousra Bekhti,
Federico Raimondo,
Alexandre Gramfort
Abstract:
We present an automated algorithm for unified rejection and repair of bad trials in magnetoencephalography (MEG) and electroencephalography (EEG) signals. Our method capitalizes on cross-validation in conjunction with a robust evaluation metric to estimate the optimal peak-to-peak threshold -- a quantity commonly used for identifying bad trials in M/EEG. This approach is then extended to a more sophisticated algorithm which estimates this threshold for each sensor yielding trial-wise bad sensors. Depending on the number of bad sensors, the trial is then repaired by interpolation or by excluding it from subsequent analysis. All steps of the algorithm are fully automated thus lending itself to the name Autoreject.
In order to assess the practical significance of the algorithm, we conducted extensive validation and comparison with state-of-the-art methods on four public datasets containing MEG and EEG recordings from more than 200 subjects. Comparisons include purely qualitative efforts as well as quantitative benchmarking against human-supervised and semi-automated preprocessing pipelines. The algorithm allowed us to automate the preprocessing of MEG data from the Human Connectome Project (HCP) going up to the computation of the evoked responses. The automated nature of our method minimizes the burden of human inspection, hence supporting scalability and reliability demanded by data analysis in modern neuroscience.
Submitted 12 July, 2017; v1 submitted 24 December, 2016;
originally announced December 2016.