Search | arXiv e-print repository

doi 10.1002/cpe.7635

Performance comparison of Dask and Apache Spark on HPC systems for Neuroimaging

Authors: Mathieu Dugré, Valérie Hayot-Sasson, Tristan Glatard

Abstract: The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark, and Dask, in processing neuroimaging pipelin… ▽ More The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark, and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic \HL{neuroimaging} applications to process the \SI{606}{\gibi\byte} BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while using varying combinations of the number of nodes, file size, and task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtime depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more efforts need to be put to reduce the data transfer time and the memory footprint of applications. △ Less

Submitted 3 June, 2024; originally announced June 2024.

Comments: 16 pages, 10 figures, 2 tables

Journal ref: Concurrency and Computation: Practice and Experience (2023) 35(21):e7635

arXiv:2405.17650 [pdf, other]

An Analysis of Performance Bottlenecks in MRI Pre-Processing

Authors: Mathieu Dugré, Yohan Chatelain, Tristan Glatard

Abstract: Magnetic Resonance Image (MRI) pre-processing is a critical step for neuroimaging analysis. However, the computational cost of MRI pre-processing pipelines is a major bottleneck for large cohort studies and some clinical applications. While High-Performance Computing (HPC) and, more recently, Deep Learning have been adopted to accelerate the computations, these techniques require costly hardware a… ▽ More Magnetic Resonance Image (MRI) pre-processing is a critical step for neuroimaging analysis. However, the computational cost of MRI pre-processing pipelines is a major bottleneck for large cohort studies and some clinical applications. While High-Performance Computing (HPC) and, more recently, Deep Learning have been adopted to accelerate the computations, these techniques require costly hardware and are not accessible to all researchers. Therefore, it is important to understand the performance bottlenecks of MRI pre-processing pipelines to improve their performance. Using Intel VTune profiler, we characterized the bottlenecks of several commonly used MRI-preprocessing pipelines from the ANTs, FSL, and FreeSurfer toolboxes. We found that few functions contributed to most of the CPU time, and that linear interpolation was the largest contributor. Data access was also a substantial bottleneck. We identified a bug in the ITK library that impacts the performance of ANTs pipeline in single-precision and a potential issue with the OpenMP scaling in FreeSurfer recon-all. Our results provide a reference for future efforts to optimize MRI pre-processing pipelines. △ Less

Submitted 27 May, 2024; originally announced May 2024.

Comments: 8 pages, 7 figures, 3 tables, 1 listing

arXiv:2404.11556 [pdf, other]

Hierarchical storage management in user space for neuroimaging applications

Authors: Valérie Hayot-Sasson, Tristan Glatard

Abstract: Neuroimaging open-data initiatives have led to increased availability of large scientific datasets. While these datasets are shifting the processing bottleneck from compute-intensive to data-intensive, current standardized analysis tools have yet to adopt strategies that mitigate the costs associated with large data transfers. A major challenge in adapting neuroimaging applications for data-intens… ▽ More Neuroimaging open-data initiatives have led to increased availability of large scientific datasets. While these datasets are shifting the processing bottleneck from compute-intensive to data-intensive, current standardized analysis tools have yet to adopt strategies that mitigate the costs associated with large data transfers. A major challenge in adapting neuroimaging applications for data-intensive processing is that they must be entirely rewritten. To facilitate data management for standardized neuroimaging tools, we developed Sea, a library that intercepts and redirects application read and write calls to minimize data transfer time. In this paper, we investigate the performance of Sea on three preprocessing pipelines implemented using standard toolboxes (FSL, SPM and AFNI), using three neuroimaging datasets of different sizes (OpenNeuro's ds001545, PREVENT-AD and the HCP dataset) on two high-performance computing clusters. Our results demonstrate that Sea provides large speedups (up to 32X) when the shared file system's (e.g. Lustre) performance is deteriorated. When the shared file system is not overburdened by other users, performance is unaffected by Sea, suggesting that Sea's overhead is minimal even in cases where its benefits are limited. Overall, Sea is beneficial, even when performance gain is minimal, as it can be used to limit the number of files created on parallel file systems. △ Less

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2403.19421 [pdf, other]

Scaling up ridge regression for brain encoding in a massive individual fMRI dataset

Authors: Sana Ahmadi, Pierre Bellec, Tristan Glatard

Abstract: Brain encoding with neuroimaging data is an established analysis aimed at predicting human brain activity directly from complex stimuli features such as movie frames. Typically, these features are the latent space representation from an artificial neural network, and the stimuli are image, audio, or text inputs. Ridge regression is a popular prediction model for brain encoding due to its good out-… ▽ More Brain encoding with neuroimaging data is an established analysis aimed at predicting human brain activity directly from complex stimuli features such as movie frames. Typically, these features are the latent space representation from an artificial neural network, and the stimuli are image, audio, or text inputs. Ridge regression is a popular prediction model for brain encoding due to its good out-of-sample generalization performance. However, training a ridge regression model can be highly time-consuming when dealing with large-scale deep functional magnetic resonance imaging (fMRI) datasets that include many space-time samples of brain activity. This paper evaluates different parallelization techniques to reduce the training time of brain encoding with ridge regression on the CNeuroMod Friends dataset, one of the largest deep fMRI resource currently available. With multi-threading, our results show that the Intel Math Kernel Library (MKL) significantly outperforms the OpenBLAS library, being 1.9 times faster using 32 threads on a single machine. We then evaluated the Dask multi-CPU implementation of ridge regression readily available in scikit-learn (MultiOutput), and we proposed a new "batch" version of Dask parallelization, motivated by a time complexity analysis. In line with our theoretical analysis, MultiOutput parallelization was found to be impractical, i.e., slower than multi-threading on a single machine. In contrast, the Batch-MultiOutput regression scaled well across compute nodes and threads, providing speed-ups of up to 33 times with 8 compute nodes and 32 threads compared to a single-threaded scikit-learn execution. Batch parallelization using Dask thus emerges as a scalable approach for brain encoding with ridge regression on high-performance computing systems using scikit-learn and large fMRI datasets. △ Less

Submitted 28 March, 2024; originally announced March 2024.

arXiv:2403.15405 [pdf, other]

Predicting Parkinson's disease trajectory using clinical and functional MRI features: a reproduction and replication study

Authors: Elodie Germani, Nikhil Baghwat, Mathieu Dugré, Rémi Gau, Albert Montillo, Kevin Nguyen, Andrzej Sokolowski, Madeleine Sharp, Jean-Baptiste Poline, Tristan Glatard

Abstract: Parkinson's disease (PD) is a common neurodegenerative disorder with a poorly understood physiopathology and no established biomarkers for the diagnosis of early stages and for prediction of disease progression. Several neuroimaging biomarkers have been studied recently, but these are susceptible to several sources of variability. In this context, an evaluation of the robustness of such biomarkers… ▽ More Parkinson's disease (PD) is a common neurodegenerative disorder with a poorly understood physiopathology and no established biomarkers for the diagnosis of early stages and for prediction of disease progression. Several neuroimaging biomarkers have been studied recently, but these are susceptible to several sources of variability. In this context, an evaluation of the robustness of such biomarkers is essential. This study is part of a larger project investigating the replicability of potential neuroimaging biomarkers of PD. Here, we attempt to reproduce (same data, same method) and replicate (different data or method) the models described in Nguyen et al., 2021 to predict individual's PD current state and progression using demographic, clinical and neuroimaging features (fALFF and ReHo extracted from resting-state fMRI). We use the Parkinson's Progression Markers Initiative dataset (PPMI, ppmi-info.org), as in Nguyen et al.,2021 and aim to reproduce the original cohort, imaging features and machine learning models as closely as possible using the information available in the paper and the code. We also investigated methodological variations in cohort selection, feature extraction pipelines and sets of input features. The success of the reproduction was assessed using different criteria. Notably, we obtained significantly better than chance performance using the analysis pipeline closest to that in the original study (R2 > 0), which is consistent with its findings. The challenges encountered while reproducing and replicating the original work are likely explained by the complexity of neuroimaging studies, in particular in clinical settings. We provide recommendations to further facilitate the reproducibility of such studies in the future. △ Less

Submitted 24 May, 2024; v1 submitted 20 February, 2024; originally announced March 2024.

arXiv:2308.16279 [pdf, other]

Classification of Anomalies in Telecommunication Network KPI Time Series

Authors: Korantin Bordeau-Aubert, Justin Whatley, Sylvain Nadeau, Tristan Glatard, Brigitte Jaumard

Abstract: The increasing complexity and scale of telecommunication networks have led to a growing interest in automated anomaly detection systems. However, the classification of anomalies detected on network Key Performance Indicators (KPI) has received less attention, resulting in a lack of information about anomaly characteristics and classification processes. To address this gap, this paper proposes a mo… ▽ More The increasing complexity and scale of telecommunication networks have led to a growing interest in automated anomaly detection systems. However, the classification of anomalies detected on network Key Performance Indicators (KPI) has received less attention, resulting in a lack of information about anomaly characteristics and classification processes. To address this gap, this paper proposes a modular anomaly classification framework. The framework assumes separate entities for the anomaly classifier and the detector, allowing for a distinct treatment of anomaly detection and classification tasks on time series. The objectives of this study are (1) to develop a time series simulator that generates synthetic time series resembling real-world network KPI behavior, (2) to build a detection model to identify anomalies in the time series, (3) to build classification models that accurately categorize detected anomalies into predefined classes (4) to evaluate the classification framework performance on simulated and real-world network KPI time series. This study has demonstrated the good performance of the anomaly classification models trained on simulated anomalies when applied to real-world network time series data. △ Less

Submitted 30 August, 2023; originally announced August 2023.

arXiv:2308.01939 [pdf, ps, other]

Numerical Uncertainty of Convolutional Neural Networks Inference for Structural Brain MRI Analysis

Authors: Inés Gonzalez Pepe, Vinuyan Sivakolunthu, Hae Lang Park, Yohan Chatelain, Tristan Glatard

Abstract: This paper investigates the numerical uncertainty of Convolutional Neural Networks (CNNs) inference for structural brain MRI analysis. It applies Random Rounding -- a stochastic arithmetic technique -- to CNN models employed in non-linear registration (SynthMorph) and whole-brain segmentation (FastSurfer), and compares the resulting numerical uncertainty to the one measured in a reference image-pr… ▽ More This paper investigates the numerical uncertainty of Convolutional Neural Networks (CNNs) inference for structural brain MRI analysis. It applies Random Rounding -- a stochastic arithmetic technique -- to CNN models employed in non-linear registration (SynthMorph) and whole-brain segmentation (FastSurfer), and compares the resulting numerical uncertainty to the one measured in a reference image-processing pipeline (FreeSurfer recon-all). Results obtained on 32 representative subjects show that CNN predictions are substantially more accurate numerically than traditional image-processing results (non-linear registration: 19 vs 13 significant bits on average; whole-brain segmentation: 0.99 vs 0.92 Sørensen-Dice score on average), which suggests a better reproducibility of CNN results across execution environments. △ Less

Submitted 2 August, 2023; originally announced August 2023.

arXiv:2307.01373 [pdf, other]

A numerical variability approach to results stability tests and its application to neuroimaging

Authors: Yohan Chatelain, Loïc Tetrel, Christopher J. Markiewicz, Mathias Goncalves, Gregory Kiar, Oscar Esteban, Pierre Bellec, Tristan Glatard

Abstract: Ensuring the long-term reproducibility of data analyses requires results stability tests to verify that analysis results remain within acceptable variation bounds despite inevitable software updates and hardware evolutions. This paper introduces a numerical variability approach for results stability tests, which determines acceptable variation bounds using random rounding of floating-point calcula… ▽ More Ensuring the long-term reproducibility of data analyses requires results stability tests to verify that analysis results remain within acceptable variation bounds despite inevitable software updates and hardware evolutions. This paper introduces a numerical variability approach for results stability tests, which determines acceptable variation bounds using random rounding of floating-point calculations. By applying the resulting stability test to \fmriprep, a widely-used neuroimaging tool, we show that the test is sensitive enough to detect subtle updates in image processing methods while remaining specific enough to accept numerical variations within a reference version of the application. This result contributes to enhancing the reliability and reproducibility of data analyses by providing a robust and flexible method for stability testing. △ Less

Submitted 10 July, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

ACM Class: D.2.5

arXiv:2212.06361 [pdf, ps, other]

doi 10.1371/journal.pone.0296725

Numerical Stability of DeepGOPlus Inference

Authors: Inés Gonzalez Pepe, Yohan Chatelain, Gregory Kiar, Tristan Glatard

Abstract: Convolutional neural networks (CNNs) are currently among the most widely-used deep neural network (DNN) architectures available and achieve state-of-the-art performance for many problems. Originally applied to computer vision tasks, CNNs work well with any data with a spatial relationship, besides images, and have been applied to different fields. However, recent works have highlighted numerical s… ▽ More Convolutional neural networks (CNNs) are currently among the most widely-used deep neural network (DNN) architectures available and achieve state-of-the-art performance for many problems. Originally applied to computer vision tasks, CNNs work well with any data with a spatial relationship, besides images, and have been applied to different fields. However, recent works have highlighted numerical stability challenges in DNNs, which also relates to their known sensitivity to noise injection. These challenges can jeopardise their performance and reliability. This paper investigates DeepGOPlus, a CNN that predicts protein function. DeepGOPlus has achieved state-of-the-art performance and can successfully take advantage and annotate the abounding protein sequences emerging in proteomics. We determine the numerical stability of the model's inference stage by quantifying the numerical uncertainty resulting from perturbations of the underlying floating-point data. In addition, we explore the opportunity to use reduced-precision floating point formats for DeepGOPlus inference, to reduce memory consumption and latency. This is achieved by instrumenting DeepGOPlus' execution using Monte Carlo Arithmetic, a technique that experimentally quantifies floating point operation errors and VPREC, a tool that emulates results with customizable floating point precision formats. Focus is placed on the inference stage as it is the primary deliverable of the DeepGOPlus model, widely applicable across different environments. All in all, our results show that although the DeepGOPlus CNN is very stable numerically, it can only be selectively implemented with lower-precision floating-point formats. We conclude that predictions obtained from the pre-trained DeepGOPlus model are very reliable numerically, and use existing floating-point formats efficiently. △ Less

Submitted 28 February, 2024; v1 submitted 12 December, 2022; originally announced December 2022.

Comments: 17 pages, 5 figures, 4 tables with 3 figures, 2 tables in Appendix

Journal ref: Vol 19, no. 1 (2024): e0296725

arXiv:2210.05704 [pdf, other]

Dynamic Ensemble Size Adjustment for Memory Constrained Mondrian Forest

Authors: Martin Khannouz, Tristan Glatard

Abstract: Supervised learning algorithms generally assume the availability of enough memory to store data models during the training and test phases. However, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. Such memory constraints impact the model behavior and assumptions. In this paper,… ▽ More Supervised learning algorithms generally assume the availability of enough memory to store data models during the training and test phases. However, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. Such memory constraints impact the model behavior and assumptions. In this paper, we show that under memory constraints, increasing the size of a tree-based ensemble classifier can worsen its performance. In particular, we experimentally show the existence of an optimal ensemble size for a memory-bounded Mondrian forest on data streams and we design an algorithm to guide the forest toward that optimal number by using an estimation of overfitting. We tested different variations for this algorithm on a variety of real and simulated datasets, and we conclude that our method can achieve up to 95% of the performance of an optimally-sized Mondrian forest for stable datasets, and can even outperform it for datasets with concept drifts. All our methods are implemented in the OrpailleCC open-source library and are ready to be used on embedded systems and connected objects. △ Less

Submitted 11 October, 2022; originally announced October 2022.

Comments: arXiv admin note: text overlap with arXiv:2205.07871

arXiv:2207.01737 [pdf, other]

Sea: A lightweight data-placement library for Big Data scientific computing

Authors: Valérie Hayot-Sasson, Mathieu Dugré, Tristan Glatard

Abstract: The recent influx of open scientific data has contributed to the transitioning of scientific computing from compute intensive to data intensive. Whereas many Big Data frameworks exist that minimize the cost of data transfers, few scientific applications integrate these frameworks or adopt data-placement strategies to mitigate the costs. Scientific applications commonly rely on well-established com… ▽ More The recent influx of open scientific data has contributed to the transitioning of scientific computing from compute intensive to data intensive. Whereas many Big Data frameworks exist that minimize the cost of data transfers, few scientific applications integrate these frameworks or adopt data-placement strategies to mitigate the costs. Scientific applications commonly rely on well-established command-line tools that would require complete reinstrumentation in order to incorporate existing frameworks. We developed Sea as a means to enable data-placement strategies for scientific applications executing on HPC clusters without the need to reinstrument workflows. Sea leverages GNU C library interception to intercept POSIX-compliant file system calls made by the applications. We designed a performance model and evaluated the performance of Sea on a synthetic data-intensive application processing a representative neuroimaging dataset (the Big Brain). Our results demonstrate that Sea significantly improves performance, up to a factor of 3$\times$. △ Less

Submitted 4 July, 2022; originally announced July 2022.

arXiv:2205.07871 [pdf, other]

Mondrian Forest for Data Stream Classification Under Memory Constraints

Authors: Martin Khannouz, Tristan Glatard

Abstract: Supervised learning algorithms generally assume the availability of enough memory to store their data model during the training and test phases. However, in the Internet of Things, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. In this paper, we adapt the online Mondrian forest… ▽ More Supervised learning algorithms generally assume the availability of enough memory to store their data model during the training and test phases. However, in the Internet of Things, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. In this paper, we adapt the online Mondrian forest classification algorithm to work with memory constraints on data streams. In particular, we design five out-of-memory strategies to update Mondrian trees with new data points when the memory limit is reached. Moreover, we design trimming mechanisms to make Mondrian trees more robust to concept drifts under memory constraints. We evaluate our algorithms on a variety of real and simulated datasets, and we conclude with recommendations on their use in different situations: the Extend Node strategy appears as the best out-of-memory strategy in all configurations, whereas different trimming mechanisms should be adopted depending on whether a concept drift is expected. All our methods are implemented in the OrpailleCC open-source library and are ready to be used on embedded systems and connected objects. △ Less

Submitted 4 August, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

arXiv:2112.11508 [pdf, other]

PyTracer: Automatically profiling numerical instabilities in Python

Authors: Yohan Chatelain, Nigel Yong, Gregory Kiar, Tristan Glatard

Abstract: Numerical stability is a crucial requirement of reliable scientific computing. However, despite the pervasiveness of Python in data science, analyzing large Python programs remains challenging due to the lack of scalable numerical analysis tools available for this language. To fill this gap, we developed PyTracer, a profiler to quantify numerical instability in Python applications. PyTracer transp… ▽ More Numerical stability is a crucial requirement of reliable scientific computing. However, despite the pervasiveness of Python in data science, analyzing large Python programs remains challenging due to the lack of scalable numerical analysis tools available for this language. To fill this gap, we developed PyTracer, a profiler to quantify numerical instability in Python applications. PyTracer transparently instruments Python code to produce numerical traces and visualize them interactively in a Plotly dashboard. We designed PyTracer to be agnostic to numerical noise model, allowing for tool evaluation through Monte-Carlo Arithmetic, random rounding, random data perturbation, or structured noise for a particular application. We illustrate PyTracer's capabilities by testing the numerical stability of key functions in both SciPy and Scikit-learn, two dominant Python libraries for mathematical modeling. Through these evaluations, we demonstrate PyTracer as a scalable, automatic, and generic framework for numerical profiling in Python. △ Less

Submitted 8 February, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2109.09649 [pdf, other]

Data Augmentation Through Monte Carlo Arithmetic Leads to More Generalizable Classification in Connectomics

Authors: Gregory Kiar, Yohan Chatelain, Ali Salari, Alan C. Evans, Tristan Glatard

Abstract: Machine learning models are commonly applied to human brain imaging datasets in an effort to associate function or structure with behaviour, health, or other individual phenotypes. Such models often rely on low-dimensional maps generated by complex processing pipelines. However, the numerical instabilities inherent to pipelines limit the fidelity of these maps and introduce computational bias. Mon… ▽ More Machine learning models are commonly applied to human brain imaging datasets in an effort to associate function or structure with behaviour, health, or other individual phenotypes. Such models often rely on low-dimensional maps generated by complex processing pipelines. However, the numerical instabilities inherent to pipelines limit the fidelity of these maps and introduce computational bias. Monte Carlo Arithmetic, a technique for introducing controlled amounts of numerical noise, was used to perturb a structural connectome estimation pipeline, ultimately producing a range of plausible networks for each sample. The variability in the perturbed networks was captured in an augmented dataset, which was then used for an age classification task. We found that resampling brain networks across a series of such numerically perturbed outcomes led to improved performance in all tested classifiers, preprocessing strategies, and dimensionality reduction techniques. Importantly, we find that this benefit does not hinge on a large number of perturbations, suggesting that even minimally perturbing a dataset adds meaningful variance which can be captured in the subsequently designed models. △ Less

Submitted 20 September, 2021; originally announced September 2021.

arXiv:2108.10496 [pdf, other]

The benefits of prefetching for large-scale cloud-based neuroimaging analysis workflows

Authors: Valerie Hayot-Sasson, Tristan Glatard, Ariel Rokem

Abstract: To support the growing demands of neuroscience applications, researchers are transitioning to cloud computing for its scalable, robust and elastic infrastructure. Nevertheless, large datasets residing in object stores may result in significant data transfer overheads during workflow execution. Prefetching, a method to mitigate the cost of reading in mixed workloads, masks data transfer costs withi… ▽ More To support the growing demands of neuroscience applications, researchers are transitioning to cloud computing for its scalable, robust and elastic infrastructure. Nevertheless, large datasets residing in object stores may result in significant data transfer overheads during workflow execution. Prefetching, a method to mitigate the cost of reading in mixed workloads, masks data transfer costs within processing time of prior tasks. We present an implementation of "Rolling Prefetch", a Python library that implements a particular form of prefetching from AWS S3 object store, and we quantify its benefits. Rolling Prefetch extends S3Fs, a Python library exposing AWS S3 functionality via a file object, to add prefetch capabilities. In measured analysis performance of a 500 GB brain connectivity dataset stored on S3, we found that prefetching provides significant speed-ups of up to 1.86x, even in applications consisting entirely of data loading. The observed speed-up values are consistent with our theoretical analysis. Our results demonstrate the usefulness of prefetching for scientific data processing on cloud infrastructures and provide an implementation applicable to various application domains. △ Less

Submitted 23 August, 2021; originally announced August 2021.

arXiv:2108.09275 [pdf, other]

A Recommender System for Scientific Datasets and Analysis Pipelines

Authors: Mandana Mazaheri, Gregory Kiar, Tristan Glatard

Abstract: Scientific datasets and analysis pipelines are increasingly being shared publicly in the interest of open science. However, mechanisms are lacking to reliably identify which pipelines and datasets can appropriately be used together. Given the increasing number of high-quality public datasets and pipelines, this lack of clear compatibility threatens the findability and reusability of these resource… ▽ More Scientific datasets and analysis pipelines are increasingly being shared publicly in the interest of open science. However, mechanisms are lacking to reliably identify which pipelines and datasets can appropriately be used together. Given the increasing number of high-quality public datasets and pipelines, this lack of clear compatibility threatens the findability and reusability of these resources. We investigate the feasibility of a collaborative filtering system to recommend pipelines and datasets based on provenance records from previous executions. We evaluate our system using datasets and pipelines extracted from the Canadian Open Neuroscience Platform, a national initiative for open neuroscience. The recommendations provided by our system (AUC$=0.83$) are significantly better than chance and outperform recommendations made by domain experts using their previous knowledge as well as pipeline and dataset descriptions (AUC$=0.63$). In particular, domain experts often neglect low-level technical aspects of a pipeline-dataset interaction, such as the level of pre-processing, which are captured by a provenance-based system. We conclude that provenance-based pipeline and dataset recommenders are feasible and beneficial to the sharing and usage of open-science resources. Future work will focus on the collection of more comprehensive provenance traces, and on deploying the system in production. △ Less

Submitted 20 August, 2021; originally announced August 2021.

arXiv:2108.03129 [pdf, other]

Accurate simulation of operating system updates in neuroimaging using Monte-Carlo arithmetic

Authors: Ali Salari, Yohan Chatelain, Gregory Kiar, Tristan Glatard

Abstract: Operating system (OS) updates introduce numerical perturbations that impact the reproducibility of computational pipelines. In neuroimaging, this has important practical implications on the validity of computational results, particularly when obtained in systems such as high-performance computing clusters where the experimenter does not control software updates. We present a framework to reproduce… ▽ More Operating system (OS) updates introduce numerical perturbations that impact the reproducibility of computational pipelines. In neuroimaging, this has important practical implications on the validity of computational results, particularly when obtained in systems such as high-performance computing clusters where the experimenter does not control software updates. We present a framework to reproduce the variability induced by OS updates in controlled conditions. We hypothesize that OS updates impact computational pipelines mainly through numerical perturbations originating in mathematical libraries, which we simulate using Monte-Carlo arithmetic in a framework called "fuzzy libmath" (FL). We applied this methodology to pre-processing pipelines of the Human Connectome Project, a flagship open-data project in neuroimaging. We found that FL-perturbed pipelines accurately reproduce the variability induced by OS updates and that this similarity is only mildly dependent on simulation parameters. Importantly, we also found between-subject differences were preserved in both cases, though the between-run variability was of comparable magnitude for both FL and OS perturbations. We found the numerical precision in the HCP pre-processed images to be relatively low, with less than 8 significant bits among the 24 available, which motivates further investigation of the numerical stability of components in the tested pipeline. Overall, our results establish that FL accurately simulates results variability due to OS updates, and is a practical framework to quantify numerical uncertainty in neuroimaging. △ Less

Submitted 6 August, 2021; originally announced August 2021.

Comments: 10 pages, 4 figures, 19 references

arXiv:2106.14340 [pdf, other]

Reducing numerical precision preserves classification accuracy in Mondrian Forests

Authors: Marc Vicuna, Martin Khannouz, Gregory Kiar, Yohan Chatelain, Tristan Glatard

Abstract: Mondrian Forests are a powerful data stream classification method, but their large memory footprint makes them ill-suited for low-resource platforms such as connected objects. We explored using reduced-precision floating-point representations to lower memory consumption and evaluated its effect on classification performance. We applied the Mondrian Forest implementation provided by OrpailleCC, a C… ▽ More Mondrian Forests are a powerful data stream classification method, but their large memory footprint makes them ill-suited for low-resource platforms such as connected objects. We explored using reduced-precision floating-point representations to lower memory consumption and evaluated its effect on classification performance. We applied the Mondrian Forest implementation provided by OrpailleCC, a C++ collection of data stream algorithms, to two canonical datasets in human activity recognition: Recofit and Banos \emph{et al}. Results show that the precision of floating-point values used by tree nodes can be reduced from 64 bits to 8 bits with no significant difference in F1 score. In some cases, reduced precision was shown to improve classification performance, presumably due to its regularization effect. We conclude that numerical precision is a relevant hyperparameter in the Mondrian Forest, and that commonly-used double precision values may not be necessary for optimal performance. Future work will evaluate the generalizability of these findings to other data stream classifiers. △ Less

Submitted 27 June, 2021; originally announced June 2021.

Comments: 6 pages, 3 tables, 2 figures. Keywords: numerical precision, memory footprint, Mondrian Forests, human activity, recognition, data streams, supervised classification, floating-point representation

arXiv:2101.01335 [pdf, other]

Modeling the Linux page cache for accurate simulation of data-intensive applications

Authors: Hoang-Dung Do, Valerie Hayot-Sasson, Rafael Ferreira da Silva, Christopher Steele, Henri Casanova, Tristan Glatard

Abstract: The emergence of Big Data in recent years has resulted in a growing need for efficient data processing solutions. While infrastructures with sufficient compute power are available, the I/O bottleneck remains. The Linux page cache is an efficient approach to reduce I/O overheads, but few experimental studies of its interactions with Big Data applications exist, partly due to limitations of real-wor… ▽ More The emergence of Big Data in recent years has resulted in a growing need for efficient data processing solutions. While infrastructures with sufficient compute power are available, the I/O bottleneck remains. The Linux page cache is an efficient approach to reduce I/O overheads, but few experimental studies of its interactions with Big Data applications exist, partly due to limitations of real-world experiments. Simulation is a popular approach to address these issues, however, existing simulation frameworks do not simulate page caching fully, or even at all. As a result, simulation-based performance studies of data-intensive applications lead to inaccurate results. In this paper, we propose an I/O simulation model that includes the key features of the Linux page cache. We have implemented this model as part of the WRENCH workflow simulation framework, which itself builds on the popular SimGrid distributed systems simulation framework. Our model and its implementation enable the simulation of both single-threaded and multithreaded applications, and of both writeback and writethrough caches for local or network-based filesystems. We evaluate the accuracy of our model in different conditions, including sequential and concurrent applications, as well as local and remote I/Os. We find that our page cache model reduces the simulation error by up to an order of magnitude when compared to state-of-the-art, cacheless simulations. △ Less

Submitted 4 January, 2021; originally announced January 2021.

Comments: 10 pages, 8 figures, CCGrid

arXiv:2010.13970 [pdf, other]

An Analysis of Security Vulnerabilities in Container Images for Scientific Data Analysis

Authors: Bhupinder Kaur, Mathieu Dugré, Aiman Hanna, Tristan Glatard

Abstract: Software containers greatly facilitate the deployment and reproducibility of scientific data analyses in various platforms. However, container images often contain outdated or unnecessary software packages, which increases the number of security vulnerabilities in the images, widens the attack surface in the container host, and creates substantial security risks for computing infrastructures at la… ▽ More Software containers greatly facilitate the deployment and reproducibility of scientific data analyses in various platforms. However, container images often contain outdated or unnecessary software packages, which increases the number of security vulnerabilities in the images, widens the attack surface in the container host, and creates substantial security risks for computing infrastructures at large. This paper presents a vulnerability analysis of container images for scientific data analysis. We compare results obtained with four vulnerability scanners, focusing on the use case of neuroscience data analysis, and quantifying the effect of image update and minification on the number of vulnerabilities. We find that container images used for neuroscience data analysis contain hundreds of vulnerabilities, that software updates remove about two thirds of these vulnerabilities, and that removing unused packages is also effective. We conclude with recommendations on how to build container images with a reduced amount of vulnerabilities. △ Less

Submitted 17 March, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

arXiv:2008.11880 [pdf, other]

A benchmark of data stream classification for human activity recognition on connected objects

Authors: Martin Khannouz, Tristan Glatard

Abstract: This paper evaluates data stream classifiers from the perspective of connected devices, focusing on the use case of HAR. We measure both classification performance and resource consumption (runtime, memory, and power) of five usual stream classification algorithms, implemented in a consistent library, and applied to two real human activity datasets and to three synthetic datasets. Regarding classi… ▽ More This paper evaluates data stream classifiers from the perspective of connected devices, focusing on the use case of HAR. We measure both classification performance and resource consumption (runtime, memory, and power) of five usual stream classification algorithms, implemented in a consistent library, and applied to two real human activity datasets and to three synthetic datasets. Regarding classification performance, results show an overall superiority of the HT, the MF, and the NB classifiers over the FNN and the Micro Cluster Nearest Neighbor (MCNN) classifiers on 4 datasets out of 6, including the real ones. In addition, the HT, and to some extent MCNN, are the only classifiers that can recover from a concept drift. Overall, the three leading classifiers still perform substantially lower than an offline classifier on the real datasets. Regarding resource consumption, the HT and the MF are the most memory intensive and have the longest runtime, however, no difference in power consumption is found between classifiers. We conclude that stream learning for HAR on connected objects is challenged by two factors which could lead to interesting future work: a high memory consumption and low F1 scores overall. △ Less

Submitted 26 August, 2020; originally announced August 2020.

Comments: 8 pages, 9 figures, for a journal

arXiv:2007.09167 [pdf, other]

doi 10.1109/BigData52589.2021.9671967

Can we Estimate Truck Accident Risk from Telemetric Data using Machine Learning?

Authors: Antoine Hébert, Ian Marineau, Gilles Gervais, Tristan Glatard, Brigitte Jaumard

Abstract: Road accidents have a high societal cost that could be reduced through improved risk predictions using machine learning. This study investigates whether telemetric data collected on long-distance trucks can be used to predict the risk of accidents associated with a driver. We use a dataset provided by a truck transportation company containing the driving data of 1,141 drivers for 18 months. We e… ▽ More Road accidents have a high societal cost that could be reduced through improved risk predictions using machine learning. This study investigates whether telemetric data collected on long-distance trucks can be used to predict the risk of accidents associated with a driver. We use a dataset provided by a truck transportation company containing the driving data of 1,141 drivers for 18 months. We evaluate two different machine learning approaches to perform this task. In the first approach, features are extracted from the time series data using the FRESH algorithm and then used to estimate the risk using Random Forests. In the second approach, we use a convolutional neural network to directly estimate the risk from the time-series data. We find that neither approach is able to successfully estimate the risk of accidents on this dataset, in spite of many methodological attempts. We discuss the difficulties of using telemetric data for the estimation of the risk of accidents that could explain this negative result. △ Less

Submitted 17 July, 2020; originally announced July 2020.

Journal ref: 2021 IEEE International Conference on Big Data, pp. 1827-1836

arXiv:2006.04684 [pdf]

File-based localization of numerical perturbations in data analysis pipelines

Authors: Ali Salari, Gregory Kiar, Lindsay Lewis, Alan C. Evans, Tristan Glatard

Abstract: Data analysis pipelines are known to be impacted by computational conditions, presumably due to the creation and propagation of numerical errors. While this process could play a major role in the current reproducibility crisis, the precise causes of such instabilities and the path along which they propagate in pipelines are unclear. We present Spot, a tool to identify which processes in a pipeline… ▽ More Data analysis pipelines are known to be impacted by computational conditions, presumably due to the creation and propagation of numerical errors. While this process could play a major role in the current reproducibility crisis, the precise causes of such instabilities and the path along which they propagate in pipelines are unclear. We present Spot, a tool to identify which processes in a pipeline create numerical differences when executed in different computational conditions. Spot leverages system-call interception through ReproZip to reconstruct and compare provenance graphs without pipeline instrumentation. By applying Spot to the structural pre-processing pipelines of the Human Connectome Project, we found that linear and non-linear registration are the cause of most numerical instabilities in these pipelines, which confirms previous findings. △ Less

Submitted 28 September, 2020; v1 submitted 3 June, 2020; originally announced June 2020.

Comments: 10 pages, 6 figures, 2 tables

arXiv:1912.11794 [pdf, other]

Performance benefits of Intel(R) OptaneTM DC persistent memory for the parallel processing of large neuroimaging data

Authors: Valerie Hayot-Sasson, Shawn T Brown, Tristan Glatard

Abstract: Open-access neuroimaging datasets have reached petabyte scale, and continue to grow. The ability to leverage the entirety of these datasets is limited to a restricted number of labs with both the capacity and infrastructure to process the data. Whereas Big Data engines have significantly reduced application performance penalties with respect to data movement, their applied strategies (e.g. data lo… ▽ More Open-access neuroimaging datasets have reached petabyte scale, and continue to grow. The ability to leverage the entirety of these datasets is limited to a restricted number of labs with both the capacity and infrastructure to process the data. Whereas Big Data engines have significantly reduced application performance penalties with respect to data movement, their applied strategies (e.g. data locality, in-memory computing and lazy evaluation) are not necessarily practical within neuroimaging workflows where intermediary results may need to be materialized to shared storage for post-processing analysis. In this paper we evaluate the performance advantage brought by Intel(R) OptaneTM DC persistent memory for the processing of large neuroimaging datasets using the two available configurations modes: Memory mode and App Direct mode. We employ a synthetic algorithm on the 76 GiB and 603 GiB BigBrain, as well as apply a standard neuroimaging application on the Consortium for Reliability and Reproducibility (CoRR) dataset using 25 and 96 parallel processes in both cases. Our results show that the performance of applications leveraging persistent memory is superior to that of other storage devices,with the exception of DRAM. This is the case in both Memory and App Direct mode and irrespective of the amount of data and parallelism. Furthermore, persistent memory in App Direct mode is believed to benefit from the use of DRAM as a cache for writing when output data is significantly smaller than available memory. We believe the use of persistent memory will be beneficial to both neuroimaging applications running on HPC or visualization of large, high-resolution images. △ Less

Submitted 26 December, 2019; originally announced December 2019.

arXiv:1908.10922 [pdf, other]

Comparing Perturbation Models for Evaluating Stability of Neuroimaging Pipelines

Authors: Gregory Kiar, Pablo de Oliveira Castro, Pierre Rioux, Eric Petit, Shawn T. Brown, Alan C. Evans, Tristan Glatard

Abstract: A lack of software reproducibility has become increasingly apparent in the last several years, calling into question the validity of scientific findings affected by published tools. Reproducibility issues may have numerous sources of error, including the underlying numerical stability of algorithms and implementations employed. Various forms of instability have been observed in neuroimaging, inclu… ▽ More A lack of software reproducibility has become increasingly apparent in the last several years, calling into question the validity of scientific findings affected by published tools. Reproducibility issues may have numerous sources of error, including the underlying numerical stability of algorithms and implementations employed. Various forms of instability have been observed in neuroimaging, including across operating system versions, minor noise injections, and implementation of theoretically equivalent algorithms. In this paper we explore the effect of various perturbation methods on a typical neuroimaging pipeline through the use of i) targeted noise injections, ii) Monte Carlo Arithmetic, and iii) varying operating systems to identify the quality and severity of their impact. The work presented here demonstrates that even low order computational models such as the connectome estimation pipeline that we used are susceptible to noise. This suggests that stability is a relevant axis upon which tools should be compared, developed, or improved, alongside more commonly considered axes such as accuracy/biological feasibility or performance. The heterogeneity observed across participants clearly illustrates that stability is a property of not just the data or tools independently, but their interaction. Characterization of stability should therefore be evaluated for specific analyses and performed on a representative set of subjects for consideration in subsequent statistical testing. Additionally, identifying how this relationship scales to higher-order models is an exciting next step which will be explored. Finally, the joint application of perturbation methods with post-processing approaches such as bagging or signal normalization may lead to the development of more numerically stable analyses while maintaining sensitivity to meaningful variation. △ Less

Submitted 22 April, 2020; v1 submitted 28 August, 2019; originally announced August 2019.

Comments: 9 pages, 5 figures, 1 table, paper published at IJHPCA

arXiv:1907.13030 [pdf, other]

A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines

Authors: Mathieu Dugré, Valérie Hayot-Sasson, Tristan Glatard

Abstract: In the past few years, neuroimaging has entered the Big Data era due to the joint increase in image resolution, data sharing, and study sizes. However, no particular Big Data engines have emerged in this field, and several alternatives remain available. We compare two popular Big Data engines with Python APIs, Apache Spark and Dask, for their runtime performance in processing neuroimaging pipeline… ▽ More In the past few years, neuroimaging has entered the Big Data era due to the joint increase in image resolution, data sharing, and study sizes. However, no particular Big Data engines have emerged in this field, and several alternatives remain available. We compare two popular Big Data engines with Python APIs, Apache Spark and Dask, for their runtime performance in processing neuroimaging pipelines. Our evaluation uses two synthetic pipelines processing the 81GB BigBrain image, and a real pipeline processing anatomical data from more than 1,000 subjects. We benchmark these pipelines using various combinations of task durations, data sizes, and numbers of workers, deployed on an 8-node (8 cores ea.) compute cluster in Compute Canada's Arbutus cloud. We evaluate PySpark's RDD API against Dask's Bag, Delayed and Futures. Results show that despite slight differences between Spark and Dask, both engines perform comparably. However, Dask pipelines risk being limited by Python's GIL depending on task type and cluster configuration. In all cases, the major limiting factor was data transfer. While either engine is suitable for neuroimaging pipelines, more effort needs to be placed in reducing data transfer time. △ Less

Submitted 5 October, 2019; v1 submitted 30 July, 2019; originally announced July 2019.

Comments: 10 pages, 15 figures, 1 tables. To appear in the proceeding of the 14th WORKS Workshop on Topics in Workflows in Support of Large-Scale Science, 17 November 2019, Denver, CO, USA

arXiv:1907.03047 [pdf, other]

A Conceptual Marketplace Model for IoT Generated Personal Data

Authors: Victor Molina, Marta Kersten-Oertel, Tristan Glatard

Abstract: We propose a decentralized conceptual marketplace model for IoT generated personal data. Our model is based on a thorough analysis of personal data in a marketplace context, with specific focus on the challenges presented by commercializing IoT generated personal data. Our model introduces a novel perspective on the commercialization of personal data for a marketplace context via risk evaluation a… ▽ More We propose a decentralized conceptual marketplace model for IoT generated personal data. Our model is based on a thorough analysis of personal data in a marketplace context, with specific focus on the challenges presented by commercializing IoT generated personal data. Our model introduces a novel perspective on the commercialization of personal data for a marketplace context via risk evaluation and a data licensing framework. We have designed our model to be centered around protecting the privacy and data rights of data generators through model components that effectively assess and modify transaction risks, and formalize transaction agreements by establishing rights of data use and access between buyer and seller. Our model could serve as a blueprint to inform the implementation of a personal data marketplace that respects privacy and ownership. △ Less

Submitted 5 July, 2019; originally announced July 2019.

arXiv:1905.12720 [pdf, other]

Evaluation of pilot jobs for Apache Spark applications on HPC clusters

Authors: Valerie Hayot-Sasson, Tristan Glatard

Abstract: Big Data has become prominent throughout many scientific fields and, as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as… ▽ More Big Data has become prominent throughout many scientific fields and, as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as Hadoop and Apache Spark were primarily designed to be executed on dedicated commodity infrastructures. This paper evaluates the benefits of pilot jobs over traditional batch submission for Apache Spark on HPC clusters. Surprisingly, our results show that the speed-up provided by pilot jobs over batch scheduling is moderate to inexistent (0.98 on average) despite the presence of long queuing times. In addition, pilot jobs provide an extra layer of scheduling that complexifies debugging and deployment. We conclude that traditional batch scheduling should remain the default strategy to deploy Apache Spark applications on HPC clusters. △ Less

Submitted 29 May, 2019; originally announced May 2019.

arXiv:1905.08770 [pdf, other]

doi 10.1109/BigData47090.2019.9006009

High-Resolution Road Vehicle Collision Prediction for the City of Montreal

Authors: Antoine Hébert, Timothée Guédon, Tristan Glatard, Brigitte Jaumard

Abstract: Road accidents are an important issue of our modern societies, responsible for millions of deaths and injuries every year in the world. In Quebec only, in 2018, road accidents are responsible for 359 deaths and 33 thousands of injuries. In this paper, we show how one can leverage open datasets of a city like Montreal, Canada, to create high-resolution accident prediction models, using big data ana… ▽ More Road accidents are an important issue of our modern societies, responsible for millions of deaths and injuries every year in the world. In Quebec only, in 2018, road accidents are responsible for 359 deaths and 33 thousands of injuries. In this paper, we show how one can leverage open datasets of a city like Montreal, Canada, to create high-resolution accident prediction models, using big data analytics. Compared to other studies in road accident prediction, we have a much higher prediction resolution, i.e., our models predict the occurrence of an accident within an hour, on road segments defined by intersections. Such models could be used in the context of road accident prevention, but also to identify key factors that can lead to a road accident, and consequently, help elaborate new policies. We tested various machine learning methods to deal with the severe class imbalance inherent to accident prediction problems. In particular, we implemented the Balanced Random Forest algorithm, a variant of the Random Forest machine learning algorithm in Apache Spark. Interestingly, we found that in our case, Balanced Random Forest does not perform significantly better than Random Forest. Experimental results show that 85% of road vehicle collisions are detected by our model with a false positive rate of 13%. The examples identified as positive are likely to correspond to high-risk situations. In addition, we identify the most important predictors of vehicle collisions for the area of Montreal: the count of accidents on the same road segment during previous years, the temperature, the day of the year, the hour and the visibility. △ Less

Submitted 11 November, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

Journal ref: 2019 IEEE International Conference on Big Data, pp. 1804-1813

arXiv:1904.02666 [pdf, other]

Subject Cross Validation in Human Activity Recognition

Authors: Akbar Dehghani, Tristan Glatard, Emad Shihab

Abstract: K-fold Cross Validation is commonly used to evaluate classifiers and tune their hyperparameters. However, it assumes that data points are Independent and Identically Distributed (i.i.d.) so that samples used in the training and test sets can be selected randomly and uniformly. In Human Activity Recognition datasets, we note that the samples produced by the same subjects are likely to be correlated… ▽ More K-fold Cross Validation is commonly used to evaluate classifiers and tune their hyperparameters. However, it assumes that data points are Independent and Identically Distributed (i.i.d.) so that samples used in the training and test sets can be selected randomly and uniformly. In Human Activity Recognition datasets, we note that the samples produced by the same subjects are likely to be correlated due to diverse factors. Hence, k-fold cross validation may overestimate the performance of activity recognizers, in particular when overlap** sliding windows are used. In this paper, we investigate the effect of Subject Cross Validation on the performance of Human Activity Recognition, both with non-overlap** and with overlap** sliding windows. Results show that k-fold cross validation artificially increases the performance of recognizers by about 10%, and even by 16% when overlap** windows are used. In addition, we do not observe any performance gain from the use of overlap** windows. We conclude that Human Activity Recognition systems should be evaluated by Subject Cross Validation, and that overlap** windows are not worth their extra computational cost. △ Less

Submitted 9 April, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

arXiv:1812.06492 [pdf, other]

Performance Evaluation of Big Data Processing Strategies for Neuroimaging

Authors: Valérie Hayot-Sasson, Shawn T Brown, Tristan Glatard

Abstract: Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open-science and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality and lazy evaluation) on typical neuroimaging use cases, repres… ▽ More Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open-science and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality and lazy evaluation) on typical neuroimaging use cases, represented by the BigBrain dataset. We contrast these various strategies using Apache Spark and Nipype as our representative Big Data and neuroimaging processing engines, on Dell EMC's Top-500 cluster. Big Data thresholds were modelled by comparing the data-write rate of the application to the filesystem bandwidth and number of concurrent processes. This model acknowledges the fact that page caching provided by the Linux kernel is critical to the performance of Big Data applications. Results show that in-memory computing alone speeds-up executions by a factor of up to 1.6, whereas when combined with data locality, this factor reaches 5.3. Lazy evaluation strategies were found to increase the likelihood of cache hits, further improving processing time. Such important speed-up values are likely to be observed on typical image processing operations performed on images of size larger than 75GB. A ballpark speculation from our model showed that in-memory computing alone will not speed-up current functional MRI analyses unless coupled with data locality and processing around 280 subjects concurrently. Furthermore, we observe that emulating in-memory computing using in-memory file systems (tmpfs) does not reach the performance of an in-memory engine, presumably due to swap** to disk and the lack of data cleanup. We conclude that Big Data processing strategies are worth develo** for neuroimaging applications. △ Less

Submitted 2 April, 2019; v1 submitted 16 December, 2018; originally announced December 2018.

arXiv:1811.09930 [pdf, other]

A multi-dimensional extension of the Lightweight Temporal Compression method

Authors: Bo Li, Omid Sarbishei, Hosein Nourani, Tristan Glatard

Abstract: Lightweight Temporal Compression (LTC) is among the lossy stream compression methods that provide the highest compression rate for the lowest CPU and memory consumption. As such, it is well suited to compress data streams in energy-constrained systems such as connected objects. The current formulation of LTC, however, is one-dimensional while data acquired in connected objects is often multi-dimen… ▽ More Lightweight Temporal Compression (LTC) is among the lossy stream compression methods that provide the highest compression rate for the lowest CPU and memory consumption. As such, it is well suited to compress data streams in energy-constrained systems such as connected objects. The current formulation of LTC, however, is one-dimensional while data acquired in connected objects is often multi-dimensional: for instance, accelerometers and gyroscopes usually measure variables along 3 directions. In this paper, we investigate the extension of LTC to higher dimensions. First, we provide a formulation of the algorithm in an arbitrary vectorial space of dimension n. Then, we implement the algorithm for the infinity and Euclidean norms, in spaces of dimension 2D+t and 3D+t. We evaluate our implementation on 3D acceleration streams of human activities. Results show that the 3D implementation of LTC can save up to 20% in energy consumption for low-paced activities, with a memory usage of about 100 B. △ Less

Submitted 24 November, 2018; originally announced November 2018.

arXiv:1810.09944 [pdf, other]

Data models for service failure prediction in supply-chain networks

Authors: Monika Sharma, Tristan Glatard, Eric Gelinas, Mariam Tagmouti, Brigitte Jaumard

Abstract: We aim to predict and explain service failures in supply-chain networks, more precisely among last-mile pickup and delivery services to customers. We analyze a dataset of 500,000 services using (1) supervised classification with Random Forests, and (2) Association Rules. Our classifier reaches an average sensitivity of 0.7 and an average specificity of 0.7 for the 5 studied types of failure. Assoc… ▽ More We aim to predict and explain service failures in supply-chain networks, more precisely among last-mile pickup and delivery services to customers. We analyze a dataset of 500,000 services using (1) supervised classification with Random Forests, and (2) Association Rules. Our classifier reaches an average sensitivity of 0.7 and an average specificity of 0.7 for the 5 studied types of failure. Association Rules reassert the importance of confirmation calls to prevent failures due to customers not at home, show the importance of the time window size, slack time, and geographical location of the customer for the other failure types, and highlight the effect of the retailer company on several failure types. To reduce the occurrence of service failures, our data models could be coupled to optimizers, or used to define counter-measures to be taken by human dispatchers. △ Less

Submitted 20 October, 2018; originally announced October 2018.

arXiv:1809.10139 [pdf, other]

Predicting computational reproducibility of data analysis pipelines in large population studies using collaborative filtering

Authors: Soudabeh Barghi, Lalet Scaria, Ali Salari, Tristan Glatard

Abstract: Evaluating the computational reproducibility of data analysis pipelines has become a critical issue. It is, however, a cumbersome process for analyses that involve data from large populations of subjects, due to their computational and storage requirements. We present a method to predict the computational reproducibility of data analysis pipelines in large population studies. We formulate the prob… ▽ More Evaluating the computational reproducibility of data analysis pipelines has become a critical issue. It is, however, a cumbersome process for analyses that involve data from large populations of subjects, due to their computational and storage requirements. We present a method to predict the computational reproducibility of data analysis pipelines in large population studies. We formulate the problem as a collaborative filtering process, with constraints on the construction of the training set. We propose 6 different strategies to build the training set, which we evaluate on 2 datasets, a synthetic one modeling a population with a growing number of subject types, and a real one obtained with neuroinformatics pipelines. Results show that one sampling method, "Random File Numbers (Uniform)" is able to predict computational reproducibility with a good accuracy. We also analyze the relevance of including file and subject biases in the collaborative filtering model. We conclude that the proposed method is able to speedup reproducibility evaluations substantially, with a reduced accuracy loss. △ Less

Submitted 26 September, 2018; originally announced September 2018.

arXiv:1809.07693 [pdf]

A Serverless Tool for Platform Agnostic Computational Experiment Management

Authors: Gregory Kiar, Shawn T Brown, Tristan Glatard, Alan C Evans

Abstract: Neuroscience has been carried into the domain of big data and high performance computing (HPC) on the backs of initiatives in data collection and an increasingly compute-intensive tools. While managing HPC experiments requires considerable technical acumen, platforms and standards have been developed to ease this burden on scientists. While web-portals make resources widely accessible, data organi… ▽ More Neuroscience has been carried into the domain of big data and high performance computing (HPC) on the backs of initiatives in data collection and an increasingly compute-intensive tools. While managing HPC experiments requires considerable technical acumen, platforms and standards have been developed to ease this burden on scientists. While web-portals make resources widely accessible, data organizations such as the Brain Imaging Data Structure and tool description languages such as Boutiques provide researchers with a foothold to tackle these problems using their own datasets, pipelines, and environments. While these standards lower the barrier to adoption of HPC and cloud systems for neuroscience applications, they still require the consolidation of disparate domain-specific knowledge. We present Clowdr, a lightweight tool to launch experiments on HPC systems and clouds, record rich execution records, and enable the accessible sharing of experimental summaries and results. Clowdr uniquely sits between web platforms and bare-metal applications for experiment management by preserving the flexibility of do-it-yourself solutions while providing a low barrier for develo**, deploying and disseminating neuroscientific analysis. △ Less

Submitted 2 September, 2018; originally announced September 2018.

Comments: 12 pages, 3 figures, 1 tool

arXiv:1803.03367 [pdf, other]

NeuroStorm: Accelerating Brain Science Discovery in the Cloud

Authors: Gregory Kiar, Robert J. Anderson, Alex Baden, Alexandra Badea, Eric W. Bridgeford, Andrew Champion, Vikram Chandrashekhar, Forrest Collman, Brandon Duderstadt, Alan C. Evans, Florian Engert, Benjamin Falk, Tristan Glatard, William R. Gray Roncal, David N. Kennedy, Jeremy Maitin-Shepard, Ryan A. Marren, Onyeka Nnaemeka, Eric Perlman, Sharmishtaas Seshamani, Eric T. Trautman, Daniel J. Tward, Pedro Antonio Valdés-Sosa, Qing Wang, Michael I. Miller , et al. (2 additional authors not shown)

Abstract: Neuroscientists are now able to acquire data at staggering rates across spatiotemporal scales. However, our ability to capitalize on existing datasets, tools, and intellectual capacities is hampered by technical challenges. The key barriers to accelerating scientific discovery correspond to the FAIR data principles: findability, global access to data, software interoperability, and reproducibility… ▽ More Neuroscientists are now able to acquire data at staggering rates across spatiotemporal scales. However, our ability to capitalize on existing datasets, tools, and intellectual capacities is hampered by technical challenges. The key barriers to accelerating scientific discovery correspond to the FAIR data principles: findability, global access to data, software interoperability, and reproducibility/re-usability. We conducted a hackathon dedicated to making strides in those steps. This manuscript is a technical report summarizing these achievements, and we hope serves as an example of the effectiveness of focused, deliberate hackathons towards the advancement of our quickly-evolving field. △ Less

Submitted 20 March, 2018; v1 submitted 8 March, 2018; originally announced March 2018.

Comments: 10 pages, 4 figures, hackathon report

arXiv:1711.09713 [pdf, other]

Boutiques: a flexible framework for automated application integration in computing platforms

Authors: Tristan Glatard, Gregory Kiar, Tristan Aumentado-Armstrong, Natacha Beck, Pierre Bellec, Rémi Bernard, Axel Bonnet, Sorina Camarasu-Pop, Frédéric Cervenansky, Samir Das, Rafael Ferreira da Silva, Guillaume Flandin, Pascal Girard, Krzysztof J. Gorgolewski, Charles R. G. Guttmann, Valérie Hayot-Sasson, Pierre-Olivier Quirion, Pierre Rioux, Marc-Eienne Rousseau, Alan C. Evans

Abstract: We present Boutiques, a system to automatically publish, integrate and execute applications across computational platforms. Boutiques applications are installed through software containers described in a rich and flexible JSON language. A set of core tools facilitate the construction, validation, import, execution, and publishing of applications. Boutiques is currently supported by several distinc… ▽ More We present Boutiques, a system to automatically publish, integrate and execute applications across computational platforms. Boutiques applications are installed through software containers described in a rich and flexible JSON language. A set of core tools facilitate the construction, validation, import, execution, and publishing of applications. Boutiques is currently supported by several distinct virtual research platforms, and it has been used to describe dozens of applications in the neuroinformatics domain. We expect Boutiques to improve the quality of application integration in computational platforms, to reduce redundancy of effort, to contribute to computational reproducibility, and to foster Open Science. △ Less

Submitted 7 November, 2017; originally announced November 2017.

Comments: 10 pages

arXiv:1203.2366 [pdf]

Technical support for Life Sciences communities on a production grid infrastructure

Authors: Franck Michel, Johan Montagnat, Tristan Glatard

Abstract: Production operation of large distributed computing infrastructures (DCI) still requires a lot of human intervention to reach acceptable quality of service. This may be achievable for scientific communities with solid IT support, but it remains a show-stopper for others. Some application execution environments are used to hide runtime technical issues from end users. But they mostly aim at fault-t… ▽ More Production operation of large distributed computing infrastructures (DCI) still requires a lot of human intervention to reach acceptable quality of service. This may be achievable for scientific communities with solid IT support, but it remains a show-stopper for others. Some application execution environments are used to hide runtime technical issues from end users. But they mostly aim at fault-tolerance rather than incident resolution, and their operation still requires substantial manpower. A longer-term support activity is thus needed to ensure sustained quality of service for Virtual Organisations (VO). This paper describes how the biomed VO has addressed this challenge by setting up a technical support team. Its organisation, tooling, daily tasks, and procedures are described. Results are shown in terms of resource usage by end users, amount of reported incidents, and developed software tools. Based on our experience, we suggest ways to measure the impact of the technical support, perspectives to decrease its human cost and make it more community-specific. △ Less

Submitted 11 March, 2012; originally announced March 2012.

Comments: HealthGrid'12, Amsterdam : Netherlands (2012)

Showing 1–38 of 38 results for author: Glatard, T