
Showing 1–33 of 33 results for author: Glatard, T

  1. Performance comparison of Dask and Apache Spark on HPC systems for Neuroimaging

    Authors: Mathieu Dugré, Valérie Hayot-Sasson, Tristan Glatard

    Abstract: The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelin…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: 16 pages, 10 figures, 2 tables

    Journal ref: Concurrency and Computation: Practice and Experience (2023) 35(21):e7635

  2. arXiv:2405.17650 [pdf, other]

    cs.PF

    An Analysis of Performance Bottlenecks in MRI Pre-Processing

    Authors: Mathieu Dugré, Yohan Chatelain, Tristan Glatard

    Abstract: Magnetic Resonance Image (MRI) pre-processing is a critical step for neuroimaging analysis. However, the computational cost of MRI pre-processing pipelines is a major bottleneck for large cohort studies and some clinical applications. While High-Performance Computing (HPC) and, more recently, Deep Learning have been adopted to accelerate the computations, these techniques require costly hardware a…

    Submitted 27 May, 2024; originally announced May 2024.

    Comments: 8 pages, 7 figures, 3 tables, 1 listing

  3. arXiv:2404.11556 [pdf, other]

    cs.DC

    Hierarchical storage management in user space for neuroimaging applications

    Authors: Valérie Hayot-Sasson, Tristan Glatard

    Abstract: Neuroimaging open-data initiatives have led to increased availability of large scientific datasets. While these datasets are shifting the processing bottleneck from compute-intensive to data-intensive, current standardized analysis tools have yet to adopt strategies that mitigate the costs associated with large data transfers. A major challenge in adapting neuroimaging applications for data-intens…

    Submitted 17 April, 2024; originally announced April 2024.

  4. arXiv:2403.19421 [pdf, other]

    cs.LG cs.AI q-bio.NC q-bio.QM

    Scaling up ridge regression for brain encoding in a massive individual fMRI dataset

    Authors: Sana Ahmadi, Pierre Bellec, Tristan Glatard

    Abstract: Brain encoding with neuroimaging data is an established analysis aimed at predicting human brain activity directly from complex stimuli features such as movie frames. Typically, these features are the latent space representation from an artificial neural network, and the stimuli are image, audio, or text inputs. Ridge regression is a popular prediction model for brain encoding due to its good out-…

    Submitted 28 March, 2024; originally announced March 2024.

  5. arXiv:2403.15405 [pdf, other]

    q-bio.NC cs.AI eess.IV

    Predicting Parkinson's disease trajectory using clinical and functional MRI features: a reproduction and replication study

    Authors: Elodie Germani, Nikhil Baghwat, Mathieu Dugré, Rémi Gau, Albert Montillo, Kevin Nguyen, Andrzej Sokolowski, Madeleine Sharp, Jean-Baptiste Poline, Tristan Glatard

    Abstract: Parkinson's disease (PD) is a common neurodegenerative disorder with a poorly understood physiopathology and no established biomarkers for the diagnosis of early stages and for prediction of disease progression. Several neuroimaging biomarkers have been studied recently, but these are susceptible to several sources of variability. In this context, an evaluation of the robustness of such biomarkers…

    Submitted 24 May, 2024; v1 submitted 20 February, 2024; originally announced March 2024.

  6. arXiv:2308.16279 [pdf, other]

    cs.LG

    Classification of Anomalies in Telecommunication Network KPI Time Series

    Authors: Korantin Bordeau-Aubert, Justin Whatley, Sylvain Nadeau, Tristan Glatard, Brigitte Jaumard

    Abstract: The increasing complexity and scale of telecommunication networks have led to a growing interest in automated anomaly detection systems. However, the classification of anomalies detected on network Key Performance Indicators (KPI) has received less attention, resulting in a lack of information about anomaly characteristics and classification processes. To address this gap, this paper proposes a mo…

    Submitted 30 August, 2023; originally announced August 2023.

  7. arXiv:2307.01373 [pdf, other]

    physics.med-ph cs.SE

    A numerical variability approach to results stability tests and its application to neuroimaging

    Authors: Yohan Chatelain, Loïc Tetrel, Christopher J. Markiewicz, Mathias Goncalves, Gregory Kiar, Oscar Esteban, Pierre Bellec, Tristan Glatard

    Abstract: Ensuring the long-term reproducibility of data analyses requires results stability tests to verify that analysis results remain within acceptable variation bounds despite inevitable software updates and hardware evolutions. This paper introduces a numerical variability approach for results stability tests, which determines acceptable variation bounds using random rounding of floating-point calcula…

    Submitted 10 July, 2023; v1 submitted 3 July, 2023; originally announced July 2023.

    ACM Class: D.2.5

  8. Numerical Stability of DeepGOPlus Inference

    Authors: Inés Gonzalez Pepe, Yohan Chatelain, Gregory Kiar, Tristan Glatard

    Abstract: Convolutional neural networks (CNNs) are currently among the most widely-used deep neural network (DNN) architectures available and achieve state-of-the-art performance for many problems. Originally applied to computer vision tasks, CNNs work well with any data with a spatial relationship, besides images, and have been applied to different fields. However, recent works have highlighted numerical s…

    Submitted 28 February, 2024; v1 submitted 12 December, 2022; originally announced December 2022.

    Comments: 17 pages, 5 figures, 4 tables with 3 figures, 2 tables in Appendix

    Journal ref: Vol 19, no. 1 (2024): e0296725

  9. arXiv:2210.05704 [pdf, other]

    cs.LG

    Dynamic Ensemble Size Adjustment for Memory Constrained Mondrian Forest

    Authors: Martin Khannouz, Tristan Glatard

    Abstract: Supervised learning algorithms generally assume the availability of enough memory to store data models during the training and test phases. However, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. Such memory constraints impact the model behavior and assumptions. In this paper,…

    Submitted 11 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: text overlap with arXiv:2205.07871

  10. arXiv:2207.01737 [pdf, other]

    cs.DC

    Sea: A lightweight data-placement library for Big Data scientific computing

    Authors: Valérie Hayot-Sasson, Mathieu Dugré, Tristan Glatard

    Abstract: The recent influx of open scientific data has contributed to the transitioning of scientific computing from compute intensive to data intensive. Whereas many Big Data frameworks exist that minimize the cost of data transfers, few scientific applications integrate these frameworks or adopt data-placement strategies to mitigate the costs. Scientific applications commonly rely on well-established com…

    Submitted 4 July, 2022; originally announced July 2022.

  11. arXiv:2205.07871 [pdf, other]

    cs.LG cs.AI

    Mondrian Forest for Data Stream Classification Under Memory Constraints

    Authors: Martin Khannouz, Tristan Glatard

    Abstract: Supervised learning algorithms generally assume the availability of enough memory to store their data model during the training and test phases. However, in the Internet of Things, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. In this paper, we adapt the online Mondrian forest…

    Submitted 4 August, 2023; v1 submitted 12 May, 2022; originally announced May 2022.

  12. arXiv:2112.11508 [pdf, other]

    cs.MS cs.SE math.NA

    PyTracer: Automatically profiling numerical instabilities in Python

    Authors: Yohan Chatelain, Nigel Yong, Gregory Kiar, Tristan Glatard

    Abstract: Numerical stability is a crucial requirement of reliable scientific computing. However, despite the pervasiveness of Python in data science, analyzing large Python programs remains challenging due to the lack of scalable numerical analysis tools available for this language. To fill this gap, we developed PyTracer, a profiler to quantify numerical instability in Python applications. PyTracer transp…

    Submitted 8 February, 2022; v1 submitted 21 December, 2021; originally announced December 2021.

    Comments: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

  13. arXiv:2109.09649 [pdf, other]

    q-bio.QM cs.LG

    Data Augmentation Through Monte Carlo Arithmetic Leads to More Generalizable Classification in Connectomics

    Authors: Gregory Kiar, Yohan Chatelain, Ali Salari, Alan C. Evans, Tristan Glatard

    Abstract: Machine learning models are commonly applied to human brain imaging datasets in an effort to associate function or structure with behaviour, health, or other individual phenotypes. Such models often rely on low-dimensional maps generated by complex processing pipelines. However, the numerical instabilities inherent to pipelines limit the fidelity of these maps and introduce computational bias. Mon…

    Submitted 20 September, 2021; originally announced September 2021.

  14. arXiv:2108.10496 [pdf, other]

    cs.DC

    The benefits of prefetching for large-scale cloud-based neuroimaging analysis workflows

    Authors: Valerie Hayot-Sasson, Tristan Glatard, Ariel Rokem

    Abstract: To support the growing demands of neuroscience applications, researchers are transitioning to cloud computing for its scalable, robust and elastic infrastructure. Nevertheless, large datasets residing in object stores may result in significant data transfer overheads during workflow execution. Prefetching, a method to mitigate the cost of reading in mixed workloads, masks data transfer costs withi…

    Submitted 23 August, 2021; originally announced August 2021.

  15. arXiv:2108.09275 [pdf, other]

    cs.IR cs.LG

    A Recommender System for Scientific Datasets and Analysis Pipelines

    Authors: Mandana Mazaheri, Gregory Kiar, Tristan Glatard

    Abstract: Scientific datasets and analysis pipelines are increasingly being shared publicly in the interest of open science. However, mechanisms are lacking to reliably identify which pipelines and datasets can appropriately be used together. Given the increasing number of high-quality public datasets and pipelines, this lack of clear compatibility threatens the findability and reusability of these resource…

    Submitted 20 August, 2021; originally announced August 2021.

  16. arXiv:2106.14340 [pdf, other]

    cs.LG

    Reducing numerical precision preserves classification accuracy in Mondrian Forests

    Authors: Marc Vicuna, Martin Khannouz, Gregory Kiar, Yohan Chatelain, Tristan Glatard

    Abstract: Mondrian Forests are a powerful data stream classification method, but their large memory footprint makes them ill-suited for low-resource platforms such as connected objects. We explored using reduced-precision floating-point representations to lower memory consumption and evaluated its effect on classification performance. We applied the Mondrian Forest implementation provided by OrpailleCC, a C…

    Submitted 27 June, 2021; originally announced June 2021.

    Comments: 6 pages, 3 tables, 2 figures. Keywords: numerical precision, memory footprint, Mondrian Forests, human activity recognition, data streams, supervised classification, floating-point representation

  17. arXiv:2101.01335 [pdf, other]

    cs.DC cs.PF

    Modeling the Linux page cache for accurate simulation of data-intensive applications

    Authors: Hoang-Dung Do, Valerie Hayot-Sasson, Rafael Ferreira da Silva, Christopher Steele, Henri Casanova, Tristan Glatard

    Abstract: The emergence of Big Data in recent years has resulted in a growing need for efficient data processing solutions. While infrastructures with sufficient compute power are available, the I/O bottleneck remains. The Linux page cache is an efficient approach to reduce I/O overheads, but few experimental studies of its interactions with Big Data applications exist, partly due to limitations of real-wor…

    Submitted 4 January, 2021; originally announced January 2021.

    Comments: 10 pages, 8 figures, CCGrid

  18. arXiv:2010.13970 [pdf, other]

    cs.CR

    An Analysis of Security Vulnerabilities in Container Images for Scientific Data Analysis

    Authors: Bhupinder Kaur, Mathieu Dugré, Aiman Hanna, Tristan Glatard

    Abstract: Software containers greatly facilitate the deployment and reproducibility of scientific data analyses in various platforms. However, container images often contain outdated or unnecessary software packages, which increases the number of security vulnerabilities in the images, widens the attack surface in the container host, and creates substantial security risks for computing infrastructures at la…

    Submitted 17 March, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

  19. arXiv:2008.11880 [pdf, other]

    cs.LG stat.ML

    A benchmark of data stream classification for human activity recognition on connected objects

    Authors: Martin Khannouz, Tristan Glatard

    Abstract: This paper evaluates data stream classifiers from the perspective of connected devices, focusing on the use case of HAR. We measure both classification performance and resource consumption (runtime, memory, and power) of five usual stream classification algorithms, implemented in a consistent library, and applied to two real human activity datasets and to three synthetic datasets. Regarding classi…

    Submitted 26 August, 2020; originally announced August 2020.

    Comments: 8 pages, 9 figures, for a journal

  20. Can we Estimate Truck Accident Risk from Telemetric Data using Machine Learning?

    Authors: Antoine Hébert, Ian Marineau, Gilles Gervais, Tristan Glatard, Brigitte Jaumard

    Abstract: Road accidents have a high societal cost that could be reduced through improved risk predictions using machine learning. This study investigates whether telemetric data collected on long-distance trucks can be used to predict the risk of accidents associated with a driver. We use a dataset provided by a truck transportation company containing the driving data of 1,141 drivers for 18 months. We e…

    Submitted 17 July, 2020; originally announced July 2020.

    Journal ref: 2021 IEEE International Conference on Big Data, pp. 1827-1836

  21. arXiv:1912.11794 [pdf, other]

    cs.PF

    Performance benefits of Intel(R) Optane(TM) DC persistent memory for the parallel processing of large neuroimaging data

    Authors: Valerie Hayot-Sasson, Shawn T Brown, Tristan Glatard

    Abstract: Open-access neuroimaging datasets have reached petabyte scale, and continue to grow. The ability to leverage the entirety of these datasets is limited to a restricted number of labs with both the capacity and infrastructure to process the data. Whereas Big Data engines have significantly reduced application performance penalties with respect to data movement, their applied strategies (e.g. data lo…

    Submitted 26 December, 2019; originally announced December 2019.

  22. arXiv:1907.13030 [pdf, other]

    cs.DC cs.PF

    A performance comparison of Dask and Apache Spark for data-intensive neuroimaging pipelines

    Authors: Mathieu Dugré, Valérie Hayot-Sasson, Tristan Glatard

    Abstract: In the past few years, neuroimaging has entered the Big Data era due to the joint increase in image resolution, data sharing, and study sizes. However, no particular Big Data engines have emerged in this field, and several alternatives remain available. We compare two popular Big Data engines with Python APIs, Apache Spark and Dask, for their runtime performance in processing neuroimaging pipeline…

    Submitted 5 October, 2019; v1 submitted 30 July, 2019; originally announced July 2019.

    Comments: 10 pages, 15 figures, 1 table. To appear in the proceedings of the 14th WORKS Workshop on Topics in Workflows in Support of Large-Scale Science, 17 November 2019, Denver, CO, USA

  23. arXiv:1907.03047 [pdf, other]

    cs.CY cs.SI

    A Conceptual Marketplace Model for IoT Generated Personal Data

    Authors: Victor Molina, Marta Kersten-Oertel, Tristan Glatard

    Abstract: We propose a decentralized conceptual marketplace model for IoT generated personal data. Our model is based on a thorough analysis of personal data in a marketplace context, with specific focus on the challenges presented by commercializing IoT generated personal data. Our model introduces a novel perspective on the commercialization of personal data for a marketplace context via risk evaluation a…

    Submitted 5 July, 2019; originally announced July 2019.

  24. arXiv:1905.12720 [pdf, other]

    cs.DC cs.PF

    Evaluation of pilot jobs for Apache Spark applications on HPC clusters

    Authors: Valerie Hayot-Sasson, Tristan Glatard

    Abstract: Big Data has become prominent throughout many scientific fields and, as a result, scientific communities have sought out Big Data frameworks to accelerate the processing of their increasingly data-intensive pipelines. However, while scientific communities typically rely on High-Performance Computing (HPC) clusters for the parallelization of their pipelines, many popular Big Data frameworks such as…

    Submitted 29 May, 2019; originally announced May 2019.

  25. High-Resolution Road Vehicle Collision Prediction for the City of Montreal

    Authors: Antoine Hébert, Timothée Guédon, Tristan Glatard, Brigitte Jaumard

    Abstract: Road accidents are an important issue in modern societies, responsible for millions of deaths and injuries worldwide every year. In Quebec alone, road accidents were responsible for 359 deaths and 33 thousand injuries in 2018. In this paper, we show how one can leverage open datasets of a city like Montreal, Canada, to create high-resolution accident prediction models, using big data ana…

    Submitted 11 November, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

    Journal ref: 2019 IEEE International Conference on Big Data, pp. 1804-1813

  26. arXiv:1904.02666 [pdf, other]

    cs.LG stat.ML

    Subject Cross Validation in Human Activity Recognition

    Authors: Akbar Dehghani, Tristan Glatard, Emad Shihab

    Abstract: K-fold Cross Validation is commonly used to evaluate classifiers and tune their hyperparameters. However, it assumes that data points are Independent and Identically Distributed (i.i.d.) so that samples used in the training and test sets can be selected randomly and uniformly. In Human Activity Recognition datasets, we note that the samples produced by the same subjects are likely to be correlated…

    Submitted 9 April, 2019; v1 submitted 4 April, 2019; originally announced April 2019.

  27. arXiv:1812.06492 [pdf, other]

    cs.DC

    Performance Evaluation of Big Data Processing Strategies for Neuroimaging

    Authors: Valérie Hayot-Sasson, Shawn T Brown, Tristan Glatard

    Abstract: Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open-science and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality and lazy evaluation) on typical neuroimaging use cases, repres…

    Submitted 2 April, 2019; v1 submitted 16 December, 2018; originally announced December 2018.

  28. arXiv:1811.09930 [pdf, other]

    cs.IT physics.data-an

    A multi-dimensional extension of the Lightweight Temporal Compression method

    Authors: Bo Li, Omid Sarbishei, Hosein Nourani, Tristan Glatard

    Abstract: Lightweight Temporal Compression (LTC) is among the lossy stream compression methods that provide the highest compression rate for the lowest CPU and memory consumption. As such, it is well suited to compress data streams in energy-constrained systems such as connected objects. The current formulation of LTC, however, is one-dimensional while data acquired in connected objects is often multi-dimen…

    Submitted 24 November, 2018; originally announced November 2018.

  29. arXiv:1810.09944 [pdf, other]

    cs.LG stat.ML

    Data models for service failure prediction in supply-chain networks

    Authors: Monika Sharma, Tristan Glatard, Eric Gelinas, Mariam Tagmouti, Brigitte Jaumard

    Abstract: We aim to predict and explain service failures in supply-chain networks, more precisely among last-mile pickup and delivery services to customers. We analyze a dataset of 500,000 services using (1) supervised classification with Random Forests, and (2) Association Rules. Our classifier reaches an average sensitivity of 0.7 and an average specificity of 0.7 for the 5 studied types of failure. Assoc…

    Submitted 20 October, 2018; originally announced October 2018.

  30. arXiv:1809.10139 [pdf, other]

    stat.ME cs.LG stat.ML

    Predicting computational reproducibility of data analysis pipelines in large population studies using collaborative filtering

    Authors: Soudabeh Barghi, Lalet Scaria, Ali Salari, Tristan Glatard

    Abstract: Evaluating the computational reproducibility of data analysis pipelines has become a critical issue. It is, however, a cumbersome process for analyses that involve data from large populations of subjects, due to their computational and storage requirements. We present a method to predict the computational reproducibility of data analysis pipelines in large population studies. We formulate the prob…

    Submitted 26 September, 2018; originally announced September 2018.

  31. arXiv:1809.07693 [pdf]

    cs.DC cs.SE

    A Serverless Tool for Platform Agnostic Computational Experiment Management

    Authors: Gregory Kiar, Shawn T Brown, Tristan Glatard, Alan C Evans

    Abstract: Neuroscience has been carried into the domain of big data and high performance computing (HPC) on the backs of initiatives in data collection and increasingly compute-intensive tools. While managing HPC experiments requires considerable technical acumen, platforms and standards have been developed to ease this burden on scientists. While web-portals make resources widely accessible, data organi…

    Submitted 2 September, 2018; originally announced September 2018.

    Comments: 12 pages, 3 figures, 1 tool

  32. arXiv:1711.09713 [pdf, other]

    cs.SE cs.DC

    Boutiques: a flexible framework for automated application integration in computing platforms

    Authors: Tristan Glatard, Gregory Kiar, Tristan Aumentado-Armstrong, Natacha Beck, Pierre Bellec, Rémi Bernard, Axel Bonnet, Sorina Camarasu-Pop, Frédéric Cervenansky, Samir Das, Rafael Ferreira da Silva, Guillaume Flandin, Pascal Girard, Krzysztof J. Gorgolewski, Charles R. G. Guttmann, Valérie Hayot-Sasson, Pierre-Olivier Quirion, Pierre Rioux, Marc-Etienne Rousseau, Alan C. Evans

    Abstract: We present Boutiques, a system to automatically publish, integrate and execute applications across computational platforms. Boutiques applications are installed through software containers described in a rich and flexible JSON language. A set of core tools facilitate the construction, validation, import, execution, and publishing of applications. Boutiques is currently supported by several distinc…

    Submitted 7 November, 2017; originally announced November 2017.

    Comments: 10 pages

  33. arXiv:1203.2366 [pdf]

    cs.DC

    Technical support for Life Sciences communities on a production grid infrastructure

    Authors: Franck Michel, Johan Montagnat, Tristan Glatard

    Abstract: Production operation of large distributed computing infrastructures (DCI) still requires a lot of human intervention to reach acceptable quality of service. This may be achievable for scientific communities with solid IT support, but it remains a show-stopper for others. Some application execution environments are used to hide runtime technical issues from end users. But they mostly aim at fault-t…

    Submitted 11 March, 2012; originally announced March 2012.

    Comments: HealthGrid'12, Amsterdam : Netherlands (2012)