-
Predicting Parkinson's disease trajectory using clinical and functional MRI features: a reproduction and replication study
Authors:
Elodie Germani,
Nikhil Baghwat,
Mathieu Dugré,
Rémi Gau,
Albert Montillo,
Kevin Nguyen,
Andrzej Sokolowski,
Madeleine Sharp,
Jean-Baptiste Poline,
Tristan Glatard
Abstract:
Parkinson's disease (PD) is a common neurodegenerative disorder with a poorly understood physiopathology and no established biomarkers for the diagnosis of early stages and for prediction of disease progression. Several neuroimaging biomarkers have been studied recently, but these are susceptible to several sources of variability. In this context, an evaluation of the robustness of such biomarkers…
▽ More
Parkinson's disease (PD) is a common neurodegenerative disorder with a poorly understood physiopathology and no established biomarkers for the diagnosis of early stages and for prediction of disease progression. Several neuroimaging biomarkers have been studied recently, but these are susceptible to several sources of variability. In this context, an evaluation of the robustness of such biomarkers is essential. This study is part of a larger project investigating the replicability of potential neuroimaging biomarkers of PD. Here, we attempt to reproduce (same data, same method) and replicate (different data or method) the models described in Nguyen et al., 2021 to predict individual's PD current state and progression using demographic, clinical and neuroimaging features (fALFF and ReHo extracted from resting-state fMRI). We use the Parkinson's Progression Markers Initiative dataset (PPMI, ppmi-info.org), as in Nguyen et al.,2021 and aim to reproduce the original cohort, imaging features and machine learning models as closely as possible using the information available in the paper and the code. We also investigated methodological variations in cohort selection, feature extraction pipelines and sets of input features. The success of the reproduction was assessed using different criteria. Notably, we obtained significantly better than chance performance using the analysis pipeline closest to that in the original study (R2 > 0), which is consistent with its findings. The challenges encountered while reproducing and replicating the original work are likely explained by the complexity of neuroimaging studies, in particular in clinical settings. We provide recommendations to further facilitate the reproducibility of such studies in the future.
△ Less
Submitted 24 May, 2024; v1 submitted 20 February, 2024;
originally announced March 2024.
-
Benchmarking missing-values approaches for predictive models on health databases
Authors:
Alexandre Perez-Lebel,
Gaël Varoquaux,
Marine Le Morvan,
Julie Josse,
Jean-Baptiste Poline
Abstract:
BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative -- rather than generative -- modeling,…
▽ More
BACKGROUND: As databases grow larger, it becomes harder to fully control their collection, and they frequently come with missing values: incomplete observations. These large databases are well suited to train machine-learning models, for instance for forecasting or to extract biomarkers in biomedical settings. Such predictive approaches can use discriminative -- rather than generative -- modeling, and thus open the door to new missing-values strategies. Yet existing empirical evaluations of strategies to handle missing values have focused on inferential statistics. RESULTS: Here we conduct a systematic benchmark of missing-values strategies in predictive models with a focus on large health databases: four electronic health record datasets, a population brain imaging one, a health survey and two intensive care ones. Using gradient-boosted trees, we compare native support for missing values with simple and state-of-the-art imputation prior to learning. We investigate prediction accuracy and computational time. For prediction after imputation, we find that adding an indicator to express which values have been imputed is important, suggesting that the data are missing not at random. Elaborate missing values imputation can improve prediction compared to simple strategies but requires longer computational time on large data. Learning trees that model missing values-with missing incorporated attribute-leads to robust, fast, and well-performing predictive modeling. CONCLUSIONS: Native support for missing values in supervised machine learning predicts better than state-of-the-art imputation with much less computational cost. When using imputation, it is important to add indicator columns expressing which values have been imputed.
△ Less
Submitted 17 February, 2022;
originally announced February 2022.
-
Recommendations for repositories and scientific gateways from a neuroscience perspective
Authors:
Malin Sandström,
Mathew Abrams,
Jan Bjaalie,
Mona Hicks,
David Kennedy,
Arvind Kumar,
JB Poline,
Prasun Roy,
Paul Tiesinga,
Thomas Wachtler,
Wojtek Goscinski
Abstract:
Digital services such as repositories and science gateways have become key resources for the neuroscience community, but users often have a hard time orienting themselves in the service landscape to find the best fit for their particular needs. INCF (International Neuroinformatics Coordinating Facility) has developed a set of recommendations and associated criteria for choosing or setting up and r…
▽ More
Digital services such as repositories and science gateways have become key resources for the neuroscience community, but users often have a hard time orienting themselves in the service landscape to find the best fit for their particular needs. INCF (International Neuroinformatics Coordinating Facility) has developed a set of recommendations and associated criteria for choosing or setting up and running a repository or scientific gateway, intended for the neuroscience community, with a FAIR neuroscience perspective. These recommendations have neurosciences as their primary use case but are often general. Considering the perspectives of researchers and providers of repositories as well as scientific gateways, the recommendations harmonize and complement existing work on criteria for repositories and best practices. The recommendations cover a range of important areas including accessibility, licensing, community responsibility and technical and financial sustainability of a service.
△ Less
Submitted 3 January, 2022;
originally announced January 2022.
-
Preventing dataset shift from breaking machine-learning biomarkers
Authors:
Jéroôme Dockès,
Gaël Varoquaux,
Jean-Baptiste Poline
Abstract:
Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new indiv…
▽ More
Machine learning brings the hope of finding new biomarkers extracted from cohorts with rich biomedical measurements. A good biomarker is one that gives reliable detection of the corresponding condition. However, biomarkers are often extracted from a cohort that differs from the target population. Such a mismatch, known as a dataset shift, can undermine the application of the biomarker to new individuals. Dataset shifts are frequent in biomedical research, e.g. because of recruitment biases. When a dataset shift occurs, standard machine-learning techniques do not suffice to extract and validate biomarkers. This article provides an overview of when and how dataset shifts breaks machine-learning extracted biomarkers, as well as detection and correction strategies.
△ Less
Submitted 21 July, 2021;
originally announced July 2021.
-
PyXNAT: XNAT in Python
Authors:
Yannick Schwartz,
Alexis Barbot,
Benjamin Thyreau,
Vincent Frouin,
Gaël Varoquaux,
Aditya Siram,
Daniel Marcus,
Jean-Baptiste Poline
Abstract:
As neuroimaging databases grow in size and complexity, the time researchers spend investigating and managing the data increases to the expense of data analysis. As a result, investigators rely more and more heavily on scripting using high-level languages to automate data management and processing tasks. For this, a structured and programmatic access to the data store is necessary. Web services are…
▽ More
As neuroimaging databases grow in size and complexity, the time researchers spend investigating and managing the data increases to the expense of data analysis. As a result, investigators rely more and more heavily on scripting using high-level languages to automate data management and processing tasks. For this, a structured and programmatic access to the data store is necessary. Web services are a first step toward this goal. They however lack in functionality and ease of use because they provide only low level interfaces to databases. We introduce here PyXNAT, a Python module that interacts with The Extensible Neuroimaging Archive Toolkit (XNAT) through native Python calls across multiple operating systems. The choice of Python enables PyXNAT to expose the XNAT Web Services and unify their features with a higher level and more expressive language. PyXNAT provides XNAT users direct access to all the scientific packages in Python. Finally PyXNAT aims to be efficient and easy to use, both as a backend library to build XNAT clients and as an alternative frontend from the command line.
△ Less
Submitted 29 January, 2013;
originally announced January 2013.
-
CanICA: Model-based extraction of reproducible group-level ICA patterns from fMRI time series
Authors:
Gaël Varoquaux,
Sepideh Sadaghiani,
Jean Baptiste Poline,
Bertrand Thirion
Abstract:
Spatial Independent Component Analysis (ICA) is an increasingly used data-driven method to analyze functional Magnetic Resonance Imaging (fMRI) data. To date, it has been used to extract meaningful patterns without prior information. However, ICA is not robust to mild data variation and remains a parameter-sensitive algorithm. The validity of the extracted patterns is hard to establish, as well…
▽ More
Spatial Independent Component Analysis (ICA) is an increasingly used data-driven method to analyze functional Magnetic Resonance Imaging (fMRI) data. To date, it has been used to extract meaningful patterns without prior information. However, ICA is not robust to mild data variation and remains a parameter-sensitive algorithm. The validity of the extracted patterns is hard to establish, as well as the significance of differences between patterns extracted from different groups of subjects. We start from a generative model of the fMRI group data to introduce a probabilistic ICA pattern-extraction algorithm, called CanICA (Canonical ICA). Thanks to an explicit noise model and canonical correlation analysis, our method is auto-calibrated and identifies the group-reproducible data subspace before performing ICA. We compare our method to state-of-the-art multi-subject fMRI ICA methods and show that the features extracted are more reproducible.
△ Less
Submitted 24 November, 2009;
originally announced November 2009.