Evaluation of data imputation strategies in complex, deeply-phenotyped data sets: the case of the EU-AIMS Longitudinal European Autism Project
Authors:
A. Llera,
M. Brammer,
B. Oakley,
J. Tillmann,
M. Zabihi,
T. Mei,
T. Charman,
C. Ecker,
F. Dell Acqua,
T. Banaschewski,
C. Moessnang,
S. Baron-Cohen,
R. Holt,
S. Durston,
D. Murphy,
E. Loth,
J. K. Buitelaar,
D. L. Floris,
C. F. Beckmann
Abstract:
An increasing number of large-scale multi-modal research initiatives has been conducted in the typically develo** population, as well as in psychiatric cohorts. Missing data is a common problem in such datasets due to the difficulty of assessing multiple measures on a large number of participants. The consequences of missing data accumulate when researchers aim to explore relationships between m…
▽ More
An increasing number of large-scale multi-modal research initiatives has been conducted in the typically develo** population, as well as in psychiatric cohorts. Missing data is a common problem in such datasets due to the difficulty of assessing multiple measures on a large number of participants. The consequences of missing data accumulate when researchers aim to explore relationships between multiple measures. Here we aim to evaluate different imputation strategies to fill in missing values in clinical data from a large (total N=764) and deeply characterised (i.e. range of clinical and cognitive instruments administered) sample of N=453 autistic individuals and N=311 control individuals recruited as part of the EU-AIMS Longitudinal European Autism Project (LEAP) consortium. In particular we consider a total of 160 clinical measures divided in 15 overlap** subsets of participants. We use two simple but common univariate strategies, mean and median imputation, as well as a Round Robin regression approach involving four independent multivariate regression models including a linear model, Bayesian Ridge regression, as well as several non-linear models, Decision Trees, Extra Trees and K-Neighbours regression. We evaluate the models using the traditional mean square error towards removed available data, and consider in addition the KL divergence between the observed and the imputed distributions. We show that all of the multivariate approaches tested provide a substantial improvement compared to typical univariate approaches. Further, our analyses reveal that across all 15 data-subsets tested, an Extra Trees regression approach provided the best global results. This allows the selection of a unique model to impute missing data for the LEAP project and deliver a fixed set of imputed clinical data to be used by researchers working with the LEAP dataset in the future.
△ Less
Submitted 20 January, 2022;
originally announced January 2022.
Group Analysis of Self-organizing Maps based on Functional MRI using Restricted Frechet Means
Authors:
Arnaud P. Fournel,
Emanuelle Reynaud,
Michael J. Brammer,
Andrew Simmons,
Cedric E. Ginestet
Abstract:
Studies of functional MRI data are increasingly concerned with the estimation of differences in spatio-temporal networks across groups of subjects or experimental conditions. Unsupervised clustering and independent component analysis (ICA) have been used to identify such spatio-temporal networks. While these approaches have been useful for estimating these networks at the subject-level, comparison…
▽ More
Studies of functional MRI data are increasingly concerned with the estimation of differences in spatio-temporal networks across groups of subjects or experimental conditions. Unsupervised clustering and independent component analysis (ICA) have been used to identify such spatio-temporal networks. While these approaches have been useful for estimating these networks at the subject-level, comparisons over groups or experimental conditions require further methodological development. In this paper, we tackle this problem by showing how self-organizing maps (SOMs) can be compared within a Frechean inferential framework. Here, we summarize the mean SOM in each group as a Frechet mean with respect to a metric on the space of SOMs. We consider the use of different metrics, and introduce two extensions of the classical sum of minimum distance (SMD) between two SOMs, which take into account the spatio-temporal pattern of the fMRI data. The validity of these methods is illustrated on synthetic data. Through these simulations, we show that the three metrics of interest behave as expected, in the sense that the ones capturing temporal, spatial and spatio-temporal aspects of the SOMs are more likely to reach significance under simulated scenarios characterized by temporal, spatial and spatio-temporal differences, respectively. In addition, a re-analysis of a classical experiment on visually-triggered emotions demonstrates the usefulness of this methodology. In this study, the multivariate functional patterns typical of the subjects exposed to pleasant and unpleasant stimuli are found to be more similar than the ones of the subjects exposed to emotionally neutral stimuli. Taken together, these results indicate that our proposed methods can cast new light on existing data by adopting a global analytical perspective on functional MRI paradigms.
△ Less
Submitted 13 August, 2012; v1 submitted 28 May, 2012;
originally announced May 2012.