Search | arXiv e-print repository

Classification ensembles for multivariate functional data with application to mouse movements in web surveys

Authors: Amanda Fernández-Fontelo, Felix Henninger, Pascal J. Kieslich, Frauke Kreuter, Sonja Greven

Abstract: We propose new ensemble models for multivariate functional data classification as combinations of semi-metric-based weak learners. Our models extend current semi-metric-type methods from the univariate to the multivariate case, propose new semi-metrics to compute distances between functions, and consider more flexible options for combining weak learners using stacked generalisation methods. We apply these ensemble models to identify respondents' difficulty with survey questions, with the aim to improve survey data quality. As predictors of difficulty, we use mouse movement trajectories from the respondents' interaction with a web survey, in which several questions were manipulated to create two scenarios with different levels of difficulty.

Submitted 26 May, 2022; originally announced May 2022.

Comments: 24 pages, 3 tables, 0 figures

arXiv:2105.01557 [pdf, other]

Good distribution modelling with the R package good

Authors: Jordi Tur, David Moriña, Pedro Puig, Alejandra Cabaña, Argimiro Arratia, Amanda Fernández-Fontelo

Abstract: Although models for count data with over-dispersion have been widely considered in the literature, models for under-dispersion -- the opposite phenomenon -- have received less attention as it is only relatively common in particular research fields such as biodosimetry and ecology. The Good distribution is a flexible alternative for modelling count data showing either over-dispersion or under-dispersion, although no R packages are still available to the best of our knowledge. We aim to present in the following the R package good that computes the standard probabilistic functions (i.e., probability density function, cumulative distribution function, and quantile function) and generates random samples from a population following a Good distribution. The package also considers a function for Good regression, including covariates in a similar way to that of the standard glm function. We finally show the use of such a package with some real-world data examples addressing both over-dispersion and especially under-dispersion.

Submitted 4 May, 2021; originally announced May 2021.

Comments: 15 pages, 2 figures

arXiv:2104.07575 [pdf, other]

Bayesian Synthetic Likelihood Estimation for Underreported Non-Stationary Time Series: Covid-19 Incidence in Spain

Authors: David Moriña, Amanda Fernández-Fontelo, Alejandra Cabaña, Argimiro Arratia, Pedro Puig

Abstract: The problem of dealing with misreported data is very common in a wide range of contexts for different reasons. The current situation caused by the Covid-19 worldwide pandemic is a clear example, where the data provided by official sources were not always reliable due to data collection issues and to the high proportion of asymptomatic cases. In this work, we explore the performance of Bayesian Synthetic Likelihood to estimate the parameters of a model capable of dealing with misreported information and to reconstruct the most likely evolution of the phenomenon. The performance of the proposed methodology is evaluated through a comprehensive simulation study and illustrated by reconstructing the weekly Covid-19 incidence in each Spanish Autonomous Community in 2020.

Submitted 19 July, 2022; v1 submitted 15 April, 2021; originally announced April 2021.

arXiv:2011.06916 [pdf]

Predicting respondent difficulty in web surveys: A machine-learning approach based on mouse movement features

Authors: Amanda Fernández-Fontelo, Pascal J. Kieslich, Felix Henninger, Frauke Kreuter, Sonja Greven

Abstract: A central goal of survey research is to collect robust and reliable data from respondents. However, despite researchers' best efforts in designing questionnaires, respondents may experience difficulty understanding questions' intent and therefore may struggle to respond appropriately. If it were possible to detect such difficulty, this knowledge could be used to inform real-time interventions through responsive questionnaire design, or to indicate and correct measurement error after the fact. Previous research in the context of web surveys has used paradata, specifically response times, to detect difficulties and to help improve user experience and data quality. However, richer data sources are now available, in the form of the movements respondents make with the mouse, as an additional and far more detailed indicator for the respondent-survey interaction. This paper uses machine learning techniques to explore the predictive value of mouse-tracking data with regard to respondents' difficulty. We use data from a survey on respondents' employment history and demographic information, in which we experimentally manipulate the difficulty of several questions. Using features derived from the cursor movements, we predict whether respondents answered the easy or difficult version of a question, using and comparing several state-of-the-art supervised learning methods. In addition, we develop a personalization method that adjusts for respondents' baseline mouse behavior and evaluate its performance.
For all three manipulated survey questions, we find that including the full set of mouse movement features improved prediction performance over response-time-only models in nested cross-validation. Accounting for individual differences in mouse movements led to further improvements.

Submitted 5 November, 2020; originally announced November 2020.

Comments: 40 pages, 2 Figures, 3 Tables

arXiv:2008.00262 [pdf, other]

doi 10.1371/journal.pone.0242956

Estimating the real burden of disease under a pandemic situation: The SARS-CoV2 case

Authors: Amanda Fernández-Fontelo, David Moriña, Alejandra Cabaña, Argimiro Arratia, Pere Puig

Abstract: The present paper introduces a new model used to study and analyse the severe acute respiratory syndrome coronavirus 2 (SARS-CoV2) epidemic-reported-data from Spain. This is a Hidden Markov Model whose hidden layer is a regeneration process with Poisson immigration, Po-INAR(1), together with a mechanism that allows the estimation of the under-reporting in non-stationary count time series. A novelty of the model is that the expectation of the innovations in the unobserved process is a time-dependent function defined in such a way that information about the spread of an epidemic, as modelled through a Susceptible-Infectious-Removed dynamical system, is incorporated into the model. In addition, the parameter controlling the intensity of the under-reporting is also made to vary with time to adjust to possible seasonality or trend in the data. Maximum likelihood methods are used to estimate the parameters of the model.

Submitted 1 August, 2020; originally announced August 2020.

Comments: 18 pages, 4 figures

arXiv:2007.15727 [pdf, other]

doi 10.1093/eurpub/ckab118

Cumulated burden of Covid-19 in Spain from a Bayesian perspective

Authors: David Moriña, Amanda Fernández-Fontelo, Alejandra Cabaña, Argimiro Arratia, Gustavo Ávalos, Pedro Puig

Abstract: The main goal of this work is to estimate the actual number of cases of Covid-19 in Spain in the period 01-31-2020 / 06-01-2020 by Autonomous Communities. Based on these estimates, this work allows us to accurately re-estimate the lethality of the disease in Spain, taking into account unreported cases. A hierarchical Bayesian model recently proposed in the literature has been adapted to model the actual number of Covid-19 cases in Spain. The results of this work show that the real load of Covid-19 in Spain in the period considered is well above the data registered by the public health system. Specifically, the model estimates show that, cumulatively until June 1st, 2020, there were 2,425,930 cases of Covid-19 in Spain with characteristics similar to those reported (95% credibility interval: 2,148,261 - 2,813,864), of which only 518,664 were actually registered. Considering the results obtained from the second wave of the Spanish seroprevalence study, which estimates 2,350,324 cases of Covid-19 produced in Spain, in the period of time considered, it can be seen that the estimates provided by the model are quite good. This work clearly shows the key importance of having good quality data to optimize decision-making in the critical context of dealing with a pandemic.

Submitted 30 July, 2020; originally announced July 2020.

arXiv:2003.09213 [pdf, ps, other]

doi 10.1186/s12874-020-01188-4

Quantifying the under-reporting of genital warts cases

Authors: David Moriña, Amanda Fernández-Fontelo, Alejandra Cabaña, Pedro Puig, Laura Monfil, Maria Brotons, Mireia Diaz

Abstract: Genital warts are a common and highly contagious sexually transmitted disease. They have a large economic burden and affect several aspects of quality of life. Incidence data underestimate the real occurrence of genital warts because this infection is often under-reported, mostly due to their specific characteristics such as the asymptomatic course. Genital warts cases for the analysis were obtained from the Catalan public health system database (SIDIAP) for the period 2009-2016, covering 74% of the Catalan population. People under 15 and over 94 years old were excluded from the analysis as the incidence of genital warts in this population is negligible. This work introduces a time series model based on a mixture of two distributions, capable of detecting the presence of under-reporting in the data. In order to identify potential differences in the magnitude of the under-reporting issue depending on sex and age, these covariates were included in the model. This work shows that only about 80% on average of genital warts incidence in Catalunya in the period 2009-2016 was registered, although the frequency of under-reporting has been decreasing over the study period. It can also be seen that the under-reporting issue has a deeper impact on women over 30 years old. The registered incidence in the Catalan public health system underestimates the real burden by almost 10,000 cases in Catalunya, around 23% of the registered cases.
The total annual cost in Catalunya is underestimated by at least about 10 million euros relative to the 54 million euros annually devoted to genital warts in Catalunya, representing 0.4% of the total budget of the public health system.

Submitted 20 March, 2020; originally announced March 2020.

arXiv:2003.09202 [pdf, ps, other]

New statistical model for misreported data with application to current public health challenges

Authors: David Moriña, Amanda Fernández-Fontelo, Alejandra Cabaña, Pedro Puig

Abstract: The main goal of this work is to present a new model able to deal with potentially misreported continuous time series. The proposed model is able to handle the autocorrelation structure in continuous time series data, which might be partially or totally underreported or overreported. Its performance is illustrated through a comprehensive simulation study considering several autocorrelation structures and two real data applications on human papillomavirus incidence in Girona (Catalunya, Spain) and COVID-19 incidence in the Chinese region of Heilongjiang.

Submitted 17 June, 2021; v1 submitted 20 March, 2020; originally announced March 2020.

Showing 1–8 of 8 results for author: Fernández-Fontelo, A