Comparing estimators of discriminative performance of time-to-event models

Abstract: Predicting the timing and occurrence of events is a major focus of data science applications, especially in the context of biomedical research. Performance for models estimating these outcomes, often referred to as time-to-event or survival outcomes, is frequently summarized using measures of discrimination, in particular time-dependent AUC and concordance. Many estimators for these quantities have been proposed which can be broadly categorized as either semi-parametric estimators or non-parametric estimators. In this paper, we review various estimators' mathematical construction and compare the behavior of the two classes of estimators. Importantly, we identify a previously unknown feature of the class of semi-parametric estimators that can result in vastly over-optimistic out-of-sample estimation of discriminative performance in common applied tasks. Although these semi-parametric estimators are popular in practice, the phenomenon we identify here suggests this class of estimators may be inappropriate for use in model assessment and selection based on out-of-sample evaluation criteria. This is due to the semi-parametric estimators' bias in favor of models that are overfit when using out-of-sample prediction criteria (e.g., cross validation). Non-parametric estimators, which do not exhibit this behavior, are highly variable for local discrimination. We propose to address the high variability problem through penalized regression splines smoothing.
The behavior of various estimators of time-dependent AUC and concordance is illustrated via a simulation study using two different mechanisms that produce over-optimistic out-of-sample estimates using semi-parametric estimators. Estimators are further compared using a case study using data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014.

Submitted 6 June, 2024; originally announced June 2024.
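
To make the notion of concordance in the abstract above concrete, here is a minimal Python sketch of Harrell's C-index, a widely used estimator computed directly from comparable pairs. It is shown only as an illustration under stated assumptions: the function name `harrell_c` and its arguments are hypothetical, and this is not necessarily one of the estimators the paper evaluates, nor the smoothing approach it proposes.

```python
def harrell_c(times, events, risk):
    """Harrell's concordance index for right-censored data (illustrative sketch).

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risk   : model risk scores (higher score = earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        # A pair is comparable only if the subject with the shorter
        # observed time actually experienced the event.
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # higher risk failed first
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied scores count as half
    return concordant / comparable if comparable else float("nan")


if __name__ == "__main__":
    # Toy data (made up for illustration): the third subject is censored.
    print(harrell_c([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0]))
```

A value near 1 means higher risk scores consistently go with earlier events, while 0.5 corresponds to random ordering.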

arXiv:2311.14054 [pdf, other]

Analysis of Active/Inactive Patterns in the NHANES Data using Generalized Multilevel Functional Principal Component Analysis

Authors: Xinkai Zhou, Julia Wrobel, Ciprian M. Crainiceanu, Andrew Leroux

Abstract: Between 2011 and 2014 NHANES collected objectively measured physical activity data using wrist-worn accelerometers for tens of thousands of individuals for up to seven days. Here we analyze the minute-level indicators of being active, which can be viewed as binary (because there is an active indicator at every minute), multilevel (because there are multiple days of data for each study participant), functional (because within-day data can be viewed as a function of time) data. To extract within- and between-participant directions of variation in the data, we introduce Generalized Multilevel Functional Principal Component Analysis (GM-FPCA), an approach based on the dimension reduction of the linear predictor. Scores associated with specific patterns of activity are shown to be strongly associated with time to death. Extensive simulation studies indicate that GM-FPCA provides accurate estimation of model parameters, is computationally stable, and is scalable in the number of study participants, visits, and observations within visits. R code for implementing the method is provided.

Submitted 10 May, 2024; v1 submitted 23 November, 2023; originally announced November 2023.

arXiv:2309.09897 [pdf, other]

Walking fingerprinting

Authors: Lily Koffman, Ciprian Crainiceanu, Andrew Leroux

Abstract: We consider the problem of predicting an individual's identity from accelerometry data collected during walking. In a previous paper we introduced an approach that transforms the accelerometry time series into an image by constructing its complete empirical autocorrelation distribution. Predictors derived by partitioning this image into grid cells were used in logistic regression to predict individuals. Here we: (1) implement machine learning methods for prediction using the grid cell-derived predictors; (2) derive inferential methods to screen for the most predictive grid cells; and (3) develop a novel multivariate functional regression model that avoids partitioning of the predictor space into cells. Prediction methods are compared on two open source data sets: (1) accelerometry data collected from $32$ individuals walking on a $1.06$ kilometer path; and (2) accelerometry data collected from six repetitions of walking on a $20$ meter path on two separate occasions at least one week apart for $153$ study participants. In the $32$-individual study, all methods achieve at least $95$% rank-1 accuracy, while in the $153$-individual study, accuracy varies from $41$% to $98$%, depending on the method and prediction task. Methods provide insights into why some individuals are easier to predict than others.

Submitted 18 September, 2023; originally announced September 2023.

Comments: 37 pages, 6 figures, 2 tables. Submitted to Journal of the American Statistical Association

arXiv:2305.19897 [pdf, ps, other]

Hidden Stabilizers, the Isogeny To Endomorphism Ring Problem and the Cryptanalysis of pSIDH

Authors: Mingjie Chen, Muhammad Imran, Gábor Ivanyos, Péter Kutas, Antonin Leroux, Christophe Petit

Abstract: The Isogeny to Endomorphism Ring Problem (IsERP) asks to compute the endomorphism ring of the codomain of an isogeny between supersingular curves in characteristic $p$ given only a representation for this isogeny, i.e. some data and an algorithm to evaluate this isogeny on any torsion point. This problem plays a central role in isogeny-based cryptography; it underlies the security of the pSIDH protocol (ASIACRYPT 2022) and it is at the heart of the recent attacks that broke the SIDH key exchange. Prior to this work, no efficient algorithm was known to solve IsERP for a generic isogeny degree, the hardest case seemingly when the degree is prime. In this paper, we introduce a new quantum polynomial-time algorithm to solve IsERP for isogenies whose degrees are odd and have $O(\log\log p)$ many prime factors. As main technical tools, our algorithm uses a quantum algorithm for computing hidden Borel subgroups, a group action on supersingular isogenies from EUROCRYPT 2021, various algorithms for the Deuring correspondence and a new algorithm to lift arbitrary quaternion order elements modulo an odd integer $N$ with $O(\log\log p)$ many prime factors to powersmooth elements. As a main consequence for cryptography, we obtain a quantum polynomial-time key recovery attack on pSIDH. The technical tools we use may also be of independent interest.

Submitted 31 May, 2023; originally announced May 2023.

arXiv:2305.02389 [pdf, other]

Fast Generalized Functional Principal Components Analysis

Authors: Andrew Leroux, Ciprian Crainiceanu, Julia Wrobel

Abstract: We propose a new fast generalized functional principal components analysis (fast-GFPCA) algorithm for dimension reduction of non-Gaussian functional data. The method consists of: (1) binning the data within the functional domain; (2) fitting local random intercept generalized linear mixed models in every bin to obtain the initial estimates of the person-specific functional linear predictors; (3) using fast functional principal component analysis to smooth the linear predictors and obtain their eigenfunctions; and (4) estimating the global model conditional on the eigenfunctions of the linear predictors. An extensive simulation study shows that fast-GFPCA performs as well as or better than existing state-of-the-art approaches, is orders of magnitude faster than existing general purpose GFPCA methods, and scales up well with both the number of observed curves and observations per curve. Methods were motivated by and applied to a study of active/inactive physical activity profiles obtained from wearable accelerometers in the NHANES 2011-2014 study. The method can be implemented by any user familiar with mixed model software, though the R package fastGFPCA is provided for convenience.

Submitted 3 June, 2023; v1 submitted 3 May, 2023; originally announced May 2023.

arXiv:2301.08531 [pdf, ps, other]

Computation of Hilbert class polynomials and modular polynomials from supersingular elliptic curves

Authors: Antonin Leroux

Abstract: We present several new heuristic algorithms to compute class polynomials and modular polynomials modulo a prime $p$ by revisiting the idea of working with supersingular elliptic curves. The best known algorithms to this date are based on ordinary curves, due to the supposed inefficiency of the supersingular case. While this was true a decade ago, the recent advances in the study of supersingular curves through the Deuring correspondence, motivated by isogeny-based cryptography, have provided all the tools to perform the necessary tasks efficiently.

Submitted 15 December, 2023; v1 submitted 20 January, 2023; originally announced January 2023.

arXiv:2208.13936 [pdf, other]

Empirical Likelihood Inference of Variance Components in Linear Mixed-Effects Models

Authors: J. Zhang, W. Guo, J. S. Carpenter, Andrew Leroux, K. R. Merikangas, N. G. Martin, I. B. Hickie, H. Shou, H. Li

Abstract: Linear mixed-effects models are widely used in analyzing repeated measures data, including clustered and longitudinal data, where inferences of both fixed effects and variance components are of importance. Unlike the fixed effect inference that has been well studied, inference on the variance components is more challenging due to the null value being on the boundary and the nuisance parameters of the fixed effects. Existing methods often require strong distributional assumptions on the random effects and random errors. In this paper, we develop empirical likelihood-based methods for the inference of the variance components in the presence of fixed effects. A nonparametric version of Wilks' theorem for the proposed empirical likelihood ratio statistics for variance components is derived. We also develop an empirical likelihood test for multiple variance components related to a sequence of correlated outcomes. Simulation studies demonstrate that the proposed methods exhibit better type 1 error control than the commonly used likelihood ratio tests when the Gaussian distributional assumptions of the random effects are violated. We apply the methods to investigate the heritability of physical activity as measured by a wearable device in the Australian Twin study and observe that such activity is heritable only in the quantile range from 0.375 to 0.514.

Submitted 29 August, 2022; originally announced August 2022.

arXiv:2205.08439 [pdf, other]

A case study of glucose levels during sleep using fast function on scalar regression inference

Authors: Renat Sergazinov, Andrew Leroux, Erjia Cui, Ciprian Crainiceanu, R. Nisha Aurora, Naresh M. Punjabi, Irina Gaynanova

Abstract: Continuous glucose monitors (CGMs) are increasingly used to measure blood glucose levels and provide information about the treatment and management of diabetes. Our motivating study contains CGM data during sleep for 174 study participants with type II diabetes mellitus measured at a 5-minute frequency for an average of 10 nights. We aim to quantify the effects of diabetes medications and sleep apnea severity on glucose levels. Statistically, this is an inference question about the association between scalar covariates and functional responses. However, many characteristics of the data make analyses difficult, including (1) non-stationary within-day patterns; (2) substantial between-day heterogeneity, non-Gaussianity, and outliers; (3) large dimensionality due to the number of study participants, sleep periods, and time points. We evaluate and compare two methods: fast univariate inference (FUI) and functional additive mixed models (FAMM). We introduce a new approach for calculating p-values for testing a global null effect of covariates using FUI, and provide practical guidelines for speeding up FAMM computations, making it feasible for our data. While FUI and FAMM are philosophically different, they lead to similar point estimators in our study. In contrast to FAMM, FUI is fast, accounts for within-day correlations, and enables the construction of joint confidence intervals. Our analyses reveal that: (1) biguanide medication and sleep apnea severity significantly affect glucose trajectories during sleep, and (2) the estimated effects are time-invariant.

Submitted 17 May, 2022; originally announced May 2022.

arXiv:2003.10118 [pdf, ps, other]

Faster computation of isogenies of large prime degree

Authors: Daniel Bernstein, Luca de Feo, Antonin Leroux, Benjamin Smith

Abstract: Let $\mathcal{E}/\mathbb{F}_q$ be an elliptic curve, and $P$ a point in $\mathcal{E}(\mathbb{F}_q)$ of prime order $\ell$. Vélu's formulae let us compute a quotient curve $\mathcal{E}' = \mathcal{E}/\langle{P}\rangle$ and rational maps defining a quotient isogeny $φ: \mathcal{E} \to \mathcal{E}'$ in $\tilde{O}(\ell)$ $\mathbb{F}_q$-operations, where the $\tilde{O}$ is uniform in $q$. This article shows how to compute $\mathcal{E}'$, and $φ(Q)$ for $Q$ in $\mathcal{E}(\mathbb{F}_q)$, using only $\tilde{O}(\sqrt{\ell})$ $\mathbb{F}_q$-operations, where the $\tilde{O}$ is again uniform in $q$. As an application, this article speeds up some computations used in the isogeny-based cryptosystems CSIDH and CSURF.

Submitted 23 March, 2020; originally announced March 2020.

arXiv:1801.08310 [pdf, ps, other]

Information gain ratio correction: Improving prediction with more balanced decision tree splits

Authors: Antonin Leroux, Matthieu Boussard, Remi Dès

Abstract: Decision tree algorithms use a gain function to select the best split during the tree's induction. This function is crucial to obtain trees with high predictive accuracy. Some gain functions can suffer from a bias when they compare splits of different arities. Quinlan proposed a gain ratio in C4.5's information gain function to fix this bias. In this paper, we present an updated version of the gain ratio that performs better as it tries to fix the gain ratio's bias for unbalanced trees and some splits with low predictive interest.

Submitted 25 January, 2018; originally announced January 2018.

Comments: 7 pages
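
For readers unfamiliar with the baseline this abstract refers to, the following Python sketch computes Quinlan's standard gain ratio (information gain divided by split information). It is the classical C4.5 quantity only, not the corrected version the paper proposes; the function names `entropy` and `gain_ratio` and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(labels, branch_ids):
    """Classical C4.5 gain ratio of a candidate split (illustrative sketch).

    labels     : class label of each sample
    branch_ids : the branch each sample is sent to by the split
    """
    n = len(labels)
    branches = {}
    for y, b in zip(labels, branch_ids):
        branches.setdefault(b, []).append(y)

    # Information gain: entropy before the split minus the
    # size-weighted entropy of the branches after it.
    after = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    info_gain = entropy(labels) - after

    # Split information penalizes splits with many branches.
    split_info = entropy(branch_ids)
    return info_gain / split_info if split_info > 0 else 0.0


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    print(gain_ratio(labels, ["a", "a", "b", "b"]))  # binary split
    print(gain_ratio(labels, ["a", "b", "c", "d"]))  # four-way split
```

Both toy splits separate the classes perfectly, so they have the same information gain; the four-way split receives the lower ratio because its split information is larger, which is the arity penalty the abstract describes.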

arXiv:1706.05416 [pdf]

Epidemiology of Objectively Measured Bedtime and Chronotype in the US adolescents and adults: NHANES 2003-2006

Authors: Jacek K. Urbanek, Adam Spira, Junrui Di, Andrew Leroux, Ciprian Crainiceanu, Vadim Zipunnikov

Abstract: Background: We propose a method for estimating the timing of in-bed intervals using objective data in a large representative U.S. sample, and quantify the association between these intervals and age, sex, and day of the week. Methods: The study included 11,951 participants six years and older from the National Health and Nutrition Examination Survey (NHANES) 2003-2006, who wore accelerometers to measure physical activity for seven consecutive days. Participants were instructed to remove the device just before the nighttime sleep period and put it back on immediately after. This nighttime period of non-wear was defined in this paper as the objective bedtime (OBT), an objectively estimated record of the in-bed interval. For each night of the week, we estimated two measures: the duration of the OBT (OBT-D) and, as a measure of the chronotype, the midpoint of the OBT (OBT-M). We estimated day-of-the-week-specific OBT-D and OBT-M using gender-specific population percentile curves. Differences in OBT-M (chronotype) and OBT-D (the amount of time spent in bed) by age and sex were estimated using regression models. Results: The estimates of OBT-M and their differences among age groups were consistent with the estimates of chronotype obtained via self-report in European populations. The average OBT-M varied significantly by age, while OBT-D was less variable with age. The most pronounced differences were observed between OBT-M of weekday and weekend nights. Conclusions: The proposed measures, OBT-D and OBT-M, provide useful information on time in bed and chronotype in NHANES 2003-2006. They identify within-week patterns of bedtime and can be used to study associations between the bedtime and the large number of health outcomes collected in NHANES 2003-2006.

Submitted 16 June, 2017; originally announced June 2017.

Showing 1–11 of 11 results for author: Leroux, A