-
Comparing estimators of discriminative performance of time-to-event models
Authors:
Ying **,
Andrew Leroux
Abstract:
Predicting the timing and occurrence of events is a major focus of data science applications, especially in the context of biomedical research. Performance for models estimating these outcomes, often referred to as time-to-event or survival outcomes, is frequently summarized using measures of discrimination, in particular time-dependent AUC and concordance. Many estimators for these quantities have been proposed which can be broadly categorized as either semi-parametric estimators or non-parametric estimators. In this paper, we review various estimators' mathematical construction and compare the behavior of the two classes of estimators. Importantly, we identify a previously unknown feature of the class of semi-parametric estimators that can result in vastly over-optimistic out-of-sample estimation of discriminative performance in common applied tasks. Although these semi-parametric estimators are popular in practice, the phenomenon we identify here suggests this class of estimators may be inappropriate for use in model assessment and selection based on out-of-sample evaluation criteria. This is due to the semi-parametric estimators' bias in favor of models that are overfit when using out-of-sample prediction criteria (e.g., cross validation). Non-parametric estimators, which do not exhibit this behavior, are highly variable for local discrimination. We propose to address the high variability problem through penalized regression splines smoothing. The behavior of various estimators of time-dependent AUC and concordance is illustrated via a simulation study using two different mechanisms that produce over-optimistic out-of-sample estimates using semi-parametric estimators. Estimators are further compared in a case study using data from the National Health and Nutrition Examination Survey (NHANES) 2011-2014.
Submitted 6 June, 2024;
originally announced June 2024.
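As a rough illustration of what a non-parametric concordance estimate looks like, the sketch below computes Harrell's C-index, one widely used estimator of this kind. It is not taken from the paper and may differ from the estimators the paper compares; the function name `harrell_c` and its arguments are hypothetical.

```python
def harrell_c(time, event, risk):
    """Harrell's concordance index (illustrative sketch, not the paper's code).

    time  : observed follow-up times
    event : 1 if the event was observed, 0 if censored
    risk  : model risk scores (higher means earlier predicted event)
    """
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is anchored on a subject with an observed event
        for j in range(n):
            if time[i] < time[j]:  # subject j outlived subject i
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1 means every comparable pair is ordered correctly by the risk score, and 0.5 corresponds to random ordering.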
-
Analysis of Active/Inactive Patterns in the NHANES Data using Generalized Multilevel Functional Principal Component Analysis
Authors:
Xinkai Zhou,
Julia Wrobel,
Ciprian M. Crainiceanu,
Andrew Leroux
Abstract:
Between 2011 and 2014 NHANES collected objectively measured physical activity data using wrist-worn accelerometers for tens of thousands of individuals for up to seven days. Here we analyze the minute-level indicators of being active, which can be viewed as binary (because there is an active indicator at every minute), multilevel (because there are multiple days of data for each study participant), functional (because within-day data can be viewed as a function of time) data. To extract within- and between-participant directions of variation in the data, we introduce Generalized Multilevel Functional Principal Component Analysis (GM-FPCA), an approach based on the dimension reduction of the linear predictor. Scores associated with specific patterns of activity are shown to be strongly associated with time to death. Extensive simulation studies indicate that GM-FPCA provides accurate estimation of model parameters, is computationally stable, and is scalable in the number of study participants, visits, and observations within visits. R code for implementing the method is provided.
Submitted 10 May, 2024; v1 submitted 23 November, 2023;
originally announced November 2023.
-
Walking fingerprinting
Authors:
Lily Koffman,
Ciprian Crainiceanu,
Andrew Leroux
Abstract:
We consider the problem of predicting an individual's identity from accelerometry data collected during walking. In a previous paper we introduced an approach that transforms the accelerometry time series into an image by constructing its complete empirical autocorrelation distribution. Predictors derived by partitioning this image into grid cells were used in logistic regression to predict individuals. Here we: (1) implement machine learning methods for prediction using the grid cell-derived predictors; (2) derive inferential methods to screen for the most predictive grid cells; and (3) develop a novel multivariate functional regression model that avoids partitioning of the predictor space into cells. Prediction methods are compared on two open source data sets: (1) accelerometry data collected from $32$ individuals walking on a $1.06$ kilometer path; and (2) accelerometry data collected from six repetitions of walking on a $20$ meter path on two separate occasions at least one week apart for $153$ study participants. In the $32$-individual study, all methods achieve at least $95$% rank-1 accuracy, while in the $153$-individual study, accuracy varies from $41$% to $98$%, depending on the method and prediction task. Methods provide insights into why some individuals are easier to predict than others.
Submitted 18 September, 2023;
originally announced September 2023.
-
Hidden Stabilizers, the Isogeny To Endomorphism Ring Problem and the Cryptanalysis of pSIDH
Authors:
Mingjie Chen,
Muhammad Imran,
Gábor Ivanyos,
Péter Kutas,
Antonin Leroux,
Christophe Petit
Abstract:
The Isogeny to Endomorphism Ring Problem (IsERP) asks to compute the endomorphism ring of the codomain of an isogeny between supersingular curves in characteristic $p$ given only a representation for this isogeny, i.e. some data and an algorithm to evaluate this isogeny on any torsion point. This problem plays a central role in isogeny-based cryptography; it underlies the security of the pSIDH protocol (ASIACRYPT 2022) and it is at the heart of the recent attacks that broke the SIDH key exchange. Prior to this work, no efficient algorithm was known to solve IsERP for a generic isogeny degree, the hardest case seemingly being when the degree is prime.
In this paper, we introduce a new quantum polynomial-time algorithm to solve IsERP for isogenies whose degrees are odd and have $O(\log\log p)$ many prime factors. As main technical tools, our algorithm uses a quantum algorithm for computing hidden Borel subgroups, a group action on supersingular isogenies from EUROCRYPT 2021, various algorithms for the Deuring correspondence and a new algorithm to lift arbitrary quaternion order elements modulo an odd integer $N$ with $O(\log\log p)$ many prime factors to powersmooth elements.
As a main consequence for cryptography, we obtain a quantum polynomial-time key recovery attack on pSIDH. The technical tools we use may also be of independent interest.
Submitted 31 May, 2023;
originally announced May 2023.
-
Fast Generalized Functional Principal Components Analysis
Authors:
Andrew Leroux,
Ciprian Crainiceanu,
Julia Wrobel
Abstract:
We propose a new fast generalized functional principal components analysis (fast-GFPCA) algorithm for dimension reduction of non-Gaussian functional data. The method consists of: (1) binning the data within the functional domain; (2) fitting local random intercept generalized linear mixed models in every bin to obtain the initial estimates of the person-specific functional linear predictors; (3) using fast functional principal component analysis to smooth the linear predictors and obtain their eigenfunctions; and (4) estimating the global model conditional on the eigenfunctions of the linear predictors. An extensive simulation study shows that fast-GFPCA performs as well as or better than existing state-of-the-art approaches, is orders of magnitude faster than existing general purpose GFPCA methods, and scales up well with both the number of observed curves and observations per curve. Methods were motivated by and applied to a study of active/inactive physical activity profiles obtained from wearable accelerometers in the NHANES 2011-2014 study. The method can be implemented by any user familiar with mixed model software, though the R package fastGFPCA is provided for convenience.
Submitted 3 June, 2023; v1 submitted 3 May, 2023;
originally announced May 2023.
-
Computation of Hilbert class polynomials and modular polynomials from supersingular elliptic curves
Authors:
Antonin Leroux
Abstract:
We present several new heuristic algorithms to compute class polynomials and modular polynomials modulo a prime $p$ by revisiting the idea of working with supersingular elliptic curves.
The best known algorithms to date are based on ordinary curves, due to the supposed inefficiency of the supersingular case. While this was true a decade ago, the recent advances in the study of supersingular curves through the Deuring correspondence, motivated by isogeny-based cryptography, have provided all the tools to perform the necessary tasks efficiently.
Submitted 15 December, 2023; v1 submitted 20 January, 2023;
originally announced January 2023.
-
Empirical Likelihood Inference of Variance Components in Linear Mixed-Effects Models
Authors:
J. Zhang,
W. Guo,
J. S. Carpenter,
Andrew Leroux,
K. R. Merikangas,
N. G. Martin,
I. B. Hickie,
H. Shou,
H. Li
Abstract:
Linear mixed-effects models are widely used in analyzing repeated measures data, including clustered and longitudinal data, where inferences of both fixed effects and variance components are of importance. Unlike the fixed effect inference that has been well studied, inference on the variance components is more challenging due to the null value being on the boundary and the nuisance parameters of the fixed effects. Existing methods often require strong distributional assumptions on the random effects and random errors. In this paper, we develop empirical likelihood-based methods for the inference of the variance components in the presence of fixed effects. A nonparametric version of Wilks' theorem for the proposed empirical likelihood ratio statistics for variance components is derived. We also develop an empirical likelihood test for multiple variance components related to a sequence of correlated outcomes. Simulation studies demonstrate that the proposed methods exhibit better type 1 error control than the commonly used likelihood ratio tests when the Gaussian distributional assumptions of the random effects are violated. We apply the methods to investigate the heritability of physical activity as measured by a wearable device in the Australian Twin study and observe that such activity is heritable only in the quantile range from 0.375 to 0.514.
Submitted 29 August, 2022;
originally announced August 2022.
-
A case study of glucose levels during sleep using fast function on scalar regression inference
Authors:
Renat Sergazinov,
Andrew Leroux,
Erjia Cui,
Ciprian Crainiceanu,
R. Nisha Aurora,
Naresh M. Punjabi,
Irina Gaynanova
Abstract:
Continuous glucose monitors (CGMs) are increasingly used to measure blood glucose levels and provide information about the treatment and management of diabetes. Our motivating study contains CGM data during sleep for 174 study participants with type II diabetes mellitus measured at a 5-minute frequency for an average of 10 nights. We aim to quantify the effects of diabetes medications and sleep apnea severity on glucose levels. Statistically, this is an inference question about the association between scalar covariates and functional responses. However, many characteristics of the data make analyses difficult, including (1) non-stationary within-day patterns; (2) substantial between-day heterogeneity, non-Gaussianity, and outliers; and (3) large dimensionality due to the number of study participants, sleep periods, and time points. We evaluate and compare two methods: fast univariate inference (FUI) and functional additive mixed models (FAMM). We introduce a new approach for calculating p-values for testing a global null effect of covariates using FUI, and provide practical guidelines for speeding up FAMM computations, making it feasible for our data. While FUI and FAMM are philosophically different, they lead to similar point estimators in our study. In contrast to FAMM, FUI is fast, accounts for within-day correlations, and enables the construction of joint confidence intervals. Our analyses reveal that: (1) biguanide medication and sleep apnea severity significantly affect glucose trajectories during sleep, and (2) the estimated effects are time-invariant.
Submitted 17 May, 2022;
originally announced May 2022.
-
Faster computation of isogenies of large prime degree
Authors:
Daniel Bernstein,
Luca de Feo,
Antonin Leroux,
Benjamin Smith
Abstract:
Let $\mathcal{E}/\mathbb{F}_q$ be an elliptic curve, and $P$ a point in $\mathcal{E}(\mathbb{F}_q)$ of prime order $\ell$. Vélu's formulae let us compute a quotient curve $\mathcal{E}' = \mathcal{E}/\langle{P}\rangle$ and rational maps defining a quotient isogeny $φ: \mathcal{E} \to \mathcal{E}'$ in $\tilde{O}(\ell)$ $\mathbb{F}_q$-operations, where the $\tilde{O}$ is uniform in $q$. This article shows how to compute $\mathcal{E}'$, and $φ(Q)$ for $Q$ in $\mathcal{E}(\mathbb{F}_q)$, using only $\tilde{O}(\sqrt{\ell})$ $\mathbb{F}_q$-operations, where the $\tilde{O}$ is again uniform in $q$. As an application, this article speeds up some computations used in the isogeny-based cryptosystems CSIDH and CSURF.
Submitted 23 March, 2020;
originally announced March 2020.
-
Information gain ratio correction: Improving prediction with more balanced decision tree splits
Authors:
Antonin Leroux,
Matthieu Boussard,
Remi Dès
Abstract:
Decision tree algorithms use a gain function to select the best split during the tree's induction. This function is crucial to obtain trees with high predictive accuracy. Some gain functions can suffer from a bias when they compare splits of different arities. Quinlan proposed a gain ratio in C4.5's information gain function to fix this bias. In this paper, we present an updated version of the gain ratio that performs better as it tries to fix the gain ratio's bias for unbalanced trees and some splits with low predictive interest.
Submitted 25 January, 2018;
originally announced January 2018.
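For context, the sketch below computes the standard C4.5 gain ratio that the abstract refers to: information gain divided by the entropy of the split itself, which penalizes splits with many branches. It shows only Quinlan's original ratio, not the corrected version proposed in the paper; the function names `entropy` and `gain_ratio` are hypothetical.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(labels, branch_ids):
    """Standard C4.5 gain ratio for one candidate split (illustrative sketch).

    labels     : class label of each sample
    branch_ids : branch each sample is sent to by the split
    """
    n = len(labels)
    branches = {}
    for y, b in zip(labels, branch_ids):
        branches.setdefault(b, []).append(y)
    # Information gain: parent entropy minus weighted child entropy.
    children = sum(len(ys) / n * entropy(ys) for ys in branches.values())
    info_gain = entropy(labels) - children
    # Split information: entropy of the branch sizes themselves.
    split_info = entropy(branch_ids)
    if split_info == 0:
        return 0.0  # all samples fall in one branch; the split is uninformative
    return info_gain / split_info
```

With labels `["a", "a", "b", "b"]`, a two-way split that separates the classes has gain ratio 1.0, while a four-way split into singletons has the same information gain but a gain ratio of only 0.5.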
-
Epidemiology of Objectively Measured Bedtime and Chronotype in the US adolescents and adults: NHANES 2003-2006
Authors:
Jacek K. Urbanek,
Adam Spira,
Junrui Di,
Andrew Leroux,
Ciprian Crainiceanu,
Vadim Zipunnikov
Abstract:
Background: We propose a method for estimating the timing of in-bed intervals using objective data in a large representative U.S. sample, and quantify the association between these intervals and age, sex, and day of the week. Methods: The study included 11,951 participants six years and older from the National Health and Nutrition Examination Survey (NHANES) 2003-2006, who wore accelerometers to measure physical activity for seven consecutive days. Participants were instructed to remove the device just before the nighttime sleep period and put it back on immediately after. This nighttime period of non-wear was defined in this paper as the objective bedtime (OBT), an objectively estimated record of the in-bed interval. For each night of the week, we estimated two measures: the duration of the OBT (OBT-D) and, as a measure of the chronotype, the midpoint of the OBT (OBT-M). We estimated day-of-the-week-specific OBT-D and OBT-M using gender-specific population percentile curves. Differences in OBT-M (chronotype) and OBT-D (the amount of time spent in bed) by age and sex were estimated using regression models. Results: The estimates of OBT-M and their differences among age groups were consistent with the estimates of chronotype obtained via self-report in European populations. The average OBT-M varied significantly by age, while OBT-D was less variable with age. The most pronounced differences were observed between OBT-M of weekday and weekend nights. Conclusions: The proposed measures, OBT-D and OBT-M, provide useful information on time in bed and chronotype in NHANES 2003-2006. They identify within-week patterns of bedtime and can be used to study associations between the bedtime and the large number of health outcomes collected in NHANES 2003-2006.
Submitted 16 June, 2017;
originally announced June 2017.