-
A Metric-based Principal Curve Approach for Learning One-dimensional Manifold
Authors:
Elvis Han Cui,
Sisi Shao
Abstract:
The principal curve is a well-known statistical method for manifold learning that uses concepts from differential geometry. In this paper, we propose a novel metric-based principal curve (MPC) method that learns a one-dimensional manifold of spatial data. Experiments on synthetic datasets and real applications using the MNIST dataset show that our method learns the one-dimensional manifold well in terms of shape.
Submitted 20 May, 2024;
originally announced May 2024.
-
PDXpower: A Power Analysis Tool for Experimental Design in Pre-clinical Xenograft Studies for Uncensored and Censored Outcomes
Authors:
Shanpeng Li,
Donatello Telesca,
Harley I. Kornblum,
David Nathanson,
Frank Pajonk,
Elvis Han Cui,
Joycelynne Palmer,
Gang Li
Abstract:
In cancer research, leveraging patient-derived xenografts (PDXs) in pre-clinical experiments is a crucial approach for assessing innovative therapeutic strategies. Addressing the inherent variability in treatment response among and within individual PDX lines is essential. However, the current literature lacks a user-friendly statistical power analysis tool capable of concurrently determining the required number of PDX lines and animals per line per treatment group in this context. In this paper, we present a simulation-based R package for sample size determination, named PDXpower, which is publicly available at the Comprehensive R Archive Network (https://CRAN.R-project.org/package=PDXpower). The package is designed to estimate the necessary number of both PDX lines and animals per line per treatment group for the design of a PDX experiment, whether for an uncensored outcome or a censored time-to-event outcome. Our sample size considerations rely on two widely used analytical frameworks: the mixed effects ANOVA model for uncensored outcomes and Cox's frailty model for censored outcomes, which effectively account for both inter-PDX variability and intra-PDX correlation in treatment response. Step-by-step illustrations for utilizing the developed package are provided, catering to scenarios with or without preliminary data.
Submitted 13 April, 2024;
originally announced April 2024.
-
Trajectory-aware Principal Manifold Framework for Data Augmentation and Image Generation
Authors:
Elvis Han Cui,
Bingbin Li,
Yanan Li,
Weng Kee Wong,
Donghui Wang
Abstract:
Data augmentation for deep learning benefits model training, image transformation, medical imaging analysis and many other fields. Many existing methods generate new samples from a parametric distribution, like the Gaussian, with little attention to generating samples along the data manifold in either the input or feature space. In this paper, we verify that there are theoretical and practical advantages to using the principal manifold hidden in the feature space over the Gaussian distribution. We then propose a novel trajectory-aware principal manifold framework to restore the manifold backbone and generate samples along a specific trajectory. On top of the autoencoder architecture, we further introduce an intrinsic dimension regularization term to make the manifold more compact and enable few-shot image generation. Experimental results show that the novel framework is able to extract a more compact manifold representation, improve classification accuracy and generate smooth transformations among a few samples.
Submitted 30 July, 2023;
originally announced October 2023.
-
Continuous-time multivariate analysis
Authors:
Biplab Paul,
Philip T. Reiss,
Erjia Cui,
Noemi Foà
Abstract:
The starting point for much of multivariate analysis (MVA) is an $n\times p$ data matrix whose $n$ rows represent observations and whose $p$ columns represent variables. Some multivariate data sets, however, may be best conceptualized not as $n$ discrete $p$-variate observations, but as $p$ curves or functions defined on a common time interval. Here we introduce a framework for extending techniques of multivariate analysis to such settings. The proposed continuous-time multivariate analysis (CTMVA) framework rests on the assumption that the curves can be represented as linear combinations of basis functions such as $B$-splines, as in the Ramsay-Silverman representation of functional data; but whereas functional data analysis extends MVA to the case of observations that are curves rather than vectors -- heuristically, $n\times p$ data with $p$ infinite -- we are instead concerned with what happens when $n$ is infinite. We present continuous-time extensions of the classical MVA methods of covariance and correlation estimation, principal component analysis, Fisher's linear discriminant analysis, and $k$-means clustering. We show that CTMVA can improve on the performance of classical MVA, in particular for correlation estimation and clustering, and can be applied in some settings where classical MVA cannot, including variables observed at disparate time points. CTMVA is illustrated with a novel perspective on a well-known Canadian weather data set, and with applications to data sets involving international development, brain signals, and air quality. The proposed methods are implemented in the publicly available R package ctmva.
Submitted 12 June, 2024; v1 submitted 18 July, 2023;
originally announced July 2023.
-
Scalable regression calibration approaches to correcting measurement error in multi-level generalized functional linear regression models with heteroscedastic measurement errors
Authors:
Yuanyuan Luan,
Roger S. Zoh,
Erjia Cui,
Xue Lan,
Sneha Jadhav,
Carmen D. Tekwe
Abstract:
Wearable devices permit the continuous monitoring of biological processes, such as blood glucose metabolism, and behavior, such as sleep quality and physical activity. The continuous monitoring often occurs in epochs of 60 seconds over multiple days, resulting in high-dimensional longitudinal curves that are best described and analyzed as functional data. From this perspective, the functional data are smooth, latent functions obtained at discrete time intervals and prone to homoscedastic white noise. However, the assumption of homoscedastic errors might not be appropriate in this setting because the devices collect the data serially. While researchers have previously addressed measurement error in scalar covariates prone to errors, less work has been done on correcting measurement error in high-dimensional longitudinal curves prone to heteroscedastic errors. We present two new methods for correcting measurement error in longitudinal functional curves prone to complex measurement error structures in multi-level generalized functional linear regression models. These methods are based on two-stage scalable regression calibration. We assume that the distribution of the scalar responses and the surrogate measures prone to heteroscedastic errors both belong to the exponential family and that the measurement errors follow Gaussian processes. In simulations and sensitivity analyses, we established some finite sample properties of these methods. In our simulations, both regression calibration methods for correcting measurement error performed better than estimators based on averaging the longitudinal functional data and using observations from a single day. We also applied the methods to assess the relationship between physical activity and type 2 diabetes in community-dwelling adults in the United States who participated in the National Health and Nutrition Examination Survey.
Submitted 20 April, 2024; v1 submitted 21 May, 2023;
originally announced May 2023.
-
A Roadmap to Asymptotic Properties with Applications to COVID-19 Data
Authors:
Elvis Han Cui
Abstract:
Asymptotic properties of statistical estimators play a significant role both in practice and in theory. However, many asymptotic results in statistics rely heavily on the independent and identically distributed (iid) assumption, which is not realistic when we have fixed designs. In this article, we build a roadmap of general procedures for deriving asymptotic properties under fixed designs, where the observations need not be iid. We further illustrate their use in many statistical applications. Finally, we apply our results to Poisson regression using a COVID-19 dataset as an illustration to demonstrate the power of these results in practice.
Submitted 6 October, 2022;
originally announced November 2022.
-
A Tutorial on Statistical Models Based on Counting Processes
Authors:
Elvis Han Cui
Abstract:
Since the famous paper written by Kaplan and Meier in 1958, survival analysis has become one of the most important fields in statistics. Nowadays it is one of the most important statistical tools for analyzing epidemiological and clinical data, including data from the COVID-19 pandemic. This article reviews some of the most celebrated and important results and methods in survival analysis, including consistency, asymptotic normality, and bias and variance estimation; the treatment parallels the monograph Statistical Models Based on Counting Processes. Other models and results, such as semi-Markov models and Turnbull's estimator, that fall outside the classical counting process martingale framework are also discussed.
Submitted 23 October, 2022; v1 submitted 30 September, 2022;
originally announced October 2022.
-
D-optimal Approximate Design for Binary Regression and Quantal Response in Toxicology Studies
Authors:
Elvis Han Cui
Abstract:
We provide a systematic treatment of $D$-optimal design for binary regression and quantal response models in toxicology studies. For the two-parameter case, we provide an analytical equation (WC equation) for computing the $D$-optimal design quickly, and when an analytical solution is not available, we apply particle swarm optimization to solve for the $D$-optimal design. Examples with various link functions are given, as well as the sensitivity functions. We extend the two-parameter case to the three-parameter case by providing a neat formula for the determinant of the information matrix. We also suggest that practitioners work with the neat formula to derive optimal designs for three-parameter binary regression models.
Submitted 27 September, 2022;
originally announced September 2022.
-
A case study of glucose levels during sleep using fast function on scalar regression inference
Authors:
Renat Sergazinov,
Andrew Leroux,
Erjia Cui,
Ciprian Crainiceanu,
R. Nisha Aurora,
Naresh M. Punjabi,
Irina Gaynanova
Abstract:
Continuous glucose monitors (CGMs) are increasingly used to measure blood glucose levels and provide information about the treatment and management of diabetes. Our motivating study contains CGM data during sleep for 174 study participants with type II diabetes mellitus measured at a 5-minute frequency for an average of 10 nights. We aim to quantify the effects of diabetes medications and sleep apnea severity on glucose levels. Statistically, this is an inference question about the association between scalar covariates and functional responses. However, many characteristics of the data make analyses difficult, including (1) non-stationary within-day patterns; (2) substantial between-day heterogeneity, non-Gaussianity, and outliers; (3) large dimensionality due to the number of study participants, sleep periods, and time points. We evaluate and compare two methods: fast univariate inference (FUI) and functional additive mixed models (FAMM). We introduce a new approach for calculating p-values for testing a global null effect of covariates using FUI, and provide practical guidelines for speeding up FAMM computations, making it feasible for our data. While FUI and FAMM are philosophically different, they lead to similar point estimators in our study. In contrast to FAMM, FUI is fast, accounts for within-day correlations, and enables the construction of joint confidence intervals. Our analyses reveal that: (1) biguanide medication and sleep apnea severity significantly affect glucose trajectories during sleep, and (2) the estimated effects are time-invariant.
Submitted 17 May, 2022;
originally announced May 2022.
-
Particle swarm optimization in constrained maximum likelihood estimation: a case study
Authors:
Elvis Cui,
Dongyuan Song,
Weng Kee Wong
Abstract:
The aim of this paper is to apply two types of particle swarm optimization (PSO), global best and local best PSO, to a constrained maximum likelihood estimation problem in pseudotime analysis, a sub-field of bioinformatics. The results show that particle swarm optimization is extremely useful and efficient when the optimization problem is non-differentiable and non-convex, so that an analytical solution cannot be derived and gradient-based methods cannot be applied.
Submitted 9 April, 2021;
originally announced April 2021.
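As a rough illustration of the global best variant named in the abstract above, here is a minimal sketch of global-best PSO in plain Python. It is not the authors' implementation: the function name `gbest_pso`, the toy objective `l1_objective`, the parameter values (`w`, `c1`, `c2`), and the use of position clamping to enforce box constraints are all assumptions made for this example. The objective is a non-differentiable stand-in, not a likelihood from pseudotime analysis.

```python
import random


def gbest_pso(objective, bounds, n_particles=30, n_iters=200,
              w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` over a box using global-best PSO (illustrative sketch)."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the box, zero initial velocities.
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    # Personal bests and the swarm-wide (global) best.
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                # Assumed constraint handling: clamp the position to the box.
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def l1_objective(x):
    """Hypothetical non-differentiable toy objective with minimum at (1, -2)."""
    return abs(x[0] - 1.0) + abs(x[1] + 2.0)


if __name__ == "__main__":
    best, best_val = gbest_pso(l1_objective, [(-5.0, 5.0), (-5.0, 5.0)])
    print(best, best_val)
```

Only objective evaluations are needed, so the same loop applies when gradients do not exist. A local best variant would replace `gbest` with the best position found within each particle's neighborhood.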
-
Projection Pursuit with Applications to scRNA Sequencing Data
Authors:
Elvis Han Cui,
Heather Zhou
Abstract:
In this paper, we explore the limitations of PCA as a dimension reduction technique and study its extension, projection pursuit (PP), which is a broad class of linear dimension reduction methods. We first discuss the relevant concepts and theorems and then apply PCA and PP (with negative standardized Shannon's entropy as the projection index) to single-cell RNA sequencing data.
Submitted 13 October, 2022; v1 submitted 16 December, 2019;
originally announced December 2019.