-
Penalized spline estimation of principal components for sparse functional data: rates of convergence
Authors:
Shiyuan He,
Jianhua Z. Huang,
Kejun He
Abstract:
This paper gives a comprehensive treatment of the convergence rates of penalized spline estimators for simultaneously estimating several leading principal component functions, when the functional data is sparsely observed. The penalized spline estimators are defined as the solution of a penalized empirical risk minimization problem, where the loss function belongs to a general class of loss functions motivated by the matrix Bregman divergence, and the penalty term is the integrated squared derivative. The theory reveals that the asymptotic behavior of penalized spline estimators depends on the interesting interplay between several factors, i.e., the smoothness of the unknown functions, the spline degree, the spline knot number, the penalty order, and the penalty parameter. The theory also classifies the asymptotic behavior into seven scenarios and characterizes whether and how the minimax optimal rates of convergence are achievable in each scenario.
Submitted 8 February, 2024;
originally announced February 2024.
-
A Reproducing Kernel Hilbert Space Approach to Functional Calibration of Computer Models
Authors:
Rui Tuo,
Shiyuan He,
Arash Pourhabib,
Yu Ding,
Jianhua Z. Huang
Abstract:
This paper develops a frequentist solution to the functional calibration problem, where the value of a calibration parameter in a computer model is allowed to vary with the value of control variables in the physical system. The need for functional calibration is motivated by engineering applications where using a constant calibration parameter results in a significant mismatch between outputs from the computer model and the physical experiment. Reproducing kernel Hilbert spaces (RKHS) are used to model the optimal calibration function, defined as the functional relationship between the calibration parameter and control variables that gives the best prediction. This optimal calibration function is estimated through penalized least squares with an RKHS-norm penalty and using physical data. An uncertainty quantification procedure is also developed for such estimates. Theoretical guarantees of the proposed method are provided in terms of prediction consistency and consistency of estimating the optimal calibration function. The proposed method is tested using both real and synthetic data and exhibits more robust performance in prediction and uncertainty quantification than the existing parametric functional calibration method and a state-of-the-art Bayesian method.
Submitted 17 July, 2021;
originally announced July 2021.
-
Deep Personalized Glucose Level Forecasting Using Attention-based Recurrent Neural Networks
Authors:
Mohammadreza Armandpour,
Brian Kidd,
Yu Du,
Jianhua Z. Huang
Abstract:
In this paper, we study the problem of blood glucose forecasting and provide a deep personalized solution. Predicting blood glucose level in people with diabetes has significant value because health complications of abnormal glucose level are serious, sometimes even leading to death. Therefore, having a model that can accurately and quickly warn patients of potential problems is essential. To develop a better deep model for blood glucose forecasting, we analyze the data and detect important patterns. These observations helped us to propose a method that has several key advantages over existing methods: 1- it learns a personalized model for each patient as well as a global model; 2- it uses an attention mechanism and extracted time features to better learn long-term dependencies in the data; 3- it introduces a new, robust training procedure for time series data. We empirically show the efficacy of our model on a real dataset.
Submitted 6 September, 2021; v1 submitted 1 June, 2021;
originally announced June 2021.
-
Asymptotic Properties of Penalized Spline Estimators in Concave Extended Linear Models: Rates of Convergence
Authors:
Jianhua Z. Huang,
Ya Su
Abstract:
This paper develops a general theory on rates of convergence of penalized spline estimators for function estimation when the likelihood functional is concave in candidate functions, where the likelihood is interpreted in a broad sense that includes conditional likelihood, quasi-likelihood, and pseudo-likelihood. The theory allows all feasible combinations of the spline degree, the penalty order, and the smoothness of the unknown functions. According to this theory, the asymptotic behaviors of the penalized spline estimators depend on the interplay between the spline knot number and the penalty parameter. The general theory is applied to obtain results in a variety of contexts, including regression, generalized regression such as logistic regression and Poisson regression, density estimation, conditional hazard function estimation for censored data, quantile regression, diffusion function estimation for a diffusion type process, and estimation of spectral density function of a stationary time series. For multi-dimensional function estimation, the theory (presented in the Supplementary Material) covers both penalized tensor product splines and penalized bivariate splines on triangulations.
Submitted 13 May, 2021;
originally announced May 2021.
-
Simultaneous inference of periods and period-luminosity relations for Mira variable stars
Authors:
Shiyuan He,
Zhenfeng Lin,
Wenlong Yuan,
Lucas M. Macri,
Jianhua Z. Huang
Abstract:
The Period-Luminosity relation (PLR) of Mira variable stars is an important tool to determine astronomical distances. The common approach of estimating the PLR is a two-step procedure that first estimates the Mira periods and then runs a linear regression of magnitude on log period. When the light curves are sparse and noisy, the accuracy of period estimation decreases and can suffer from aliasing effects. Some methods improve accuracy by incorporating complex model structures at the expense of significant computational costs. Another drawback of existing methods is that they only provide point estimation without proper estimation of uncertainty. To overcome these challenges, we develop a hierarchical Bayesian model that simultaneously models the quasi-periodic variations for a collection of Mira light curves while estimating their common PLR. By borrowing strengths through the PLR, our method automatically reduces the aliasing effect, improves the accuracy of period estimation, and is capable of characterizing the estimation uncertainty. We develop a scalable stochastic variational inference algorithm for computation that can effectively deal with the multimodal posterior of period. The effectiveness of the proposed method is demonstrated through simulations, and an application to observations of Miras in the Local Group galaxy M33. Without using ad-hoc period correction tricks, our method achieves a distance estimate of M33 that is consistent with published work. Our method also shows superior robustness to downsampling of the light curves.
Submitted 8 January, 2021;
originally announced January 2021.
-
Impact of body thickness and scattering on III-V triple heterojunction Fin-TFET modeled with atomistic mode space approximation
Authors:
Chin-Yi Chen,
Hesameddin Ilatikhameneh,
Jun Z. Huang,
Gerhard Klimeck,
Michael Povolotskyi
Abstract:
The triple heterojunction TFET was originally proposed to resolve TFET's low ON-current challenge. The carrier transport in such devices is complicated due to the presence of quantum wells and strong scattering. Hence, the full band atomistic NEGF approach, including scattering, is required to model the carrier transport accurately. However, such simulations for devices with realistic dimensions are computationally unfeasible. To mitigate this issue, we have employed the empirical tight-binding mode space approximation to simulate triple heterojunction TFETs with the body thickness up to 12 nm. The triple heterojunction TFET design is optimized using the model to achieve a sub-60mV/dec transfer characteristic under realistic scattering conditions.
Submitted 11 February, 2020;
originally announced February 2020.
-
Functional PCA with Covariate Dependent Mean and Covariance Structure
Authors:
Fei Ding,
Shiyuan He,
David E. Jones,
Jianhua Z. Huang
Abstract:
Incorporating covariates into functional principal component analysis (PCA) can substantially improve the representation efficiency of the principal components and predictive performance. However, many existing functional PCA methods do not make use of covariates, and those that do often have high computational cost or make overly simplistic assumptions that are violated in practice. In this article, we propose a new framework, called Covariate Dependent Functional Principal Component Analysis (CD-FPCA), in which both the mean and covariance structure depend on covariates. We propose a corresponding estimation algorithm, which makes use of spline basis representations and roughness penalties, and is substantially more computationally efficient than competing approaches of adequate estimation and prediction accuracy. A key aspect of our work is our novel approach for modeling the covariance function and ensuring that it is symmetric positive semi-definite. We demonstrate the advantages of our methodology through a simulation study and an astronomical data analysis.
Submitted 19 August, 2023; v1 submitted 30 January, 2020;
originally announced January 2020.
-
Generative Neural Network based Spectrum Sharing using Linear Sum Assignment Problems
Authors:
Ahmed B. Zaky,
Joshua Zhexue Huang,
Kaishun Wu,
Basem M. ElHalawany
Abstract:
Spectrum management and resource allocation (RA) problems are challenging and critical in a vast number of research areas such as wireless communications and computer networks. The traditional approaches for solving such problems usually consume time and memory, especially for large size problems. Recently different machine learning approaches have been considered as potential promising techniques for combinatorial optimization problems, especially the generative model of the deep neural networks. In this work, we propose a resource allocation deep autoencoder network, as one of the promising generative models, for enabling spectrum sharing in underlay device-to-device (D2D) communication by solving linear sum assignment problems (LSAPs). Specifically, we investigate the performance of three different architectures for the conditional variational autoencoders (CVAE). The three proposed architectures are the convolutional neural network (CVAE-CNN) autoencoder, the feed-forward neural network (CVAE-FNN) autoencoder, and the hybrid (H-CVAE) autoencoder. The simulation results show that the proposed approach could be used as a replacement for the conventional RA techniques, such as the Hungarian algorithm, due to its ability to find solutions of LSAPs of different sizes with high accuracy and very fast execution time. Moreover, the simulation results reveal that the accuracy of the proposed hybrid autoencoder architecture outperforms the other proposed architectures and the state-of-the-art DNN techniques.
Submitted 12 October, 2019;
originally announced October 2019.
-
Global Adversarial Attacks for Assessing Deep Learning Robustness
Authors:
Hanbin Hu,
Mit Shah,
Jianhua Z. Huang,
Peng Li
Abstract:
It has been shown that deep neural networks (DNNs) may be vulnerable to adversarial attacks, raising the concern on their robustness particularly for safety-critical applications. Recognizing the local nature and limitations of existing adversarial attacks, we present a new type of global adversarial attacks for assessing global DNN robustness. More specifically, we propose a novel concept of global adversarial example pairs in which each pair of two examples are close to each other but have different class labels predicted by the DNN. We further propose two families of global attack methods and show that our methods are able to generate diverse and intriguing adversarial example pairs at locations far from the training or testing data. Moreover, we demonstrate that DNNs hardened using the strong projected gradient descent (PGD) based (local) adversarial training are vulnerable to the proposed global adversarial example pairs, suggesting that global robustness must be considered while training robust deep learning networks.
Submitted 19 June, 2019;
originally announced June 2019.
-
Near-infrared Mira Period-Luminosity Relations in M33
Authors:
Wenlong Yuan,
Lucas M. Macri,
Atefeh Javadi,
Zhenfeng Lin,
Jianhua Z. Huang
Abstract:
We analyze sparsely-sampled near-infrared (JHKs) light curves of a sample of 1781 Mira variable candidates in M33, originally discovered using I-band time-series observations. We extend our single-band semi-parametric Gaussian process modeling of Mira light curves to a multi-band version and obtain improved period determinations. We use our previous results on near-infrared properties of candidate Miras in the LMC to classify the majority of the M33 sample into Oxygen- or Carbon-rich subsets. We derive Period-Luminosity relations for O-rich Miras and determine a distance modulus for M33 of 24.80 ± 0.06 mag.
Submitted 10 July, 2018;
originally announced July 2018.
-
Safe Active Feature Selection for Sparse Learning
Authors:
Shaogang Ren,
Jianhua Z. Huang,
Shuai Huang,
Xiaoning Qian
Abstract:
We present safe active incremental feature selection (SAIF) to scale up the computation of LASSO solutions. SAIF does not require a solution from a heavier penalty parameter as in sequential screening or updating the full model for each iteration as in dynamic screening. Different from these existing screening methods, SAIF starts from a small number of features and incrementally recruits active features and updates the significantly reduced model. Hence, it is much more computationally efficient and scalable with the number of features. More critically, SAIF has the safe guarantee as it has the convergence guarantee to the optimal solution to the original full LASSO problem. Such an incremental procedure and theoretical convergence guarantee can be extended to fused LASSO problems. Compared with state-of-the-art screening methods as well as working set and homotopy methods, which may not always guarantee the optimal solution, SAIF can achieve superior or comparable efficiency and high scalability with the safe guarantee when facing extremely high dimensional data sets. Experiments with both synthetic and real-world data sets show that SAIF can be up to 50 times faster than dynamic screening, and hundreds of times faster than computing LASSO or fused LASSO solutions without screening.
Submitted 19 June, 2018; v1 submitted 15 June, 2018;
originally announced June 2018.
-
Characterization of Type Ia Supernova Light Curves Using Principal Component Analysis of Sparse Functional Data
Authors:
Shiyuan He,
Lifan Wang,
Jianhua Z. Huang
Abstract:
With growing data from ongoing and future supernova surveys it is possible to empirically quantify the shapes of SNIa light curves in more detail, and to quantitatively relate the shape parameters with the intrinsic properties of SNIa. Building such relationship is critical in controlling systematic errors associated with supernova cosmology. Based on a collection of well-observed SNIa samples accumulated in the past years, we construct an empirical SNIa light curve model using a statistical method called the functional principal component analysis (FPCA) for sparse and irregularly sampled functional data. Using this method, the entire light curve of an SNIa is represented by a linear combination of principal component functions, and the SNIa is represented by a few numbers called principal component scores. These scores are used to establish relations between light curve shapes and physical quantities such as intrinsic color, interstellar dust reddening, spectral line strength, and spectral classes. These relations allow for descriptions of some critical physical quantities based purely on light curve shape parameters. Our study shows that some important spectral feature information is being encoded in the broad band light curves, for instance, we find that the light curve shapes are correlated with the velocity and velocity gradient of the Si II $λ$6355 line. This is important for supernova surveys, e.g., LSST and WFIRST. Moreover, the FPCA light curve model is used to construct the entire light curve shape, which in turn is used in a functional linear form to adjust intrinsic luminosity when fitting distance models.
Submitted 16 February, 2018;
originally announced February 2018.
-
Hole Mobility Model for Si Double-Gate Junctionless Transistors
Authors:
Fan Chen,
Kangliang Wei,
Wei E. I. Sha,
Jun Z. Huang
Abstract:
In this work, a physics based model is developed to calculate the hole mobility of ultra-thin-body double-gate junctionless transistors. Six-band $k\cdot p$ Schrödinger equation and Poisson equation are solved self-consistently. The obtained wave-functions and energies are stored in look-up tables. Hole mobility can be derived using the Kubo-Greenwood formula accounting for impurity, acoustic and optical phonon, and surface roughness scattering. Initial benchmark results are shown.
Submitted 17 November, 2017;
originally announced January 2018.
-
To Wait or Not to Wait: Two-way Functional Hazards Model for Understanding Waiting in Call Centers
Authors:
Gen Li,
Jianhua Z. Huang,
Haipeng Shen
Abstract:
Telephone call centers offer a convenient communication channel between businesses and their customers. Efficient management of call centers needs accurate modeling of customer waiting behavior, which contains important information about customer patience (how long a customer is willing to wait) and service quality (how long a customer needs to wait to get served). Hazard functions offer dynamic characterization of customer waiting behavior, and provide critical inputs for agent scheduling. Motivated by this application, we develop a two-way functional hazards (tF-Hazards) model to study customer waiting behavior as a function of two timescales, waiting duration and the time of day that a customer calls in. The model stems from a two-way piecewise constant hazard function, and imposes low-rank structure and smoothness on the hazard rates to enhance interpretability. We exploit an alternating direction method of multipliers (ADMM) algorithm to optimize a penalized likelihood function of the model. We carefully analyze the data from a US bank call center, and provide informative insights about customer patience and service quality patterns along waiting time and across different times of a day. The findings provide primitive inputs for call center agent staffing and scheduling, as well as for call center practitioners to understand the effect of system protocols on customer waiting behavior.
Submitted 31 December, 2017;
originally announced January 2018.
-
A Random Sample Partition Data Model for Big Data Analysis
Authors:
Salman Salloum,
Yulin He,
Joshua Zhexue Huang,
Xiaoliang Zhang,
Tamer Z. Emara,
Chenghao Wei,
Heping He
Abstract:
Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model to represent a big data set as a set of non-overlapping data subsets, called RSP data blocks, where each RSP data block has a probability distribution similar to the whole big data set. Under this data model, efficient block level sampling is used to randomly select RSP data blocks, replacing expensive record level sampling to select sample data from a big distributed data set on a computing cluster. We show how RSP data blocks can be employed to estimate statistics of a big data set and build models which are equivalent to those built from the whole big data set. In this approach, analysis of a big data set becomes analysis of a few RSP data blocks which have been generated in advance on the computing cluster. Therefore, the new method for data analysis based on RSP data blocks is scalable to big data.
Submitted 20 January, 2018; v1 submitted 12 December, 2017;
originally announced December 2017.
-
Robust Mode Space Approach for Atomistic Modeling of Realistically Large Nanowire Transistors
Authors:
Jun Z. Huang,
Hesameddin Ilatikhameneh,
Michael Povolotskyi,
Gerhard Klimeck
Abstract:
Atomistic quantum transport simulation of realistically large devices is computationally very demanding. The widely used mode space (MS) approach can significantly reduce the numerical cost but good MS basis is usually very hard to obtain for atomistic full-band models. In this work, a robust and parallel algorithm is developed to optimize the MS basis for atomistic nanowires. This enables tight binding non-equilibrium Green's function (NEGF) simulation of nanowire MOSFET with realistic cross section of $\rm 10nm\times10nm$ using a small computer cluster. This approach is applied to compare the performance of InGaAs and Si nanowire nMOSFETs with various channel lengths and cross sections. Simulation results with full-band accuracy indicate that InGaAs nanowire nMOSFETs have no drive current advantage over their Si counterparts for cross sections up to about $\rm 10nm\times10nm$.
Submitted 27 December, 2017; v1 submitted 22 October, 2017;
originally announced October 2017.
-
Large Magellanic Cloud Near-Infrared Synoptic Survey. V. Period-Luminosity Relations of Miras
Authors:
Wenlong Yuan,
Lucas M. Macri,
Shiyuan He,
Jianhua Z. Huang,
Shashi M. Kanbur,
Chow-Choong Ngeow
Abstract:
We study the near-infrared properties of 690 Mira candidates in the central region of the Large Magellanic Cloud, based on time-series observations at JHKs. We use densely-sampled I-band observations from the OGLE project to generate template light curves in the near infrared and derive robust mean magnitudes at those wavelengths. We obtain near-infrared Period-Luminosity relations for Oxygen-rich Miras with a scatter as low as 0.12 mag at Ks. We study the Period-Luminosity-Color relations and the color excesses of Carbon-rich Miras, which show evidence for a substantially different reddening law.
Submitted 15 August, 2017;
originally announced August 2017.
-
The M33 Synoptic Stellar Survey. II. Mira Variables
Authors:
Wenlong Yuan,
Shiyuan He,
Lucas M. Macri,
James Long,
Jianhua Z. Huang
Abstract:
We present the discovery of 1847 Mira candidates in the Local Group galaxy M33 using a novel semi-parametric periodogram technique coupled with a Random Forest classifier. The algorithms were applied to ~2.4x10^5 I-band light curves previously obtained by the M33 Synoptic Stellar Survey. We derive preliminary Period-Luminosity relations at optical, near- & mid-infrared wavelengths and compare them to the corresponding relations in the Large Magellanic Cloud.
Submitted 24 March, 2017; v1 submitted 2 March, 2017;
originally announced March 2017.
-
Adaptive Basis Selection for Exponential Family Smoothing Splines with Application in Joint Modeling of Multiple Sequencing Samples
Authors:
Ping Ma,
Nan Zhang,
Jianhua Z. Huang,
Wenxuan Zhong
Abstract:
Second-generation sequencing technologies have replaced array-based technologies and become the default method for genomics and epigenomics analysis. Second-generation sequencing technologies sequence tens of millions of DNA/cDNA fragments in parallel. After the resulting sequences (short reads) are mapped to the genome, one gets a sequence of short read counts along the genome. Effective extraction of signals in these short read counts is the key to the success of sequencing technologies. Nonparametric methods, in particular smoothing splines, have been used extensively for modeling and processing single sequencing samples. However, nonparametric joint modeling of multiple second-generation sequencing samples is still lacking due to computational cost. In this article, we develop an adaptive basis selection method for efficient computation of exponential family smoothing splines for modeling multiple second-generation sequencing samples. Our adaptive basis selection gives a sparse approximation of smoothing splines, yielding a lower-dimensional effective model space for a more scalable computation. The asymptotic analysis shows that the effective model space is rich enough to retain essential features of the data. Moreover, exponential family smoothing spline models computed via adaptive basis selection are shown to have good statistical properties, e.g., convergence at the same rate as that of full basis exponential family smoothing splines. The empirical performance is demonstrated through simulation studies and two second-generation sequencing data examples.
Submitted 6 February, 2017;
originally announced February 2017.
-
Combination of equilibrium and non-equilibrium carrier statistics into an atomistic quantum transport model for tunneling hetero-junctions
Authors:
Tarek A. Ameen,
Hesameddin Ilatikhameneh,
Jun Z. Huang,
Michael Povolotskyi,
Rajib Rahman,
Gerhard Klimeck
Abstract:
Tunneling hetero-junctions (THJs) usually induce confined states at the regions close to the tunnel junction which significantly affect their transport properties. Accurate numerical modeling of such effects requires combining the non-equilibrium coherent quantum transport through tunnel junction, as well as the quasi-equilibrium statistics arising from the strong scattering in the induced quantum wells. In this work, a novel atomistic model is proposed to include both effects: the strong scattering in the regions around THJ and the coherent tunneling. The new model matches reasonably well with experimental measurements of Nitride THJ and provides an efficient engineering tool for performance prediction and design of THJ based devices.
Submitted 4 February, 2017;
originally announced February 2017.
-
A Multiscale Modeling of Triple-Heterojunction Tunneling FETs
Authors:
Jun Z. Huang,
Pengyu Long,
Michael Povolotskyi,
Hesameddin Ilatikhameneh,
Tarek Ameen,
Rajib Rahman,
Mark J. W. Rodwell,
Gerhard Klimeck
Abstract:
A high performance triple-heterojunction (3HJ) design has been previously proposed for tunneling FETs (TFETs). Compared with single heterojunction (HJ) TFETs, the 3HJ TFETs have both shorter tunneling distance and two transmission resonances that significantly improve the ON-state current ($I_{\rm{ON}}$). Coherent quantum transport simulation predicts that $I_{\rm{ON}}=460\rm{μA/μm}$ can be achieved at gate length $L_g=15\rm{nm}$, supply voltage $V_{\rm{DD}}=0.3\rm{V}$, and OFF-state current $I_{\rm{OFF}}=1\rm{nA/μm}$. However, strong electron-phonon and electron-electron scattering in the heavily doped leads implies that the 3HJ devices operate far from the ideal coherent limit. In this study, such scattering effects are assessed by a newly developed multiscale transport model, which combines the ballistic non-equilibrium Green's function method for the channel and the drift-diffusion scattering method for the leads. Simulation results show that the thermalizing scattering in the leads both degrades the 3HJ TFET's subthreshold swing through scattering induced leakage and reduces the turn-on current through the access resistance. Assuming bulk scattering rates and carrier mobilities, $I_{\rm{ON}}$ drops from $460\rm{μA/μm}$ to $254\rm{μA/μm}$, which is still much larger than in the single HJ TFET case.
Submitted 3 April, 2017; v1 submitted 2 January, 2017;
originally announced January 2017.
-
Asymptotic properties of adaptive group Lasso for sparse reduced rank regression
Authors:
Kejun He,
Jianhua Z. Huang
Abstract:
This paper studies the asymptotic properties of the penalized least squares estimator using an adaptive group Lasso penalty for the reduced rank regression. The group Lasso penalty is defined in the way that the regression coefficients corresponding to each predictor are treated as one group. It is shown that under certain regularity conditions, the estimator can achieve the minimax optimal rate of convergence. Moreover, the variable selection consistency can also be achieved, that is, the relevant predictors can be identified with probability approaching one. In the asymptotic theory, the number of response variables, the number of predictors, and the rank number are allowed to grow to infinity with the sample size.
Submitted 24 October, 2016; v1 submitted 21 September, 2016;
originally announced September 2016.
-
Period estimation for sparsely-sampled quasi-periodic light curves applied to Miras
Authors:
Shiyuan He,
Wenlong Yuan,
Jianhua Z. Huang,
James Long,
Lucas M. Macri
Abstract:
We develop a non-linear semi-parametric Gaussian process model to estimate periods of Miras with sparsely-sampled light curves. The model uses a sinusoidal basis for the periodic variation and a Gaussian process for the stochastic changes. We use maximum likelihood to estimate the period and the parameters of the Gaussian process, while integrating out the effects of other nuisance parameters in the model with respect to a suitable prior distribution obtained from earlier studies. Since the likelihood is highly multimodal for period, we implement a hybrid method that applies the quasi-Newton algorithm for Gaussian process parameters and searches the period/frequency parameter over a dense grid.
A large-scale, high-fidelity simulation is conducted to mimic the sampling quality of Mira light curves obtained by the M33 Synoptic Stellar Survey. The simulated data set is publicly available and can serve as a testbed for future evaluation of different period estimation methods. The semi-parametric model outperforms an existing algorithm on this simulated test data set as measured by period recovery rate and quality of the resulting Period-Luminosity relations.
Submitted 17 November, 2016; v1 submitted 21 September, 2016;
originally announced September 2016.
-
Scalable GaSb/InAs tunnel FETs with non-uniform body thickness
Authors:
Jun Z. Huang,
Pengyu Long,
Michael Povolotskyi,
Gerhard Klimeck,
Mark J. W. Rodwell
Abstract:
GaSb/InAs heterojunction tunnel field-effect transistors are strong candidates in building future low-power integrated circuits, as they could provide both steep subthreshold swing and large ON-state current ($I_{\rm{ON}}$). However, at short channel lengths they suffer from large tunneling leakage originating from the small band gap and small effective masses of the InAs channel. As proposed in this article, this problem can be significantly mitigated by reducing the channel thickness while retaining a thick source-channel tunnel junction, thus forming a design with a non-uniform body thickness. Because of the quantum confinement, the thin InAs channel offers a large band gap and large effective masses, reducing the ambipolar and source-to-drain tunneling leakage at OFF state. The thick GaSb/InAs tunnel junction, instead, offers a low tunnel barrier and small effective masses, allowing a large tunnel probability at ON state. In addition, the confinement induced band discontinuity enhances the tunnel electric field and creates a resonant state, further improving $I_{\rm{ON}}$. Atomistic quantum transport simulations show that ballistic $I_{\rm{ON}}=284$A/m is obtained at 15nm channel length, $I_{\rm{OFF}}=1\times10^{-3}$A/m, and $V_{\rm{DD}}=0.3$V. With uniform body thickness, by contrast, the largest achievable $I_{\rm{ON}}$ is only 25A/m. Simulations also indicate that this design is scalable to sub-10nm channel length.
Submitted 17 July, 2016;
originally announced July 2016.
-
P-Type Tunnel FETs With Triple Heterojunctions
Authors:
Jun Z. Huang,
Pengyu Long,
Michael Povolotskyi,
Gerhard Klimeck,
Mark J. W. Rodwell
Abstract:
A triple-heterojunction (3HJ) design is employed to improve p-type InAs/GaSb heterojunction (HJ) tunnel FETs. The added two HJs (AlInAsSb/InAs in the source and GaSb/AlSb in the channel) significantly shorten the tunnel distance and create two resonant states, greatly improving the ON state tunneling probability. Moreover, the source Fermi degeneracy is reduced by the increased source (AlInAsSb) density of states and the OFF state leakage is reduced by the heavier channel (AlSb) hole effective masses. Quantum ballistic transport simulations show that with $V_{DD} = 0.3$V and $I_{OFF} = 10^{-3}$A/m, $I_{ON}$ of 582A/m (488A/m) is obtained at 30nm (15nm) channel length, which is comparable to the n-type 3HJ counterpart and significantly exceeds the p-type silicon MOSFET. Simultaneously, the nonlinear turn on and delayed saturation in the output characteristics are also greatly improved.
Submitted 23 May, 2016;
originally announced May 2016.
-
High-Performance Complementary III-V Tunnel FETs with Strain Engineering
Authors:
Jun Z. Huang,
Yu Wang,
Pengyu Long,
Yaohua Tan,
Michael Povolotskyi,
Gerhard Klimeck
Abstract:
Strain engineering has recently been explored to improve tunnel field-effect transistors (TFETs). Here, we report design and performance of strained ultra-thin-body (UTB) III-V TFETs by quantum transport simulations. It is found that for an InAs UTB confined in [001] orientation, uniaxial compressive strain in [100] or [110] orientation shrinks the band gap while reducing (increasing) the transport (transverse) effective masses. Thus it improves the ON state current of both n-type and p-type UTB InAs TFETs without lowering the source density of states. Applying the strain locally in the source region makes further improvements by suppressing the OFF state leakage. For p-type TFETs, the locally strained area can be extended into the channel to form a quantum well, giving rise to even larger ON state current that is comparable to the n-type ones. Therefore strain engineering is a promising option for improving complementary circuits based on UTB III-V TFETs.
Submitted 3 May, 2016;
originally announced May 2016.
-
Quantum Transport Simulation of III-V TFETs with Reduced-Order K.P Method
Authors:
Jun Z. Huang,
Lining Zhang,
Pengyu Long,
Michael Povolotskyi,
Gerhard Klimeck
Abstract:
III-V tunneling field-effect transistors (TFETs) offer great potential for future low-power electronics applications due to their steep subthreshold slope and large "on" current. Their 3D quantum transport study using non-equilibrium Green's function method is computationally very intensive, in particular when combined with multiband approaches such as the eight-band K.P method. To reduce the numerical cost, an efficient reduced-order method is developed in this article and applied to study homojunction InAs and heterojunction GaSb-InAs nanowire TFETs. Device performances are obtained for various channel widths, channel lengths, crystal orientations, doping densities, source pocket lengths, and strain conditions.
Submitted 8 November, 2015;
originally announced November 2015.
-
A Study of Functional Depths
Authors:
James P. Long,
Jianhua Z. Huang
Abstract:
Functional depth is used for ranking functional observations from most outlying to most typical. The ranks produced by functional depth have been proposed as the basis for functional classifiers, rank tests, and data visualization procedures. Many of the proposed functional depths are invariant to domain permutation, an unusual property for a functional data analysis procedure. Essentially these depths treat functional data as if it were multivariate data. In this work, we compare the performance of several existing functional depths to a simple adaptation of an existing multivariate depth notion, $L^\infty$ depth ($L^{\infty}D$). On simulated and real data, we show $L^{\infty}D$ has performance comparable or superior to several existing notions of functional depth. In addition, we review how depth functions are evaluated and propose some improvements. In particular, we show that empirical depth function asymptotics can be misleading and instead propose a new method, the rank-rank plot, for evaluating empirical depth rank stability.
Submitted 1 November, 2016; v1 submitted 3 June, 2015;
originally announced June 2015.
-
Efficient semiparametric estimation in generalized partially linear additive models for longitudinal/clustered data
Authors:
Guang Cheng,
Lan Zhou,
Jianhua Z. Huang
Abstract:
We consider efficient estimation of the Euclidean parameters in generalized partially linear additive models for longitudinal/clustered data when multiple covariates need to be modeled nonparametrically, and propose an estimation procedure based on a spline approximation of the nonparametric part of the model and the generalized estimating equations (GEE). Although the model in consideration is natural and useful in many practical applications, the literature on this model is very limited because of challenges in dealing with dependent data for nonparametric additive models. We show that the proposed estimators are consistent and asymptotically normal even if the covariance structure is misspecified. An explicit consistent estimate of the asymptotic variance is also provided. Moreover, we derive the semiparametric efficiency score and information bound under general moment conditions. By showing that our estimators achieve the semiparametric information bound, we effectively establish their efficiency in a stronger sense than what is typically considered for GEE. The derivation of our asymptotic results relies heavily on the empirical processes tools that we develop for the longitudinal/clustered data. Numerical results are used to illustrate the finite sample performance of the proposed estimators.
Submitted 4 February, 2014;
originally announced February 2014.
-
Bayesian object classification of gold nanoparticles
Authors:
Bledar A. Konomi,
Soma S. Dhavala,
Jianhua Z. Huang,
Subrata Kundu,
David Huitink,
Hong Liang,
Yu Ding,
Bani K. Mallick
Abstract:
The properties of materials synthesized with nanoparticles (NPs) are highly correlated to the sizes and shapes of the nanoparticles. The transmission electron microscopy (TEM) imaging technique can be used to measure the morphological characteristics of NPs, which can be simple circles or more complex irregular polygons with varying degrees of scales and sizes. A major difficulty in analyzing the TEM images is the overlapping of objects, having different morphological properties with no specific information about the number of objects present. Furthermore, the objects lying along the boundary render automated image analysis much more difficult. To overcome these challenges, we propose a Bayesian method based on the marked-point process representation of the objects. We derive models, both for the marks which parameterize the morphological aspects and the points which determine the location of the objects. The proposed model is an automatic image segmentation and classification procedure, which simultaneously detects the boundaries and classifies the NPs into one of the predetermined shape families. We execute the inference by sampling the posterior distribution using Markov chain Monte Carlo (MCMC) since the posterior is doubly intractable. We apply our novel method to several TEM imaging samples of gold NPs, producing the needed statistical characterization of their morphology.
Submitted 5 December, 2013;
originally announced December 2013.
-
Robust regularized singular value decomposition with application to mortality data
Authors:
Lingsong Zhang,
Haipeng Shen,
Jianhua Z. Huang
Abstract:
We develop a robust regularized singular value decomposition (RobRSVD) method for analyzing two-way functional data. The research is motivated by the application of modeling human mortality as a smooth two-way function of age group and year. The RobRSVD is formulated as a penalized loss minimization problem where a robust loss function is used to measure the reconstruction error of a low-rank matrix approximation of the data, and an appropriately defined two-way roughness penalty function is used to ensure smoothness along each of the two functional domains. By viewing the minimization problem as two conditional regularized robust regressions, we develop a fast iterative reweighted least squares algorithm to implement the method. Our implementation naturally incorporates missing values. Furthermore, our formulation allows rigorous derivation of leave-one-row/column-out cross-validation and generalized cross-validation criteria, which enable computationally efficient data-driven penalty parameter selection. The advantages of the new robust method over nonrobust ones are shown via extensive simulation studies and the mortality rate application.
Submitted 29 November, 2013;
originally announced November 2013.
-
Asymptotic optimality and efficient computation of the leave-subject-out cross-validation
Authors:
Ganggang Xu,
Jianhua Z. Huang
Abstract:
Although the leave-subject-out cross-validation (CV) has been widely used in practice for tuning parameter selection for various nonparametric and semiparametric models of longitudinal data, its theoretical property is unknown and solving the associated optimization problem is computationally expensive, especially when there are multiple tuning parameters. In this paper, by focusing on the penalized spline method, we show that the leave-subject-out CV is optimal in the sense that it is asymptotically equivalent to the empirical squared error loss function minimization. An efficient Newton-type algorithm is developed to compute the penalty parameters that optimize the CV criterion. Simulated and real data are used to demonstrate the effectiveness of the leave-subject-out CV in selecting both the penalty parameters and the working correlation matrix.
Submitted 19 February, 2013;
originally announced February 2013.
-
A two-way regularization method for MEG source reconstruction
Authors:
Tian Siva Tian,
Jianhua Z. Huang,
Haipeng Shen,
Zhimin Li
Abstract:
The MEG inverse problem refers to the reconstruction of the neural activity of the brain from magnetoencephalography (MEG) measurements. We propose a two-way regularization (TWR) method to solve the MEG inverse problem under the assumptions that only a small number of locations in space are responsible for the measured signals (focality), and each source time course is smooth in time (smoothness). The focality and smoothness of the reconstructed signals are ensured respectively by imposing a sparsity-inducing penalty and a roughness penalty in the data fitting criterion. A two-stage algorithm is developed for fast computation, where a raw estimate of the source time course is obtained in the first stage and then refined in the second stage by the two-way regularization. The proposed method is shown to be effective on both synthetic and real-world examples.
Submitted 28 September, 2012;
originally announced September 2012.
-
Functional dynamic factor models with application to yield curve forecasting
Authors:
Spencer Hays,
Haipeng Shen,
Jianhua Z. Huang
Abstract:
Accurate forecasting of zero coupon bond yields for a continuum of maturities is paramount to bond portfolio management and derivative security pricing. Yet a universal model for yield curve forecasting has been elusive, and prior attempts often resulted in a trade-off between goodness of fit and consistency with economic theory. To address this, herein we propose a novel formulation which connects the dynamic factor model (DFM) framework with concepts from functional data analysis: a DFM with functional factor loading curves. This results in a model capable of forecasting functional time series. Further, in the yield curve context we show that the model retains economic interpretation. Model estimation is achieved through an expectation-maximization algorithm, where the time series parameters and factor loading curves are simultaneously estimated in a single step. Efficient computing is implemented and a data-driven smoothing parameter is nicely incorporated. We show that our model performs very well on forecasting actual yield data compared with existing approaches, especially in regard to profit-based assessment for an innovative trading exercise. We further illustrate the viability of our model to applications outside of yield forecasting.
Submitted 27 September, 2012;
originally announced September 2012.
-
Covariance approximation for large multivariate spatial data sets with an application to multiple climate model errors
Authors:
Huiyan Sang,
Mikyoung Jun,
Jianhua Z. Huang
Abstract:
This paper investigates the cross-correlations across multiple climate model errors. We build a Bayesian hierarchical model that accounts for the spatial dependence of individual models as well as cross-covariances across different climate models. Our method allows for a nonseparable and nonstationary cross-covariance structure. We also present a covariance approximation approach to facilitate the computation in the modeling and analysis of very large multivariate spatial data sets. The covariance approximation consists of two parts: a reduced-rank part to capture the large-scale spatial dependence, and a sparse covariance matrix to correct the small-scale dependence error induced by the reduced rank approximation. We pay special attention to the case that the second part of the approximation has a block-diagonal structure. Simulation results of model fitting and prediction show substantial improvement of the proposed approximation over the predictive process approximation and the independent blocks analysis. We then apply our computational approach to the joint statistical modeling of multiple climate model errors.
Submitted 1 March, 2012;
originally announced March 2012.
-
Sparse logistic principal components analysis for binary data
Authors:
Seokho Lee,
Jianhua Z. Huang,
Jianhua Hu
Abstract:
We develop a new principal components analysis (PCA) type dimension reduction method for binary data. Different from the standard PCA which is defined on the observed data, the proposed PCA is defined on the logit transform of the success probabilities of the binary observations. Sparsity is introduced to the principal component (PC) loading vectors for enhanced interpretability and more stable extraction of the principal components. Our sparse PCA is formulated as solving an optimization problem with a criterion function motivated from a penalized Bernoulli likelihood. A Majorization-Minimization algorithm is developed to efficiently solve the optimization problem. The effectiveness of the proposed sparse logistic PCA method is illustrated by application to a single nucleotide polymorphism data set and a simulation study.
Submitted 16 November, 2010;
originally announced November 2010.
-
Use of multiple singular value decompositions to analyze complex intracellular calcium ion signals
Authors:
Josue G. Martinez,
Jianhua Z. Huang,
Robert C. Burghardt,
Rola Barhoumi,
Raymond J. Carroll
Abstract:
We compare calcium ion signaling ($\mathrm {Ca}^{2+}$) between two exposures; the data are presented as movies, or, more prosaically, time series of images. This paper describes novel uses of singular value decompositions (SVD) and weighted versions of them (WSVD) to extract the signals from such movies, in a way that is semi-automatic and tuned closely to the actual data and their many complexities. These complexities include the following. First, the images themselves are of no interest: all interest focuses on the behavior of individual cells across time, and thus, the cells need to be segmented in an automated manner. Second, the cells themselves have 100$+$ pixels, so that they form 100$+$ curves measured over time, so that data compression is required to extract the features of these curves. Third, some of the pixels in some of the cells are subject to image saturation due to bit depth limits, and this saturation needs to be accounted for if one is to normalize the images in a reasonably unbiased manner. Finally, the $\mathrm {Ca}^{2+}$ signals have oscillations or waves that vary with time and these signals need to be extracted. Thus, our aim is to show how to use multiple weighted and standard singular value decompositions to detect, extract and clarify the $\mathrm {Ca}^{2+}$ signals. Our signal extraction methods then lead to simple although finely focused statistical methods to compare $\mathrm {Ca}^{2+}$ signals across experimental conditions.
Submitted 28 September, 2010;
originally announced September 2010.
-
Bootstrap consistency for general semiparametric $M$-estimation
Authors:
Guang Cheng,
Jianhua Z. Huang
Abstract:
Consider $M$-estimation in a semiparametric model that is characterized by a Euclidean parameter of interest and an infinite-dimensional nuisance parameter. As a general purpose approach to statistical inferences, the bootstrap has found wide applications in semiparametric $M$-estimation and, because of its simplicity, provides an attractive alternative to the inference approach based on the asymptotic distribution theory. The purpose of this paper is to provide theoretical justifications for the use of bootstrap as a semiparametric inferential tool. We show that, under general conditions, the bootstrap is asymptotically consistent in estimating the distribution of the $M$-estimate of Euclidean parameter; that is, the bootstrap distribution asymptotically imitates the distribution of the $M$-estimate. We also show that the bootstrap confidence set has the asymptotically correct coverage probability. These general conclusions hold, in particular, when the nuisance parameter is not estimable at root-$n$ rate, and apply to a broad class of bootstrap methods with exchangeable bootstrap weights. This paper provides a first general theoretical study of the bootstrap in semiparametric models.
Submitted 3 February, 2011; v1 submitted 6 June, 2009;
originally announced June 2009.
-
Functional principal components analysis via penalized rank one approximation
Authors:
Jianhua Z. Huang,
Haipeng Shen,
Andreas Buja
Abstract:
Two existing approaches to functional principal components analysis (FPCA) are due to Rice and Silverman (1991) and Silverman (1996), both based on maximizing variance but introducing penalization in different ways. In this article we propose an alternative approach to FPCA using penalized rank one approximation to the data matrix. Our contributions are four-fold: (1) by considering invariance under scale transformation of the measurements, the new formulation sheds light on how regularization should be performed for FPCA and suggests an efficient power algorithm for computation; (2) it naturally incorporates spline smoothing of discretized functional data; (3) the connection with smoothing splines also facilitates construction of cross-validation or generalized cross-validation criteria for smoothing parameter selection that allow efficient computation; (4) different smoothing parameters are permitted for different FPCs. The methodology is illustrated with a real data example and a simulation.
Submitted 30 July, 2008;
originally announced July 2008.
-
Forecasting time series of inhomogeneous Poisson processes with application to call center workforce management
Authors:
Haipeng Shen,
Jianhua Z. Huang
Abstract:
We consider forecasting the latent rate profiles of a time series of inhomogeneous Poisson processes. The work is motivated by operations management of queueing systems, in particular, telephone call centers, where accurate forecasting of call arrival rates is a crucial primitive for efficient staffing of such centers. Our forecasting approach utilizes dimension reduction through a factor analysis of Poisson variables, followed by time series modeling of factor score series. Time series forecasts of factor scores are combined with factor loadings to yield forecasts of future Poisson rate profiles. Penalized Poisson regressions on factor loadings guided by time series forecasts of factor scores are used to generate dynamic within-process rate updating. Methods are also developed to obtain distributional forecasts. Our methods are illustrated using simulation and real data. The empirical results demonstrate how forecasting and dynamic updating of call arrival rates can affect the accuracy of call center staffing.
Submitted 25 July, 2008;
originally announced July 2008.
-
Clustering Categorical Data Streams
Authors:
Zengyou He,
Xiaofei Xu,
Shengchun Deng,
Joshua Zhexue Huang
Abstract:
The data stream model has been defined for new classes of applications involving massive data being generated at a fast pace. Web click stream analysis and detection of network intrusions are two examples. Cluster analysis on data streams becomes more difficult, because the data objects in a data stream must be accessed in order and can be read only once or a few times with limited resources. Recently, a few clustering algorithms have been developed for analyzing numeric data streams. However, to our knowledge, no algorithm exists to date for clustering categorical data streams. In this paper, we propose an efficient clustering algorithm for analyzing categorical data streams. It has been proved that the proposed algorithm has a small memory footprint. We provide empirical analysis on the performance of the algorithm in clustering both synthetic and real data streams.
Submitted 13 December, 2004;
originally announced December 2004.