-
Identifying Genetic Variants for Obesity Incorporating Prior Insights: Quantile Regression with Insight Fusion for Ultra-high Dimensional Data
Authors:
Jiantong Wang,
Heng Lian,
Yan Yu,
Heping Zhang
Abstract:
Obesity is widely recognized as a critical and pervasive health concern. We strive to identify important genetic risk factors for obesity from hundreds of thousands of single nucleotide polymorphisms (SNPs). We propose and apply a novel Quantile Regression with Insight Fusion (QRIF) approach that can integrate insights from established studies or domain knowledge to simultaneously perform variable selection and modeling for ultra-high dimensional genetic data, focusing on the high conditional quantiles of body mass index (BMI) that are of most interest. We discover interesting new SNPs and shed new light on a comprehensive view of the underlying genetic risk factors for different levels of BMI. This may potentially pave the way for more precise and targeted treatment strategies. The QRIF approach intends to balance the trade-off between the prior insights and the observed data while being robust to potential false information. We further establish the desirable asymptotic properties under the challenging non-differentiable check loss functions via Huber loss approximation and nonconvex SCAD penalty via local linear approximation. Finally, we develop an efficient algorithm for the QRIF approach. Our simulation studies further demonstrate its effectiveness.
Submitted 17 June, 2024;
originally announced June 2024.
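For readers unfamiliar with the terms in the abstract above, the following is a standard textbook construction of the check loss and a Huber-type smoothing of it. It is only a sketch: the symbols $\tau$, $\delta$, $h_\delta$ and $\rho_{\tau,\delta}$ are notation assumed here, and the exact approximation used in the QRIF paper may differ.

```latex
% Check loss at quantile level \tau \in (0,1), and an equivalent form:
\rho_\tau(u) = u\bigl(\tau - \mathbf{1}\{u < 0\}\bigr)
             = \tfrac{1}{2}\lvert u\rvert + \bigl(\tau - \tfrac{1}{2}\bigr)u .

% Huber loss with smoothing parameter \delta > 0:
h_\delta(u) =
\begin{cases}
  \dfrac{u^2}{2\delta}, & \lvert u\rvert \le \delta,\\[6pt]
  \lvert u\rvert - \dfrac{\delta}{2}, & \lvert u\rvert > \delta .
\end{cases}

% Replacing |u| by h_\delta(u) gives a differentiable surrogate:
\rho_{\tau,\delta}(u) = \tfrac{1}{2}h_\delta(u) + \bigl(\tau - \tfrac{1}{2}\bigr)u ,
\qquad
\sup_u \bigl\lvert \rho_{\tau,\delta}(u) - \rho_\tau(u) \bigr\rvert = \tfrac{\delta}{4} .
```

The bound follows because $0 \le \lvert u\rvert - h_\delta(u) \le \delta/2$ for all $u$, so the surrogate converges uniformly to the check loss as $\delta \to 0$.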
-
Statistical inference for high-dimensional convoluted rank regression
Authors:
Leheng Cai,
Xu Guo,
Heng Lian,
Liping Zhu
Abstract:
High-dimensional penalized rank regression is a powerful tool for modeling high-dimensional data due to its robustness and estimation efficiency. However, the non-smoothness of the rank loss brings great challenges to the computation. To solve this critical issue, high-dimensional convoluted rank regression was recently proposed, and penalized convoluted rank regression estimators were introduced. However, these estimators cannot be directly used to make inference. In this paper, we investigate the inference problem of high-dimensional convoluted rank regression. We first establish estimation error bounds of penalized convoluted rank regression estimators under weaker conditions on the predictors. Based on the penalized convoluted rank regression estimators, we further introduce a debiased estimator. We then provide a Bahadur representation for our proposed estimator. We further develop simultaneous inference procedures. A novel bootstrap procedure is proposed and its theoretical validity is also established. Finally, simulation and real data analysis are conducted to illustrate the merits of our proposed methods.
Submitted 23 May, 2024;
originally announced May 2024.
-
Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models
Authors:
Changxin Yang,
Zhongyi Zhu,
Heng Lian
Abstract:
While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converges with a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in predictions and subset selection, with the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method also exhibits similar superiority.
Submitted 3 May, 2024;
originally announced May 2024.
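As a rough illustration of the Iterative Hard Thresholding idea named in the abstract above, here is a minimal sketch for ordinary least squares. It is not the paper's Tobit estimator or its distributed variant: the squared-error loss is swapped in for simplicity, and the function names `hard_threshold` and `iht_least_squares`, the step-size rule, and the iteration count are all assumptions made for this example.

```python
import numpy as np


def hard_threshold(beta, s):
    """Keep the s largest-magnitude entries of beta and zero out the rest."""
    out = np.zeros_like(beta)
    if s > 0:
        keep = np.argsort(np.abs(beta))[-s:]
        out[keep] = beta[keep]
    return out


def iht_least_squares(X, y, s, n_iter=200):
    """Gradient step on the least-squares loss followed by hard thresholding.

    Illustrative only: uses squared-error loss, not the Tobit likelihood.
    """
    n, p = X.shape
    # Step size 1/L, where L is the largest eigenvalue of X^T X / n.
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = hard_threshold(beta - step * grad, s)
    return beta
```

Each iteration costs one pass over the data plus a sort, and the sparsity level `s` plays the role that a penalty parameter plays in lasso-type methods.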
-
Nonparametric Quantile Regression for Homogeneity Pursuit in Panel Data Models
Authors:
Xiaoyu Zhang,
Di Wang,
Heng Lian,
Guodong Li
Abstract:
Many panel data have the latent subgroup effect on individuals, and it is important to correctly identify these groups since the efficiency of resulting estimators can be improved significantly by pooling the information of individuals within each group. However, the currently assumed parametric and semiparametric relationship between the response and predictors may be misspecified, which leads to a wrong grouping result, and the nonparametric approach hence can be considered to avoid such mistakes. Moreover, the response may depend on predictors in different ways at various quantile levels, and the corresponding grouping structure may also vary. To tackle these problems, this article proposes a nonparametric quantile regression method for homogeneity pursuit in panel data models with individual effects, and a pairwise fused penalty is used to automatically select the number of groups. The asymptotic properties are established, and an ADMM algorithm is also developed. The finite sample performance is evaluated by simulation experiments, and the usefulness of the proposed methodology is further illustrated by an empirical example.
Submitted 22 August, 2022; v1 submitted 3 May, 2021;
originally announced May 2021.
-
High-dimensional vector autoregressive time series modeling via tensor decomposition
Authors:
Di Wang,
Yao Zheng,
Heng Lian,
Guodong Li
Abstract:
The classical vector autoregressive model is a fundamental tool for multivariate time series analysis. However, it involves too many parameters when the number of time series and lag order are even moderately large. This paper proposes to rearrange the transition matrices of the model into a tensor form such that the parameter space can be restricted along three directions simultaneously via tensor decomposition. In contrast, the reduced-rank regression method can restrict the parameter space in only one direction. Besides achieving substantial dimension reduction, the proposed model is interpretable from the factor modeling perspective. Moreover, to handle high-dimensional time series, this paper considers imposing sparsity on factor matrices to improve the model interpretability and estimation efficiency, which leads to a sparsity-inducing estimator. For the low-dimensional case, we derive asymptotic properties of the proposed least squares estimator and introduce an alternating least squares algorithm. For the high-dimensional case, we establish non-asymptotic properties of the sparsity-inducing estimator and propose an ADMM algorithm for regularized estimation. Simulation experiments and a real data example demonstrate the advantages of the proposed approach over various existing methods.
Submitted 3 November, 2020; v1 submitted 14 September, 2019;
originally announced September 2019.
-
A General Framework For Frequentist Model Averaging
Authors:
Priyam Mitra,
Heng Lian,
Ritwik Mitra,
Hua Liang,
Min-ge Xie
Abstract:
Model selection strategies have been routinely employed to determine a model for data analysis in statistics, and further study and inference then often proceed as though the selected model were the true model, known a priori. This practice does not account for the uncertainty introduced by the selection process and the fact that the selected model can possibly be a wrong one. Model averaging approaches try to remedy this issue by combining estimators for a set of candidate models. Specifically, instead of deciding which model is the 'right' one, a model averaging approach suggests fitting a set of candidate models and averaging over the estimators using certain data-adaptive weights. In this paper we establish a general frequentist model averaging framework that does not set any restrictions on the set of candidate models. It greatly broadens the scope of the existing methodologies under the frequentist model averaging development. Assuming the data is from an unknown model, we derive the model averaging estimator and study its limiting distributions and related predictions while taking possible modeling biases into account. We propose a set of optimal weights to combine the individual estimators so that the expected mean squared error of the average estimator is minimized. Simulation studies are conducted to compare the performance of the estimator with that of the existing methods. The results show the benefits of the proposed approach over traditional model selection approaches as well as existing model averaging methods.
Submitted 9 February, 2018;
originally announced February 2018.
-
Debiased distributed learning for sparse partial linear models in high dimensions
Authors:
Shaogao Lv,
Heng Lian
Abstract:
Although various distributed machine learning schemes have been proposed recently for pure linear models and fully nonparametric models, little attention has been paid to distributed optimization for semi-parametric models with multiple-level structures (e.g. sparsity, linearity and nonlinearity). To address these issues, the current paper proposes a new communication-efficient distributed learning algorithm for partially sparse linear models with an increasing number of features. The proposed method is based on the classical divide and conquer strategy for handling big data, and each sub-method defined on each subsample consists of a debiased estimation of the double-regularized least squares approach. With the proposed method, we theoretically prove that our global parametric estimator can achieve the optimal parametric rate in our semi-parametric model given an appropriate partition on the total data. Specifically, the choice of data partition relies on the underlying smoothness of the nonparametric component, but it is adaptive to the sparsity parameter. Even under the non-distributed setting, we develop a new and easy-to-read proof for optimal estimation of the parametric error in the high-dimensional partial linear model. Finally, several simulated experiments are implemented to indicate comparable empirical performance of our debiased technique under the distributed setting.
Submitted 3 November, 2019; v1 submitted 17 August, 2017;
originally announced August 2017.
-
Additive Partially Linear Models for Massive Heterogeneous Data
Authors:
Binhuan Wang,
Yixin Fang,
Heng Lian,
Hua Liang
Abstract:
We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring heterogeneity of each sub-population. We propose an aggregation type of estimators for the commonality parameters that possess the asymptotic optimal bounds and the asymptotic distributions as if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast and the tuning parameters are selected carefully. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. Furthermore, we develop a heterogeneity test for the linear components and a homogeneity test for the non-linear components accordingly. The performance of the proposed methods is evaluated via simulation studies and an application to the Medicare Provider Utilization and Payment data.
Submitted 28 December, 2018; v1 submitted 13 January, 2017;
originally announced January 2017.
-
Greedy Forward Regression for Variable Screening
Authors:
Ming-Yen Cheng,
Sanying Feng,
Gaorong Li,
Heng Lian
Abstract:
Two popular variable screening methods under the ultra-high dimensional setting with the desirable sure screening property are the sure independence screening (SIS) and the forward regression (FR). Both are classical variable screening methods and recently have attracted greater attention under the new light of high-dimensional data analysis. We consider a new and simple screening method that incorporates multiple predictors in each step of forward regression, with decision on which variables to incorporate based on the same criterion. If only one step is carried out, it actually reduces to the SIS. Thus it can be regarded as a generalization and unification of the FR and the SIS. More importantly, it preserves the sure screening property and has similar computational complexity as FR in each step, yet it can discover the relevant covariates in fewer steps. Thus, it reduces the computational burden of FR drastically while retaining advantages of the latter over SIS. Furthermore, we show that it can find all the true variables if the number of steps taken is the same as the correct model size, even when using the original FR. An extensive simulation study and application to two real data examples demonstrate excellent performance of the proposed method.
Submitted 3 November, 2015;
originally announced November 2015.
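The abstract above notes that a single step of the proposed method reduces to sure independence screening (SIS). A minimal sketch of SIS alone, ranking predictors by absolute marginal correlation with the response, might look as follows. It does not implement the paper's greedy forward regression; the function name `sis_screen` and the parameter `d` (number of predictors retained) are assumptions for this example, and columns of `X` are assumed to be non-constant.

```python
import numpy as np


def sis_screen(X, y, d):
    """Return indices of the d predictors with the largest absolute
    marginal correlation with y (sure independence screening).

    Assumes every column of X is non-constant, so no norm is zero.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Pearson correlation of each column of X with the response.
    corr = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    # Sort by decreasing absolute correlation and keep the top d.
    return np.argsort(-np.abs(corr))[:d]
```

Because each predictor is scored independently, the cost is a single matrix-vector product, which is what makes this kind of screening practical when the number of predictors far exceeds the sample size.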
-
Variable Selection and Estimation for Partially Linear Single-index Models with Longitudinal Data
Authors:
Gaorong Li,
Peng Lai,
Heng Lian
Abstract:
In this paper, we consider the partially linear single-index models with longitudinal data. To deal with the variable selection problem in this context, we propose a penalized procedure combined with two bias correction methods, resulting in the bias-corrected generalized estimating equation (GEE) and the bias-corrected quadratic inference function (QIF), which can take into account the correlations. Asymptotic properties of these methods are demonstrated. We also evaluate the finite sample performance of the proposed methods via Monte Carlo simulation studies and a real data analysis.
Submitted 7 February, 2014;
originally announced February 2014.
-
Letter to the Editor
Authors:
Yuao Hu,
Ye Tian,
Heng Lian
Abstract:
The paper by Alfons, Croux and Gelper (2013), Sparse least trimmed squares regression for analyzing high-dimensional large data sets, considered a combination of least trimmed squares (LTS) and lasso penalty for robust and sparse high-dimensional regression. In a recent paper [She and Owen (2011)], a method for outlier detection based on a sparsity penalty on the mean shift parameter was proposed (designated by "SO" in the following). This work is mentioned in Alfons et al. as being an "entirely different approach." Certainly the problem studied by Alfons et al. is novel and interesting.
Submitted 9 December, 2013;
originally announced December 2013.
-
Reduced-rank Regression in Sparse Multivariate Varying-Coefficient Models with High-dimensional Covariates
Authors:
Heng Lian,
Shujie Ma
Abstract:
In genetic studies, not only can the number of predictors obtained from microarray measurements be extremely large, there can also be multiple response variables. Motivated by such a situation, we consider semiparametric dimension reduction methods in sparse multivariate regression models. Previous studies on joint variable and rank selection have focused on parametric models while here we consider the more challenging varying-coefficient models which make the investigation on nonlinear interactions of variables possible. Spline approximation, rank constraints and concave group penalties are utilized for model estimation. Asymptotic oracle properties of the estimators are presented. We also propose reduced-rank independent screening to deal with the situation when the dimension is so high that penalized estimation cannot be efficiently applied. In simulations, we show the advantages of simultaneously performing variable and rank selection. A real data set is analyzed to illustrate the good prediction performance when incorporating interactions between genetic variables and an index variable.
Submitted 24 September, 2013;
originally announced September 2013.
-
Bayesian Quantile Regression for Partially Linear Additive Models
Authors:
Yuao Hu,
Kaifeng Zhao,
Heng Lian
Abstract:
In this article, we develop a semiparametric Bayesian estimation and model selection approach for partially linear additive models in conditional quantile regression. The asymmetric Laplace distribution provides a mechanism for Bayesian inferences of quantile regression models based on the check loss. The advantage of this new method is that nonlinear, linear and zero function components can be separated automatically and simultaneously during model fitting without the need of pre-specification or parameter tuning. This is achieved by spike-and-slab priors using two sets of indicator variables. For posterior inferences, we design an effective partially collapsed Gibbs sampler. Simulation studies are used to illustrate our algorithm. The proposed approach is further illustrated by applications to two real data sets.
Submitted 10 July, 2013;
originally announced July 2013.
-
Minimax Prediction for Functional Linear Regression with Functional Responses in Reproducing Kernel Hilbert Spaces
Authors:
Heng Lian
Abstract:
In this article, we consider convergence rates in functional linear regression with functional responses, where the linear coefficient lies in a reproducing kernel Hilbert space (RKHS). Without assuming that the reproducing kernel and the covariate covariance kernel are aligned, or assuming polynomial rate of decay of the eigenvalues of the covariance kernel, convergence rates in prediction risk are established. The corresponding lower bound in rates is derived by reducing to the scalar response case. Simulation studies and two benchmark datasets are used to illustrate that the proposed approach can significantly outperform the functional PCA approach in prediction.
Submitted 17 November, 2012;
originally announced November 2012.
-
Bayesian Quantile Regression for Single-Index Models
Authors:
Yuao Hu,
Robert B. Gramacy,
Heng Lian
Abstract:
Using an asymmetric Laplace distribution, which provides a mechanism for Bayesian inference of quantile regression models, we develop a fully Bayesian approach to fitting single-index models in conditional quantile regression. In this work, we use a Gaussian process prior for the unknown nonparametric link function and a Laplace distribution on the index vector, with the latter motivated by the recent popularity of the Bayesian lasso idea. We design a Markov chain Monte Carlo algorithm for posterior inference. Careful consideration of the singularity of the kernel matrix, and tractability of some of the full conditional distributions leads to a partially collapsed approach where the nonparametric link function is integrated out in some of the sampling steps. Our simulations demonstrate the superior performance of the Bayesian method versus the frequentist approach. The method is further illustrated by an application to the hurricane data.
Submitted 29 December, 2011; v1 submitted 2 October, 2011;
originally announced October 2011.
-
Shrinkage Estimation and Selection for Multiple Functional Regression
Authors:
Heng Lian
Abstract:
Functional linear regression is a useful extension of simple linear regression and has been investigated by many researchers. However, the functional variable selection problem when multiple functional observations exist, which is the counterpart in the functional context of multiple linear regression, is seldom studied. Here we propose a method using the group smoothly clipped absolute deviation penalty (gSCAD), which can perform regression estimation and variable selection simultaneously. We show that the method can identify the true model consistently and discuss the construction of pointwise confidence intervals for the estimated functional coefficients. Our methodology and theory are verified by simulation studies as well as an application to spectrometrics data.
Submitted 19 August, 2011;
originally announced August 2011.
-
Bias-corrected GEE estimation and smooth-threshold GEE variable selection for single-index models with clustered data
Authors:
Peng Lai,
Qihua Wang,
Heng Lian
Abstract:
In this paper, we present a generalized estimating equations based estimation approach and a variable selection procedure for single-index models when the observed data are clustered. Unlike the case of independent observations, bias-correction is necessary when general working correlation matrices are used in the estimating equations. Our variable selection procedure based on smooth-threshold estimating equations (Ueki, 2009) can automatically eliminate irrelevant parameters by setting them to zero and is computationally simpler than alternative approaches based on shrinkage penalty. The resulting estimator consistently identifies the significant variables in the index, even when the working correlation matrix is misspecified. The asymptotic property of the estimator is the same whether or not the nonzero parameters are known (in both cases we use the same estimating equations), thus achieving the oracle property in the sense of Fan and Li (2001). The finite sample properties of the estimator are illustrated by some simulation examples, as well as a real data application.
Submitted 5 August, 2011;
originally announced August 2011.
-
Semiparametric Bayesian Information Criterion for Model Selection in Ultra-high Dimensional Additive Models
Authors:
Heng Lian
Abstract:
For linear models with a diverging number of parameters, it has recently been shown that modified versions of the Bayesian information criterion (BIC) can identify the true model consistently. However, in many cases there is little justification that the effects of the covariates are actually linear. Thus a semiparametric model, such as the additive model studied here, is a viable alternative. We demonstrate that theoretical results on the consistency of BIC-type criteria can be extended to this more challenging situation, with dimension diverging exponentially fast with sample size. In addition, the noise assumptions are relaxed in our theoretical studies. These efforts significantly enlarge the applicability of the criterion to a more general class of models.
Submitted 25 July, 2011;
originally announced July 2011.
-
Gaussian process single-index models as emulators for computer experiments
Authors:
Robert B. Gramacy,
Heng Lian
Abstract:
A single-index model (SIM) provides for parsimonious multi-dimensional nonlinear regression by combining parametric (linear) projection with univariate nonparametric (non-linear) regression models. We show that a particular Gaussian process (GP) formulation is simple to work with and ideal as an emulator for some types of computer experiment as it can outperform the canonical separable GP regression model commonly used in this setting. Our contribution focuses on drastically simplifying, re-interpreting, and then generalizing a recently proposed fully Bayesian GP-SIM combination, and then illustrating its favorable performance on synthetic data and a real-data computer experiment. Two R packages, both released on CRAN, have been augmented to facilitate inference under our proposed model(s).
Submitted 17 August, 2011; v1 submitted 21 September, 2010;
originally announced September 2010.
-
Flexible Shrinkage Estimation in High-Dimensional Varying Coefficient Models
Authors:
Heng Lian
Abstract:
We consider the problem of simultaneous variable selection and constant coefficient identification in high-dimensional varying coefficient models based on B-spline basis expansion. Both objectives can be considered as types of model selection problem, and we show that they can be achieved by a double shrinkage strategy. We apply the adaptive group Lasso penalty in models involving a diverging number of covariates, which can be much larger than the sample size, but we assume the number of relevant variables is smaller than the sample size via model sparsity. Such so-called ultra-high dimensional settings are especially challenging in semiparametric models such as those we consider here and have not been dealt with before. Under suitable conditions, we show that consistency in terms of both variable selection and constant coefficient identification can be achieved, as well as the oracle property of the constant coefficients. Even in the case that the zero and constant coefficients are known a priori, our results appear to be new in that the model reduces to semivarying coefficient models (a.k.a. partially linear varying coefficient models) with a diverging number of covariates. We also theoretically demonstrate the consistency of a semiparametric BIC-type criterion in this high-dimensional context, extending several previous results. The finite sample behavior of the estimator is evaluated by some Monte Carlo studies.
Submitted 13 August, 2010;
originally announced August 2010.
-
Gaussian Process Models for Nonparametric Functional Regression with Functional Responses
Authors:
Heng Lian
Abstract:
Recently, a nonparametric functional model with functional responses has been proposed within the functional reproducing kernel Hilbert spaces (fRKHS) framework. Motivated by its superior performance and also its limitations, we propose a Gaussian process model whose posterior mode coincides with the fRKHS estimator. The Bayesian approach has several advantages compared to its predecessor. Firstly, the multiple unknown parameters can be inferred together with the regression function in a unified framework. Secondly, as a Bayesian method, the statistical inferences are straightforward through the posterior distributions. We also use the predictive process models adapted from the spatial statistics literature to overcome the computational limitations, thus extending the applicability of this popular technique to a new problem. Modifications of predictive process models are nevertheless critical in our context to obtain valid inferences. The numerical results presented demonstrate the effectiveness of the modifications.
Submitted 10 August, 2010;
originally announced August 2010.
-
A simple and efficient algorithm for fused lasso signal approximator with convex loss function
Authors:
Heng Lian
Abstract:
We consider the augmented Lagrangian method (ALM) as a solver for the fused lasso signal approximator (FLSA) problem. The ALM is a dual method in which squares of the constraint functions are added as penalties to the Lagrangian. In order to apply this method to FLSA, two types of auxiliary variables are introduced to transform the original unconstrained minimization problem into a linearly constrained minimization problem. Each update in this iterative algorithm consists of just a simple one-dimensional convex programming problem, with a closed-form solution in many cases. While the existing literature has mostly focused on the quadratic loss function, our algorithm can be easily implemented for general convex loss. The most attractive feature of this algorithm is its simplicity of implementation compared to other existing fast solvers. We also provide some convergence analysis of the algorithm. Finally, the method is illustrated with some simulation datasets.
Submitted 27 May, 2010;
originally announced May 2010.
-
Time-varying Coefficients Estimation in Differential Equation Models with Noisy Time-varying Covariates
Authors:
Heng Lian
Abstract:
We study the problem of estimating time-varying coefficients in ordinary differential equations. Current theory only applies to the case when the associated state variables are observed without measurement errors as presented in \cite{chenwu08b,chenwu08}. The difficulty arises from the quadratic functional of observations that one needs to deal with instead of the linear functional that appears when state variables contain no measurement errors. We derive the asymptotic bias and variance for the previously proposed two-step estimators using quadratic regression functional theory.
Submitted 6 October, 2009;
originally announced October 2009.
-
Shrinkage Tuning Parameter Selection in Precision Matrices Estimation
Authors:
Heng Lian
Abstract:
Recent literature provides many computational and modeling approaches for covariance matrix estimation in penalized Gaussian graphical models, but relatively little study has been carried out on the choice of the tuning parameter. This paper tries to fill this gap by focusing on the problem of shrinkage parameter selection when estimating sparse precision matrices using the penalized likelihood approach. Previous approaches typically used K-fold cross-validation in this regard. In this paper, we first derive the generalized approximate cross-validation for tuning parameter selection, which is not only a more computationally efficient alternative, but also achieves a smaller error rate for model fitting compared to leave-one-out cross-validation. For consistency in the selection of nonzero entries in the precision matrix, we employ a Bayesian information criterion which provably can identify the nonzero conditional correlations in the Gaussian model. Our simulations demonstrate the general superiority of the two proposed selectors in comparison with leave-one-out cross-validation, ten-fold cross-validation and the Akaike information criterion.
Submitted 6 September, 2009;
originally announced September 2009.
-
Functional Partial Linear Model
Authors:
Heng Lian
Abstract:
When predicting scalar responses in the situation where the explanatory variables are functions, it is sometimes the case that some functional variables are related to responses linearly while other variables have more complicated relationships with the responses. In this paper, we propose a new semi-parametric model to take advantage of both parametric and nonparametric functional modeling. Asymptotic properties of the proposed estimators are established and finite sample behavior is investigated through a small simulation experiment.
Submitted 27 November, 2012; v1 submitted 5 August, 2009;
originally announced August 2009.
-
Total Variation, Adaptive Total Variation and Nonconvex Smoothly Clipped Absolute Deviation Penalty for Denoising Blocky Images
Authors:
Aditya Chopra,
Heng Lian
Abstract:
The total variation-based image denoising model has been generalized and extended in numerous ways, improving its performance in different contexts. We propose a new penalty function motivated by the recent progress in the statistical literature on high-dimensional variable selection. Using a particular instantiation of the majorization-minimization algorithm, the optimization problem can be efficiently solved and the computational procedure realized is similar to the spatially adaptive total variation model. Our two-pixel image model shows theoretically that the new penalty function solves the bias problem inherent in the total variation model. The superior performance of the new penalty is demonstrated through several experiments. Our investigation is limited to "blocky" images which have small total variation.
Submitted 2 June, 2009;
originally announced June 2009.
-
Sparse Bayesian Hierarchical Modeling of High-dimensional Clustering Problems
Authors:
Heng Lian
Abstract:
Clustering is one of the most widely used procedures in the analysis of microarray data, for example with the goal of discovering cancer subtypes based on observed heterogeneity of genetic marks between different tissues. It is well-known that in such high-dimensional settings, the existence of many noise variables can overwhelm the few signals embedded in the high-dimensional space. We propose a novel Bayesian approach based on a Dirichlet process with a sparsity prior that simultaneously performs variable selection and clustering, and also discovers variables that only distinguish a subset of the cluster components. Unlike previous Bayesian formulations, we use the Dirichlet process (DP) both for clustering of samples and for regularizing the high-dimensional mean/variance structure. To solve the computational challenge brought by this double usage of DP, we propose to make use of a sequential sampling scheme embedded within Markov chain Monte Carlo (MCMC) updates to improve the naive implementation of existing algorithms for DP mixture models. Our method is demonstrated on a simulation study and illustrated with the leukemia gene expression dataset.
Submitted 19 April, 2009;
originally announced April 2009.
-
Empirical Likelihood Confidence Intervals for Nonparametric Functional Data Analysis
Authors:
Heng Lian
Abstract:
We consider the problem of constructing confidence intervals for nonparametric functional data analysis using empirical likelihood. In this doubly infinite-dimensional context, we demonstrate Wilks's phenomenon and propose a bias-corrected construction that requires neither undersmoothing nor direct bias estimation. We also extend our results to partially linear regression involving functional data. Our numerical results demonstrate the improved performance of empirical likelihood over approximation based on asymptotic normality.
Submitted 6 April, 2009;
originally announced April 2009.
-
Nonparametric Estimation of Variance Function for Functional Data
Authors:
Heng Lian
Abstract:
This article investigates nonparametric estimation of variance functions for functional data when the mean function is unknown. We obtain asymptotic results for the kernel estimator based on squared residuals. Similar to the finite dimensional case, our asymptotic result shows that the smoothness of the unknown mean function has an effect on the rate of convergence. Our simulation studies demonstrate that the estimator based on residuals performs much better than that based on the conditional second moment of the responses.
Submitted 14 December, 2008;
originally announced December 2008.
-
A note on conditional Akaike information for Poisson regression with random effects
Authors:
Heng Lian
Abstract:
A popular model selection approach for generalized linear mixed-effects models is the Akaike information criterion, or AIC. Among others, \cite{vaida05} pointed out the distinction between marginal and conditional inference depending on the focus of research. The conditional AIC was derived for the linear mixed-effects model, which was later generalized by \cite{liang08}. We show that a similar strategy extends to Poisson regression with random effects, where the conditional AIC can be obtained based on our observations. Simulation studies demonstrate the usage of the criterion.
Submitted 11 October, 2008;
originally announced October 2008.
-
Stochastic adaptation of importance sampler
Authors:
Heng Lian
Abstract:
Improving the efficiency of importance samplers is at the center of research in Monte Carlo methods. While an adaptive approach is usually difficult within the Markov Chain Monte Carlo framework, the counterpart in importance sampling can be justified and validated easily. We propose an iterative adaptation method for learning the proposal distribution of an importance sampler based on stochastic approximation. The stochastic approximation method can recruit general iterative optimization techniques like the minorization-maximization algorithm. The effectiveness of the approach in optimizing the Kullback divergence between the proposal distribution and the target is demonstrated using several simple examples.
Submitted 10 December, 2007;
originally announced December 2007.
-
Bayes and empirical Bayes changepoint problems
Authors:
Heng Lian
Abstract:
We generalize the approach of Liu and Lawrence (1999) for multiple changepoint problems where the number of changepoints is unknown. The approach is based on dynamic programming recursion for efficient calculation of the marginal probability of the data with the hidden parameters integrated out. For the estimation of the hyperparameters, we propose to use Monte Carlo EM when training data are available. We argue that there are some advantages to using samples from the posterior, which take into account the uncertainty of the changepoints, compared to the traditional MAP estimator, which is also more expensive to compute in this context. The samples from the posterior obtained by our algorithm are independent, getting rid of the convergence issue associated with the MCMC approach. We illustrate our approach on limited simulations and a real data set.
Submitted 10 September, 2007;
originally announced September 2007.
-
MOST: detecting cancer differential gene expression
Authors:
Heng Lian
Abstract:
We propose a new statistic for the detection of differentially expressed genes, when the genes are activated only in a subset of the samples. Statistics designed for this unconventional circumstance have proved to be valuable for most cancer studies, where oncogenes are activated for a small number of disease samples. Previous efforts made in this direction include COPA, OS and ORT. We propose a new statistic called maximum ordered subset t-statistics (MOST), which seems to be natural when the number of activated samples is unknown. We compare MOST to other statistics and find the proposed method often has more power than its competitors.
Submitted 10 September, 2007;
originally announced September 2007.