Search | arXiv e-print repository

arXiv:2406.12212 [pdf, other]

Identifying Genetic Variants for Obesity Incorporating Prior Insights: Quantile Regression with Insight Fusion for Ultra-high Dimensional Data

Authors: Jiantong Wang, Heng Lian, Yan Yu, Heping Zhang

Abstract: Obesity is widely recognized as a critical and pervasive health concern. We strive to identify important genetic risk factors from hundreds of thousands of single nucleotide polymorphisms (SNPs) for obesity. We propose and apply a novel Quantile Regression with Insight Fusion (QRIF) approach that can integrate insights from established studies or domain knowledge to simultaneously perform variable selection and modeling for ultra-high dimensional genetic data, focusing on high conditional quantiles of body mass index (BMI) that are of most interest. We discover interesting new SNPs and shed new light on a comprehensive view of the underlying genetic risk factors for different levels of BMI. This may potentially pave the way for more precise and targeted treatment strategies. The QRIF approach intends to balance the trade-off between the prior insights and the observed data while being robust to potential false information. We further establish the desirable asymptotic properties under the challenging non-differentiable check loss functions via Huber loss approximation and nonconvex SCAD penalty via local linear approximation. Finally, we develop an efficient algorithm for the QRIF approach. Our simulation studies further demonstrate its effectiveness.

Submitted 17 June, 2024; originally announced June 2024.

Comments: This article is submitted to Journal of the American Statistical Association
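
The check loss named in the abstract above can be illustrated with a minimal sketch. This is not the QRIF method: it shows only the standard check loss and the fact that, over constants, its average is minimized near the sample quantile. The Huber approximation, the SCAD penalty, and the insight-fusion step are not shown, and the names `check_loss` and `best_constant` are hypothetical.

```python
import numpy as np

def check_loss(u, tau):
    """Quantile check loss rho_tau(u) = u * (tau - 1{u < 0})."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def best_constant(y, tau, grid):
    """Return the grid value that minimizes the average check loss of y - c."""
    losses = [check_loss(y - c, tau).mean() for c in grid]
    return grid[int(np.argmin(losses))]

# Simulated data (an assumption for illustration, not data from the paper).
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
grid = np.linspace(-3, 3, 601)

# The minimizer for tau = 0.9 should sit close to the empirical 0.9 quantile.
q90 = best_constant(y, 0.9, grid)
print(q90, np.quantile(y, 0.9))
```

Because the loss weights positive residuals by tau and negative residuals by 1 - tau, a high tau such as 0.9 pushes the fit toward the upper tail of the response, which matches the abstract's focus on high conditional quantiles of BMI.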

arXiv:2405.14652 [pdf, ps, other]

Statistical inference for high-dimensional convoluted rank regression

Authors: Leheng Cai, Xu Guo, Heng Lian, Liping Zhu

Abstract: High-dimensional penalized rank regression is a powerful tool for modeling high-dimensional data due to its robustness and estimation efficiency. However, the non-smoothness of the rank loss brings great challenges to the computation. To solve this critical issue, high-dimensional convoluted rank regression was recently proposed, and penalized convoluted rank regression estimators were introduced. However, these estimators cannot be directly used to make inference. In this paper, we investigate the inference problem of high-dimensional convoluted rank regression. We first establish estimation error bounds of penalized convoluted rank regression estimators under weaker conditions on the predictors. Based on the penalized convoluted rank regression estimators, we further introduce a debiased estimator. We then provide a Bahadur representation for our proposed estimator. We further develop simultaneous inference procedures. A novel bootstrap procedure is proposed and its theoretical validity is also established. Finally, simulation and real data analysis are conducted to illustrate the merits of our proposed methods.

Submitted 23 May, 2024; originally announced May 2024.

arXiv:2405.02539 [pdf, ps, other]

Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models

Authors: Changxin Yang, Zhongyi Zhu, Heng Lian

Abstract: While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converges with a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in predictions and subset selection, with the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method also exhibits similar superiority.

Submitted 3 May, 2024; originally announced May 2024.
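
The Iterative Hard Thresholding idea mentioned in the abstract above can be sketched in a few lines. This is only an illustration under simplifying assumptions: it swaps in an ordinary squared-error loss for the Tobit likelihood, ignores censoring, and runs on a single machine rather than in a distributed setting. The function names, the step-size choice, and the simulated data are all hypothetical.

```python
import numpy as np

def hard_threshold(beta, s):
    """Keep the s largest-magnitude entries of beta and zero out the rest."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[-s:]
    out[keep] = beta[keep]
    return out

def iht_least_squares(X, y, s, n_iter=200):
    """IHT for the loss (1 / 2n) * ||y - X beta||^2 with at most s nonzeros."""
    n, p = X.shape
    # Step size 1 / L, where L = ||X||_2^2 / n is the gradient's Lipschitz constant.
    step = n / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n          # gradient step ...
        beta = hard_threshold(beta - step * grad, s)  # ... then projection
    return beta

# Simulated sparse regression (an assumption for illustration only).
rng = np.random.default_rng(0)
n, p, s = 400, 50, 3
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 2.0]
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta_hat = iht_least_squares(X, y, s)
support = np.flatnonzero(beta_hat)
print(support, beta_hat[support])
```

Each iteration is a gradient step followed by a projection onto vectors with at most `s` nonzero entries, so the sparsity level is imposed directly rather than through a penalty.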

arXiv:2105.01278 [pdf, other]

Nonparametric Quantile Regression for Homogeneity Pursuit in Panel Data Models

Authors: Xiaoyu Zhang, Di Wang, Heng Lian, Guodong Li

Abstract: Many panel data have the latent subgroup effect on individuals, and it is important to correctly identify these groups since the efficiency of resulting estimators can be improved significantly by pooling the information of individuals within each group. However, the currently assumed parametric and semiparametric relationship between the response and predictors may be misspecified, which leads to a wrong grouping result, and the nonparametric approach hence can be considered to avoid such mistakes. Moreover, the response may depend on predictors in different ways at various quantile levels, and the corresponding grouping structure may also vary. To tackle these problems, this article proposes a nonparametric quantile regression method for homogeneity pursuit in panel data models with individual effects, and a pairwise fused penalty is used to automatically select the number of groups. The asymptotic properties are established, and an ADMM algorithm is also developed. The finite sample performance is evaluated by simulation experiments, and the usefulness of the proposed methodology is further illustrated by an empirical example.

Submitted 22 August, 2022; v1 submitted 3 May, 2021; originally announced May 2021.

Comments: To appear at the Journal of Business & Economic Statistics

arXiv:1909.06624 [pdf, other]

High-dimensional vector autoregressive time series modeling via tensor decomposition

Authors: Di Wang, Yao Zheng, Heng Lian, Guodong Li

Abstract: The classical vector autoregressive model is a fundamental tool for multivariate time series analysis. However, it involves too many parameters when the number of time series and lag order are even moderately large. This paper proposes to rearrange the transition matrices of the model into a tensor form such that the parameter space can be restricted along three directions simultaneously via tensor decomposition. In contrast, the reduced-rank regression method can restrict the parameter space in only one direction. Besides achieving substantial dimension reduction, the proposed model is interpretable from the factor modeling perspective. Moreover, to handle high-dimensional time series, this paper considers imposing sparsity on factor matrices to improve the model interpretability and estimation efficiency, which leads to a sparsity-inducing estimator. For the low-dimensional case, we derive asymptotic properties of the proposed least squares estimator and introduce an alternating least squares algorithm. For the high-dimensional case, we establish non-asymptotic properties of the sparsity-inducing estimator and propose an ADMM algorithm for regularized estimation. Simulation experiments and a real data example demonstrate the advantages of the proposed approach over various existing methods.

Submitted 3 November, 2020; v1 submitted 14 September, 2019; originally announced September 2019.

arXiv:1802.03511 [pdf, other]

A General Framework For Frequentist Model Averaging

Authors: Priyam Mitra, Heng Lian, Ritwik Mitra, Hua Liang, Min-ge Xie

Abstract: Model selection strategies have been routinely employed to determine a model for data analysis in statistics, and further study and inference then often proceed as though the selected model were the true model and had been known a priori. This practice does not account for the uncertainty introduced by the selection process and the fact that the selected model can possibly be a wrong one. Model averaging approaches try to remedy this issue by combining estimators for a set of candidate models. Specifically, instead of deciding which model is the 'right' one, a model averaging approach suggests fitting a set of candidate models and averaging over the estimators using certain data adaptive weights. In this paper we establish a general frequentist model averaging framework that does not set any restrictions on the set of candidate models. It greatly broadens the scope of the existing methodologies under the frequentist model averaging development. Assuming the data is from an unknown model, we derive the model averaging estimator and study its limiting distributions and related predictions while taking possible modeling biases into account. We propose a set of optimal weights to combine the individual estimators so that the expected mean squared error of the average estimator is minimized. Simulation studies are conducted to compare the performance of the estimator with that of the existing methods. The results show the benefits of the proposed approach over traditional model selection approaches as well as existing model averaging methods.

Submitted 9 February, 2018; originally announced February 2018.

arXiv:1708.05487 [pdf, ps, other]

Debiased distributed learning for sparse partial linear models in high dimensions

Authors: Shaogao Lv, Heng Lian

Abstract: Although various distributed machine learning schemes have been proposed recently for pure linear models and fully nonparametric models, little attention has been paid to distributed optimization for semi-parametric models with multiple-level structures (e.g. sparsity, linearity and nonlinearity). To address these issues, the current paper proposes a new communication-efficient distributed learning algorithm for partially sparse linear models with an increasing number of features. The proposed method is based on the classical divide and conquer strategy for handling big data, and each sub-method defined on each subsample consists of a debiased estimation of the double-regularized least squares approach. With the proposed method, we theoretically prove that our global parametric estimator can achieve the optimal parametric rate in our semi-parametric model given an appropriate partition on the total data. Specifically, the choice of data partition relies on the underlying smoothness of the nonparametric component, but it is adaptive to the sparsity parameter. Even under the non-distributed setting, we develop a new and easy-to-read proof for optimal estimation of the parametric error in the high dimensional partial linear model. Finally, several simulated experiments are implemented to indicate comparable empirical performance of our debiased technique under the distributed setting.

Submitted 3 November, 2019; v1 submitted 17 August, 2017; originally announced August 2017.

arXiv:1701.03772 [pdf, other]

Additive Partially Linear Models for Massive Heterogeneous Data

Authors: Binhuan Wang, Yixin Fang, Heng Lian, Hua Liang

Abstract: We consider an additive partially linear framework for modelling massive heterogeneous data. The major goal is to extract multiple common features simultaneously across all sub-populations while exploring heterogeneity of each sub-population. We propose an aggregation type of estimators for the commonality parameters that possess the asymptotic optimal bounds and the asymptotic distributions as if there were no heterogeneity. This oracle result holds when the number of sub-populations does not grow too fast and the tuning parameters are selected carefully. A plug-in estimator for the heterogeneity parameter is further constructed, and shown to possess the asymptotic distribution as if the commonality information were available. Furthermore, we develop a heterogeneity test for the linear components and a homogeneity test for the non-linear components accordingly. The performance of the proposed methods is evaluated via simulation studies and an application to the Medicare Provider Utilization and Payment data.

Submitted 28 December, 2018; v1 submitted 13 January, 2017; originally announced January 2017.

arXiv:1511.01124 [pdf, ps, other]

Greedy Forward Regression for Variable Screening

Authors: Ming-Yen Cheng, Sanying Feng, Gaorong Li, Heng Lian

Abstract: Two popular variable screening methods under the ultra-high dimensional setting with the desirable sure screening property are the sure independence screening (SIS) and the forward regression (FR). Both are classical variable screening methods and recently have attracted greater attention under the new light of high-dimensional data analysis. We consider a new and simple screening method that incorporates multiple predictors in each step of forward regression, with decision on which variables to incorporate based on the same criterion. If only one step is carried out, it actually reduces to the SIS. Thus it can be regarded as a generalization and unification of the FR and the SIS. More importantly, it preserves the sure screening property and has similar computational complexity as FR in each step, yet it can discover the relevant covariates in fewer steps. Thus, it reduces the computational burden of FR drastically while retaining advantages of the latter over SIS. Furthermore, we show that it can find all the true variables if the number of steps taken is the same as the correct model size, even when using the original FR. An extensive simulation study and application to two real data examples demonstrate excellent performance of the proposed method.

Submitted 3 November, 2015; originally announced November 2015.
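
As a rough illustration of the sure independence screening (SIS) baseline named in the abstract above, the sketch below ranks predictors by absolute marginal correlation with the response and keeps the top `d`. It does not implement the paper's greedy forward regression. The function name `sis_screen`, the choice `d = 10`, and the simulated data are assumptions made for this example.

```python
import numpy as np

def sis_screen(X, y, d):
    """Return indices of the d predictors with the largest |marginal correlation| with y."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each column
    yc = (y - y.mean()) / y.std()               # standardize the response
    corr = Xc.T @ yc / len(y)                   # marginal sample correlations
    return np.argsort(-np.abs(corr))[:d]

# Simulated ultra-high dimensional setting with p > n (illustration only).
rng = np.random.default_rng(0)
n, p = 200, 1000
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 3.0 * X[:, 5] + rng.normal(size=n)

selected = sis_screen(X, y, d=10)
print(sorted(selected.tolist()))
```

Screening of this kind looks at each predictor in isolation, which is what makes it cheap. The abstract's forward-regression variants instead account for variables that have already been selected.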

arXiv:1402.1649 [pdf, ps, other]

Variable Selection and Estimation for Partially Linear Single-index Models with Longitudinal Data

Authors: Gaorong Li, Peng Lai, Heng Lian

Abstract: In this paper, we consider the partially linear single-index models with longitudinal data. To deal with the variable selection problem in this context, we propose a penalized procedure combined with two bias correction methods, resulting in the bias-corrected generalized estimating equation (GEE) and the bias-corrected quadratic inference function (QIF), which can take into account the correlations. Asymptotic properties of these methods are demonstrated. We also evaluate the finite sample performance of the proposed methods via Monte Carlo simulation studies and a real data analysis.

Submitted 7 February, 2014; originally announced February 2014.

Comments: to appear in Statistics and Computing

arXiv:1312.2364 [pdf, ps, other]

doi 10.1214/13-AOAS640

Letter to the Editor

Authors: Yuao Hu, Ye Tian, Heng Lian

Abstract: The paper by Alfons, Croux and Gelper (2013), Sparse least trimmed squares regression for analyzing high-dimensional large data sets, considered a combination of least trimmed squares (LTS) and lasso penalty for robust and sparse high-dimensional regression. In a recent paper [She and Owen (2011)], a method for outlier detection based on a sparsity penalty on the mean shift parameter was proposed (designated by "SO" in the following). This work is mentioned in Alfons et al. as being an "entirely different approach." Certainly the problem studied by Alfons et al. is novel and interesting.

Submitted 9 December, 2013; originally announced December 2013.

Comments: Published at http://dx.doi.org/10.1214/13-AOAS640 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS640

Journal ref: Annals of Applied Statistics 2013, Vol. 7, No. 2, 1244-1246

arXiv:1309.6058 [pdf, ps, other]

Reduced-rank Regression in Sparse Multivariate Varying-Coefficient Models with High-dimensional Covariates

Authors: Heng Lian, Shujie Ma

Abstract: In genetic studies, not only can the number of predictors obtained from microarray measurements be extremely large, there can also be multiple response variables. Motivated by such a situation, we consider semiparametric dimension reduction methods in sparse multivariate regression models. Previous studies on joint variable and rank selection have focused on parametric models while here we consider the more challenging varying-coefficient models which make the investigation on nonlinear interactions of variables possible. Spline approximation, rank constraints and concave group penalties are utilized for model estimation. Asymptotic oracle properties of the estimators are presented. We also propose reduced-rank independent screening to deal with the situation when the dimension is so high that penalized estimation cannot be efficiently applied. In simulations, we show the advantages of simultaneously performing variable and rank selection. A real data set is analyzed to illustrate the good prediction performance when incorporating interactions between genetic variables and an index variable.

Submitted 24 September, 2013; originally announced September 2013.

arXiv:1307.2668 [pdf, ps, other]

Bayesian Quantile Regression for Partially Linear Additive Models

Authors: Yuao Hu, Kaifeng Zhao, Heng Lian

Abstract: In this article, we develop a semiparametric Bayesian estimation and model selection approach for partially linear additive models in conditional quantile regression. The asymmetric Laplace distribution provides a mechanism for Bayesian inferences of quantile regression models based on the check loss. The advantage of this new method is that nonlinear, linear and zero function components can be separated automatically and simultaneously during model fitting without the need of pre-specification or parameter tuning. This is achieved by spike-and-slab priors using two sets of indicator variables. For posterior inferences, we design an effective partially collapsed Gibbs sampler. Simulation studies are used to illustrate our algorithm. The proposed approach is further illustrated by applications to two real data sets.

Submitted 10 July, 2013; originally announced July 2013.

arXiv:1211.4080 [pdf, ps, other]

Minimax Prediction for Functional Linear Regression with Functional Responses in Reproducing Kernel Hilbert Spaces

Showing 1–33 of 33 results for author: Lian, H