Search | arXiv e-print repository

arXiv:2405.19058 [pdf, other]

Participation bias in the estimation of heritability and genetic correlation

Authors: Shuang Song, Stefania Benonisdottir, Jun S. Liu, Augustine Kong

Abstract: It is increasingly recognized that participation bias can pose problems for genetic studies. Recently, to overcome the challenge that genetic information of non-participants is unavailable, it is shown that by comparing the IBD (identity by descent) shared and not-shared segments among the participants, one can estimate the genetic component underlying participation. That, however, does not direct… ▽ More It is increasingly recognized that participation bias can pose problems for genetic studies. Recently, to overcome the challenge that genetic information of non-participants is unavailable, it is shown that by comparing the IBD (identity by descent) shared and not-shared segments among the participants, one can estimate the genetic component underlying participation. That, however, does not directly address how to adjust estimates of heritability and genetic correlation for phenotypes correlated with participation. Here, for phenotypes whose mean differences between population and sample are known, we demonstrate a way to do so by adopting a statistical framework that separates out the genetic and non-genetic correlations between participation and these phenotypes. Crucially, our method avoids making the assumption that the effect of the genetic component underlying participation is manifested entirely through these other phenotypes. Applying the method to 12 UK Biobank phenotypes, we found 8 have significant genetic correlations with participation, including body mass index, educational attainment, and smoking status. For most of these phenotypes, without adjustments, estimates of heritability and the absolute value of genetic correlation would have underestimation biases. △ Less

Submitted 30 May, 2024; v1 submitted 29 May, 2024; originally announced May 2024.

arXiv:2401.06383 [pdf, other]

Decomposition with Monotone B-splines: Fitting and Testing

Authors: Lijun Wang, Xiaodan Fan, Hongyu Zhao, Jun S. Liu

Abstract: A univariate continuous function can always be decomposed as the sum of a non-increasing function and a non-decreasing one. Based on this property, we propose a non-parametric regression method that combines two spline-fitted monotone curves. We demonstrate by extensive simulations that, compared to standard spline-fitting methods, the proposed approach is particularly advantageous in high-noise s… ▽ More A univariate continuous function can always be decomposed as the sum of a non-increasing function and a non-decreasing one. Based on this property, we propose a non-parametric regression method that combines two spline-fitted monotone curves. We demonstrate by extensive simulations that, compared to standard spline-fitting methods, the proposed approach is particularly advantageous in high-noise scenarios. Several theoretical guarantees are established for the proposed approach. Additionally, we present statistics to test the monotonicity of a function based on monotone decomposition, which can better control Type I error and achieve comparable (if not always higher) power compared to existing methods. Finally, we apply the proposed fitting and testing approaches to analyze the single-cell pseudotime trajectory datasets, identifying significant biological insights for non-monotonically expressed genes through Gene Ontology enrichment analysis. The source code implementing the methodology and producing all results is accessible at https://github.com/szcf-weiya/MonotoneDecomposition.jl. △ Less

Submitted 9 April, 2024; v1 submitted 12 January, 2024; originally announced January 2024.

arXiv:2309.16855 [pdf, other]

A Variational Spike-and-Slab Approach for Group Variable Selection

Authors: Buyu Lin, Changhao Ge, Jun S. Liu

Abstract: We introduce a class of generic spike-and-slab priors for high-dimensional linear regression with grouped variables and present a Coordinate-ascent Variational Inference (CAVI) algorithm for obtaining an optimal variational Bayes approximation. Using parameter expansion for a specific, yet comprehensive, family of slab distributions, we obtain a further gain in computational efficiency. The method… ▽ More We introduce a class of generic spike-and-slab priors for high-dimensional linear regression with grouped variables and present a Coordinate-ascent Variational Inference (CAVI) algorithm for obtaining an optimal variational Bayes approximation. Using parameter expansion for a specific, yet comprehensive, family of slab distributions, we obtain a further gain in computational efficiency. The method can be easily extended to fitting additive models. Theoretically, we present general conditions on the generic spike-and-slab priors that enable us to derive the contraction rates for both the true posterior and the VB posterior for linear regression and additive models, of which some previous theoretical results can be viewed as special cases. Our simulation studies and real data application demonstrate that the proposed method is superior to existing methods in both variable selection and parameter estimation. Our algorithm is implemented in the R package GVSSB. △ Less

Submitted 28 September, 2023; originally announced September 2023.

Comments: 64 pages, 6 figures

arXiv:2308.15370 [pdf, other]

Multi-Response Heteroscedastic Gaussian Process Models and Their Inference

Authors: Taehee Lee, Jun S. Liu

Abstract: Despite the widespread utilization of Gaussian process models for versatile nonparametric modeling, they exhibit limitations in effectively capturing abrupt changes in function smoothness and accommodating relationships with heteroscedastic errors. Addressing these shortcomings, the heteroscedastic Gaussian process (HeGP) regression seeks to introduce flexibility by acknowledging the variability o… ▽ More Despite the widespread utilization of Gaussian process models for versatile nonparametric modeling, they exhibit limitations in effectively capturing abrupt changes in function smoothness and accommodating relationships with heteroscedastic errors. Addressing these shortcomings, the heteroscedastic Gaussian process (HeGP) regression seeks to introduce flexibility by acknowledging the variability of residual variances across covariates in the regression model. In this work, we extend the HeGP concept, expanding its scope beyond regression tasks to encompass classification and state-space models. To achieve this, we propose a novel framework where the Gaussian process is coupled with a covariate-induced precision matrix process, adopting a mixture formulation. This approach enables the modeling of heteroscedastic covariance functions across covariates. To mitigate the computational challenges posed by sampling, we employ variational inference to approximate the posterior and facilitate posterior predictive modeling. Additionally, our training process leverages an EM algorithm featuring closed-form M-step updates to efficiently evaluate the heteroscedastic covariance function. A notable feature of our model is its consistent performance on multivariate responses, accommodating various types (continuous or categorical) seamlessly. Through a combination of simulations and real-world applications in climatology, we illustrate the model's prowess and advantages. By overcoming the limitations of traditional Gaussian process models, our proposed framework offers a robust and versatile tool for a wide array of applications. △ Less

Submitted 30 August, 2023; v1 submitted 29 August, 2023; originally announced August 2023.

Comments: submitted to the Journal of the American Statistical Association (JASA)

arXiv:2307.01748 [pdf, other]

Monotone Cubic B-Splines with a Neural-Network Generator

Authors: Lijun Wang, Xiaodan Fan, Huabai Li, Jun S. Liu

Abstract: We present a method for fitting monotone curves using cubic B-splines, which is equivalent to putting a monotonicity constraint on the coefficients. We explore different ways of enforcing this constraint and analyze their theoretical and empirical properties. We propose two algorithms for solving the spline fitting problem: one that uses standard optimization techniques and one that trains a Multi… ▽ More We present a method for fitting monotone curves using cubic B-splines, which is equivalent to putting a monotonicity constraint on the coefficients. We explore different ways of enforcing this constraint and analyze their theoretical and empirical properties. We propose two algorithms for solving the spline fitting problem: one that uses standard optimization techniques and one that trains a Multi-Layer Perceptrons (MLP) generator to approximate the solutions under various settings and perturbations. The generator approach can speed up the fitting process when we need to solve the problem repeatedly, such as when constructing confidence bands using bootstrap. We evaluate our method against several existing methods, some of which do not use the monotonicity constraint, on some monotone curves with varying noise levels. We demonstrate that our method outperforms the other methods, especially in high-noise scenarios. We also apply our method to analyze the polarization-hole phenomenon during star formation in astrophysics. The source code is accessible at \texttt{\url{https://github.com/szcf-weiya/MonotoneSplines.jl}}. △ Less

Submitted 17 November, 2023; v1 submitted 4 July, 2023; originally announced July 2023.

arXiv:2201.10063 [pdf, other]

Varying Coefficient Model via Adaptive Spline Fitting

Authors: Xufei Wang, Bo Jiang, Jun S. Liu

Abstract: The varying coefficient model has received broad attention from researchers as it is a powerful dimension reduction tool for non-parametric modeling. Most existing varying coefficient models fitted with polynomial spline assume equidistant knots and take the number of knots as the hyperparameter. However, imposing equidistant knots appears to be too rigid, and determining the optimal number of kno… ▽ More The varying coefficient model has received broad attention from researchers as it is a powerful dimension reduction tool for non-parametric modeling. Most existing varying coefficient models fitted with polynomial spline assume equidistant knots and take the number of knots as the hyperparameter. However, imposing equidistant knots appears to be too rigid, and determining the optimal number of knots systematically is also a challenge. In this article, we deal with this challenge by utilizing polynomial splines with adaptively selected and predictor-specific knots to fit the coefficients in varying coefficient models. An efficient dynamic programming algorithm is proposed to find the optimal solution. Numerical results show that the new method can achieve significantly smaller mean squared errors for coefficients compared with the equidistant spline fitting method. △ Less

Submitted 14 June, 2022; v1 submitted 24 January, 2022; originally announced January 2022.

arXiv:2112.08641 [pdf, ps, other]

On Gibbs Sampling for Structured Bayesian Models Discussion of paper by Zanella and Roberts

Authors: Xiaodong Yang, Jun S. Liu

Abstract: This article is a discussion of Zanella and Roberts' paper: Multilevel linear models, gibbs samplers and multigrid decompositions. We consider several extensions in which the multigrid decomposition would bring us interesting insights, including vector hierarchical models, linear mixed effects models and partial centering parametrizations. This article is a discussion of Zanella and Roberts' paper: Multilevel linear models, gibbs samplers and multigrid decompositions. We consider several extensions in which the multigrid decomposition would bring us interesting insights, including vector hierarchical models, linear mixed effects models and partial centering parametrizations. △ Less

Submitted 16 December, 2021; originally announced December 2021.

Comments: 18 pages

arXiv:2111.15084 [pdf, other]

Convergence Rate of Multiple-try Metropolis Independent sampler

Authors: Xiaodong Yang, Jun S. Liu

Abstract: The Multiple-try Metropolis (MTM) method is an interesting extension of the classical Metropolis-Hastings algorithm. However, theoretical understandings of its convergence behavior as well as whether and how it may help are still unknown. This paper derives the exact convergence rate for Multiple-try Metropolis Independent sampler (MTM-IS) via an explicit eigen analysis. As a by-product, we prove… ▽ More The Multiple-try Metropolis (MTM) method is an interesting extension of the classical Metropolis-Hastings algorithm. However, theoretical understandings of its convergence behavior as well as whether and how it may help are still unknown. This paper derives the exact convergence rate for Multiple-try Metropolis Independent sampler (MTM-IS) via an explicit eigen analysis. As a by-product, we prove that MTM-IS is less efficient than the simpler approach of repeated independent Metropolis-Hastings method at the same computational cost. We further explore more variations and find it possible to design more efficient MTM algorithms by creating correlated multiple trials. △ Less

Submitted 3 February, 2023; v1 submitted 29 November, 2021; originally announced November 2021.

Comments: 34 pages; 7 figures

arXiv:2110.15406 [pdf, other]

Kernel-based Partial Permutation Test for Detecting Heterogeneous Functional Relationship

Authors: Xinran Li, Bo Jiang, Jun S. Liu

Abstract: We propose a kernel-based partial permutation test for checking the equality of functional relationship between response and covariates among different groups. The main idea, which is intuitive and easy to implement, is to keep the projections of the response vector $\boldsymbol{Y}$ on leading principle components of a kernel matrix fixed and permute $\boldsymbol{Y}$'s projections on the remaining… ▽ More We propose a kernel-based partial permutation test for checking the equality of functional relationship between response and covariates among different groups. The main idea, which is intuitive and easy to implement, is to keep the projections of the response vector $\boldsymbol{Y}$ on leading principle components of a kernel matrix fixed and permute $\boldsymbol{Y}$'s projections on the remaining principle components. The proposed test allows for different choices of kernels, corresponding to different classes of functions under the null hypothesis. First, using linear or polynomial kernels, our partial permutation tests are exactly valid in finite samples for linear or polynomial regression models with Gaussian noise; similar results straightforwardly extend to kernels with finite feature spaces. Second, by allowing the kernel feature space to diverge with the sample size, the test can be large-sample valid for a wider class of functions. Third, for general kernels with possibly infinite-dimensional feature space, the partial permutation test is exactly valid when the covariates are exactly balanced across all groups, or asymptotically valid when the underlying function follows certain regularized Gaussian processes. We further suggest test statistics using likelihood ratio between two (nested) GPR models, and propose computationally efficient algorithms utilizing the EM algorithm and Newton's method, where the latter also involves Fisher scoring and quadratic programming and is particularly useful when EM suffers from slow convergence. Extensions to correlated and non-Gaussian noises have also been investigated theoretically or numerically. Furthermore, the test can be extended to use multiple kernels together and can thus enjoy properties from each kernel. Both simulation study and application illustrate the properties of the proposed test. △ Less

Submitted 28 October, 2021; originally announced October 2021.

arXiv:2104.07261 [pdf, other]

Partition-Mallows Model and Its Inference for Rank Aggregation

Authors: Wanchuang Zhu, Yingkai Jiang, Jun S. Liu, Ke Deng

Abstract: Learning how to aggregate ranking lists has been an active research area for many years and its advances have played a vital role in many applications ranging from bioinformatics to internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entit… ▽ More Learning how to aggregate ranking lists has been an active research area for many years and its advances have played a vital role in many applications ranging from bioinformatics to internet commerce. The problem of discerning reliability of rankers based only on the rank data is of great interest to many practitioners, but has received less attention from researchers. By dividing the ranked entities into two disjoint groups, i.e., relevant and irrelevant/background ones, and incorporating the Mallows model for the relative ranking of relevant entities, we propose a framework for rank aggregation that can not only distinguish quality differences among the rankers but also provide the detailed ranking information for relevant entities. Theoretical properties of the proposed approach are established, and its advantages over existing approaches are demonstrated via simulation studies and real-data applications. Extensions of the proposed method to handle partial ranking lists and conduct covariate-assisted rank aggregation are also discussed. △ Less

Submitted 15 April, 2021; originally announced April 2021.

arXiv:2010.06175 [pdf, other]

Neural Gaussian Mirror for Controlled Feature Selection in Neural Networks

Authors: Xin Xing, Yu Gui, Chenguang Dai, Jun S. Liu

Abstract: Deep neural networks (DNNs) have become increasingly popular and achieved outstanding performance in predictive tasks. However, the DNN framework itself cannot inform the user which features are more or less relevant for making the prediction, which limits its applicability in many scientific fields. We introduce neural Gaussian mirrors (NGMs), in which mirrored features are created, via a structu… ▽ More Deep neural networks (DNNs) have become increasingly popular and achieved outstanding performance in predictive tasks. However, the DNN framework itself cannot inform the user which features are more or less relevant for making the prediction, which limits its applicability in many scientific fields. We introduce neural Gaussian mirrors (NGMs), in which mirrored features are created, via a structured perturbation based on a kernel-based conditional dependence measure, to help evaluate feature importance. We design two modifications of the DNN architecture for incorporating mirrored features and providing mirror statistics to measure feature importance. As shown in simulated and real data examples, the proposed method controls the feature selection error rate at a predefined level and maintains a high selection power even with the presence of highly correlated features. △ Less

Submitted 13 October, 2020; originally announced October 2020.

arXiv:2007.07498 [pdf, other]

Measurement error models: from nonparametric methods to deep neural networks

Authors: Zhirui Hu, Zheng Tracy Ke, Jun S Liu

Abstract: The success of deep learning has inspired recent interests in applying neural networks in statistical inference. In this paper, we investigate the use of deep neural networks for nonparametric regression with measurement errors. We propose an efficient neural network design for estimating measurement error models, in which we use a fully connected feed-forward neural network (FNN) to approximate t… ▽ More The success of deep learning has inspired recent interests in applying neural networks in statistical inference. In this paper, we investigate the use of deep neural networks for nonparametric regression with measurement errors. We propose an efficient neural network design for estimating measurement error models, in which we use a fully connected feed-forward neural network (FNN) to approximate the regression function $f(x)$, a normalizing flow to approximate the prior distribution of $X$, and an inference network to approximate the posterior distribution of $X$. Our method utilizes recent advances in variational inference for deep neural networks, such as the importance weight autoencoder, doubly reparametrized gradient estimator, and non-linear independent components estimation. We conduct an extensive numerical study to compare the neural network approach with classical nonparametric methods and observe that the neural network approach is more flexible in accommodating different classes of regression functions and performs superior or comparable to the best available method in nearly all settings. △ Less

Submitted 15 July, 2020; originally announced July 2020.

Comments: 37 pages, 8 figures

arXiv:2007.06136 [pdf, other]

Bayesian Bi-clustering Methods with Applications in Computational Biology

Authors: Han Yan, Jiexing Wu, Yang Li, Jun S. Liu

Abstract: Bi-clustering is a useful approach in analyzing biological data when observations come from heterogeneous groups and have a large number of features. We outline a general Bayesian approach in tackling bi-clustering problems in moderate to high dimensions, and propose three Bayesian bi-clustering models on categorical data, which increase in complexities in their modeling of the distributions of fe… ▽ More Bi-clustering is a useful approach in analyzing biological data when observations come from heterogeneous groups and have a large number of features. We outline a general Bayesian approach in tackling bi-clustering problems in moderate to high dimensions, and propose three Bayesian bi-clustering models on categorical data, which increase in complexities in their modeling of the distributions of features across bi-clusters. Our proposed methods apply to a wide range of scenarios: from situations where data are cluster-distinguishable only among a small subset of features but masked by a large amount of noise, to situations where different groups of data are identified by different sets of features or data exhibit hierarchical structures. Through simulation studies, we show that our methods outperform existing (bi-)clustering methods in both identifying clusters and recovering feature distributional patterns across bi-clusters. We apply our methods to two genetic datasets, though the area of application of our methods is even broader. △ Less

Submitted 9 February, 2021; v1 submitted 12 July, 2020; originally announced July 2020.

arXiv:2007.01237 [pdf, other]

A Scale-free Approach for False Discovery Rate Control in Generalized Linear Models

Authors: Chenguang Dai, Buyu Lin, Xin Xing, Jun S. Liu

Abstract: The generalized linear models (GLM) have been widely used in practice to model non-Gaussian response variables. When the number of explanatory features is relatively large, scientific researchers are of interest to perform controlled feature selection in order to simplify the downstream analysis. This paper introduces a new framework for feature selection in GLMs that can achieve false discovery r… ▽ More The generalized linear models (GLM) have been widely used in practice to model non-Gaussian response variables. When the number of explanatory features is relatively large, scientific researchers are of interest to perform controlled feature selection in order to simplify the downstream analysis. This paper introduces a new framework for feature selection in GLMs that can achieve false discovery rate (FDR) control in two asymptotic regimes. The key step is to construct a mirror statistic to measure the importance of each feature, which is based upon two (asymptotically) independent estimates of the corresponding true coefficient obtained via either the data-splitting method or the Gaussian mirror method. The FDR control is achieved by taking advantage of the mirror statistic's property that, for any null feature, its sampling distribution is (asymptotically) symmetric about 0. In the moderate-dimensional setting in which the ratio between the dimension (number of features) p and the sample size n converges to a fixed value, we construct the mirror statistic based on the maximum likelihood estimation. In the high-dimensional setting where p is much larger than n, we use the debiased Lasso to build the mirror statistic. Compared to the Benjamini-Hochberg procedure, which crucially relies on the asymptotic normality of the Z statistic, the proposed methodology is scale free as it only hinges on the symmetric property, thus is expected to be more robust in finite-sample cases. Both simulation results and a real data application show that the proposed methods are capable of controlling the FDR, and are often more powerful than existing methods including the Benjamini-Hochberg procedure and the knockoff filter. △ Less

Submitted 2 July, 2020; originally announced July 2020.

Comments: 60 pages, 13 pages

arXiv:2006.01055 [pdf, other]

On Posterior Consistency of Bayesian Factor Models in High Dimensions

Authors: Yucong Ma, Jun S. Liu

Abstract: As a principled dimension reduction technique, factor models have been widely adopted in social science, economics, bioinformatics, and many other fields. However, in high-dimensional settings, conducting a 'correct' Bayesianfactor analysis can be subtle since it requires both a careful prescription of the prior distribution and a suitable computational strategy. In particular, we analyze the issu… ▽ More As a principled dimension reduction technique, factor models have been widely adopted in social science, economics, bioinformatics, and many other fields. However, in high-dimensional settings, conducting a 'correct' Bayesianfactor analysis can be subtle since it requires both a careful prescription of the prior distribution and a suitable computational strategy. In particular, we analyze the issues related to the attempt of being "noninformative" for elements of the factor loading matrix, especially for sparse Bayesian factor models in high dimensions, and propose solutions to them. We show here why adopting the orthogonal factor assumption is appropriate and can result in a consistent posterior inference of the loading matrix conditional on the true idiosyncratic variance and the allocation of nonzero elements in the true loading matrix. We also provide an efficient Gibbs sampler to conduct the full posterior inference based on the prior setup from Rockova and George (2016)and a uniform orthogonal factor assumption on the factor matrix. △ Less

Submitted 1 January, 2021; v1 submitted 1 June, 2020; originally announced June 2020.

arXiv:2006.00767 [pdf, other]

Generative Multiple-purpose Sampler for Weighted M-estimation

Authors: Minsuk Shin, Shijie Wang, Jun S Liu

Abstract: To overcome the computational bottleneck of various data perturbation procedures such as the bootstrap and cross validations, we propose the Generative Multiple-purpose Sampler (GMS), which constructs a generator function to produce solutions of weighted M-estimators from a set of given weights and tuning parameters. The GMS is implemented by a single optimization without having to repeatedly eval… ▽ More To overcome the computational bottleneck of various data perturbation procedures such as the bootstrap and cross validations, we propose the Generative Multiple-purpose Sampler (GMS), which constructs a generator function to produce solutions of weighted M-estimators from a set of given weights and tuning parameters. The GMS is implemented by a single optimization without having to repeatedly evaluate the minimizers of weighted losses, and is thus capable of significantly reducing the computational time. We demonstrate that the GMS framework enables the implementation of various statistical procedures that would be unfeasible in a conventional framework, such as the iterated bootstrap, bootstrapped cross-validation for penalized likelihood, bootstrapped empirical Bayes with nonparametric maximum likelihood, etc. To construct a computationally efficient generator function, we also propose a novel form of neural network called the \emph{weight multiplicative multilayer perceptron} to achieve fast convergence. Our numerical results demonstrate that the new neural network structure enjoys a few orders of magnitude speed advantage in comparison to the conventional one. An R package called GMS is provided, which runs under Pytorch to implement the proposed methods and allows the user to provide a customized loss function to tailor to their own models of interest. △ Less

Submitted 16 October, 2023; v1 submitted 1 June, 2020; originally announced June 2020.

arXiv:2004.01975 [pdf, other]

Stratification and Optimal Resampling for Sequential Monte Carlo

Authors: Yichao Li, Wenshuo Wang, Ke Deng, Jun S Liu

Abstract: Sequential Monte Carlo (SMC), also known as particle filters, has been widely accepted as a powerful computational tool for making inference with dynamical systems. A key step in SMC is resampling, which plays the role of steering the algorithm towards the future dynamics. Several strategies have been proposed and used in practice, including multinomial resampling, residual resampling (Liu and Che… ▽ More Sequential Monte Carlo (SMC), also known as particle filters, has been widely accepted as a powerful computational tool for making inference with dynamical systems. A key step in SMC is resampling, which plays the role of steering the algorithm towards the future dynamics. Several strategies have been proposed and used in practice, including multinomial resampling, residual resampling (Liu and Chen 1998), optimal resampling (Fearnhead and Clifford 2003), stratified resampling (Kitagawa 1996), and optimal transport resampling (Reich 2013). We show that, in the one dimensional case, optimal transport resampling is equivalent to stratified resampling on the sorted particles, and they both minimize the resampling variance as well as the expected squared energy distance between the original and resampled empirical distributions; in the multidimensional case, the variance of stratified resampling after sorting particles using Hilbert curve (Gerber et al. 2019) in $\mathbb{R}^d$ is $O(m^{-(1+2/d)})$, an improved rate compared to the original $O(m^{-(1+1/d)})$, where $m$ is the number of resampled particles. This improved rate is the lowest for ordered stratified resampling schemes, as conjectured in Gerber et al. (2019). We also present an almost sure bound on the Wasserstein distance between the original and Hilbert-curve-resampled empirical distributions. In light of these theoretical results, we propose the stratified multiple-descendant growth (SMG) algorithm, which allows us to explore the sample space more efficiently compared to the standard i.i.d. multiple-descendant sampling-resampling approach as measured by the Wasserstein metric. Numerical evidence is provided to demonstrate the effectiveness of our proposed method. △ Less

Submitted 7 December, 2020; v1 submitted 4 April, 2020; originally announced April 2020.

arXiv:2002.08542 [pdf, other]

False Discovery Rate Control via Data Splitting

Authors: Chenguang Dai, Buyu Lin, Xin Xing, Jun S. Liu

Abstract: Selecting relevant features associated with a given response variable is an important issue in many scientific fields. Quantifying quality and uncertainty of a selection result via false discovery rate (FDR) control has been of recent interest. This paper introduces a way of using data-splitting strategies to asymptotically control the FDR while maintaining a high power. For each feature, the meth… ▽ More Selecting relevant features associated with a given response variable is an important issue in many scientific fields. Quantifying quality and uncertainty of a selection result via false discovery rate (FDR) control has been of recent interest. This paper introduces a way of using data-splitting strategies to asymptotically control the FDR while maintaining a high power. For each feature, the method constructs a test statistic by estimating two independent regression coefficients via data splitting. FDR control is achieved by taking advantage of the statistic's property that, for any null feature, its sampling distribution is symmetric about zero. Furthermore, we propose Multiple Data Splitting (MDS) to stabilize the selection result and boost the power. Interestingly and surprisingly, with the FDR still under control, MDS not only helps overcome the power loss caused by sample splitting, but also results in a lower variance of the false discovery proportion (FDP) compared with all other methods in consideration. We prove that the proposed data-splitting methods can asymptotically control the FDR at any designated level for linear and Gaussian graphical models in both low and high dimensions. Through intensive simulation studies and a real-data application, we show that the proposed methods are robust to the unknown distribution of features, easy to implement and computationally efficient, and are often the most powerful ones amongst competitors especially when the signals are weak and the correlations or partial correlations are high among features. △ Less

Submitted 15 December, 2020; v1 submitted 19 February, 2020; originally announced February 2020.

arXiv:1911.09761 [pdf, other]

Controlling False Discovery Rate Using Gaussian Mirrors

Authors: Xin Xing, Zhigen Zhao, Jun S. Liu

Abstract: Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem. We here propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian perturbation, and proceeds with a certain regression method, such as t… ▽ More Simultaneously finding multiple influential variables and controlling the false discovery rate (FDR) for linear regression models is a fundamental problem. We here propose the Gaussian Mirror (GM) method, which creates for each predictor variable a pair of mirror variables by adding and subtracting a randomly generated Gaussian perturbation, and proceeds with a certain regression method, such as the ordinary least-square or the Lasso (the mirror variables can also be created after selection). The mirror variables naturally lead to test statistics effective for controlling the FDR. Under a mild assumption on the dependence among the covariates, we show that the FDR can be controlled at any designated level asymptotically. We also demonstrate through extensive numerical studies that the GM method is more powerful than many existing methods for selecting relevant variables subject to FDR control, especially for cases when the covariates are highly correlated and the influential variables are not overly sparse. △ Less

Submitted 19 March, 2021; v1 submitted 21 November, 2019; originally announced November 2019.

arXiv:1911.02171 [pdf, other]

Minimax Nonparametric Two-sample Test under Smoothing

Authors: Xin Xing, Zuofeng Shang, Pang Du, ** Ma, Wenxuan Zhong, Jun S. Liu

Abstract: We consider the problem of comparing probability densities between two groups. A new probabilistic tensor product smoothing spline framework is developed to model the joint density of two variables. Under such a framework, the probability density comparison is equivalent to testing the presence/absence of interactions. We propose a penalized likelihood ratio test for such interaction testing and s… ▽ More We consider the problem of comparing probability densities between two groups. A new probabilistic tensor product smoothing spline framework is developed to model the joint density of two variables. Under such a framework, the probability density comparison is equivalent to testing the presence/absence of interactions. We propose a penalized likelihood ratio test for such interaction testing and show that the test statistic is asymptotically chi-square distributed under the null hypothesis. Furthermore, we derive a sharp minimax testing rate based on the Bernstein width for nonparametric two-sample tests and show that our proposed test statistics is minimax optimal. In addition, a data-adaptive tuning criterion is developed to choose the penalty parameter. Simulations and real applications demonstrate that the proposed test outperforms the conventional approaches under various scenarios. △ Less

Submitted 11 January, 2021; v1 submitted 5 November, 2019; originally announced November 2019.

arXiv:1909.05922 [pdf, other]

Monte Carlo Approximation of Bayes Factors via Mixing with Surrogate Distributions

Authors: Chenguang Dai, Jun S. Liu

Abstract: By mixing the target posterior distribution with a surrogate distribution, of which the normalizing constant is tractable, we propose a method for estimating the marginal likelihood using the Wang-Landau algorithm. We show that a faster convergence of the proposed method can be achieved via the momentum acceleration. Two implementation strategies are detailed: (i) facilitating global jumps between… ▽ More By mixing the target posterior distribution with a surrogate distribution, of which the normalizing constant is tractable, we propose a method for estimating the marginal likelihood using the Wang-Landau algorithm. We show that a faster convergence of the proposed method can be achieved via the momentum acceleration. Two implementation strategies are detailed: (i) facilitating global jumps between the posterior and surrogate distributions via the Multiple-try Metropolis; (ii) constructing the surrogate via the variational approximation. When a surrogate is difficult to come by, we describe a new jum** mechanism for general reversible jump Markov chain Monte Carlo algorithms, which combines the Multiple-try Metropolis and a directional sampling algorithm. We illustrate the proposed methods on several statistical models, including the Log-Gaussian Cox process, the Bayesian Lasso, the logistic regression, and the g-prior Bayesian variable selection. △ Less

Submitted 15 December, 2020; v1 submitted 12 September, 2019; originally announced September 2019.

arXiv:1907.11985 [pdf, other]

doi 10.1103/PhysRevE.101.033301

The Wang-Landau Algorithm as Stochastic Optimization and Its Acceleration

Authors: Chenguang Dai, Jun S. Liu

Abstract: We show that the Wang-Landau algorithm can be formulated as a stochastic gradient descent algorithm minimizing a smooth and convex objective function, of which the gradient is estimated using Markov chain Monte Carlo iterations. The optimization formulation provides us a new way to establish the convergence rate of the Wang-Landau algorithm, by exploiting the fact that almost surely, the density e… ▽ More We show that the Wang-Landau algorithm can be formulated as a stochastic gradient descent algorithm minimizing a smooth and convex objective function, of which the gradient is estimated using Markov chain Monte Carlo iterations. The optimization formulation provides us a new way to establish the convergence rate of the Wang-Landau algorithm, by exploiting the fact that almost surely, the density estimates (on the logarithmic scale) remain in a compact set, upon which the objective function is strongly convex. The optimization viewpoint motivates us to improve the efficiency of the Wang-Landau algorithm using popular tools including the momentum method and the adaptive learning rate method. We demonstrate the accelerated Wang-Landau algorithm on a two-dimensional Ising model and a two-dimensional ten-state Potts model. △ Less

Submitted 2 February, 2020; v1 submitted 27 July, 2019; originally announced July 2019.

Comments: 10 pages, 3 figures

Journal ref: Phys. Rev. E 101, 033301 (2020)

arXiv:1905.12440 [pdf, other]

Generative Parameter Sampler For Scalable Uncertainty Quantification

Authors: Minsuk Shin, Young Lee, Jun S. Liu

Abstract: Uncertainty quantification has been a core of the statistical machine learning, but its computational bottleneck has been a serious challenge for both Bayesians and frequentists. We propose a model-based framework in quantifying uncertainty, called predictive-matching Generative Parameter Sampler (GPS). This procedure considers an Uncertainty Quantification (UQ) distribution on the targeted parame… ▽ More Uncertainty quantification has been a core of the statistical machine learning, but its computational bottleneck has been a serious challenge for both Bayesians and frequentists. We propose a model-based framework in quantifying uncertainty, called predictive-matching Generative Parameter Sampler (GPS). This procedure considers an Uncertainty Quantification (UQ) distribution on the targeted parameter, which matches the corresponding predictive distribution to the observed data. This framework adopts a hierarchical modeling perspective such that each observation is modeled by an individual parameter. This individual parameterization permits the resulting inference to be computationally scalable and robust to outliers. Our approach is illustrated for linear models, Poisson processes, and deep neural networks for classification. The results show that the GPS is successful in providing uncertainty quantification as well as additional flexibility beyond what is allowed by classical statistical procedures under the postulated statistical models. △ Less

Submitted 2 June, 2019; v1 submitted 28 May, 2019; originally announced May 2019.

arXiv:1810.02658 [pdf, other]

doi 10.3390/e22030291

IMMIGRATE: A Margin-based Feature Selection Method with Interaction Terms

Authors: Ruzhang Zhao, Pengyu Hong, Jun S Liu

Abstract: Relief based algorithms have often been claimed to uncover feature interactions. However, it is still unclear whether and how interaction terms will be differentiated from marginal effects. In this paper, we propose IMMIGRATE algorithm by including and training weights for interaction terms. Besides applying the large margin principle, we focus on the robustness of the contributors of margin and c… ▽ More Relief based algorithms have often been claimed to uncover feature interactions. However, it is still unclear whether and how interaction terms will be differentiated from marginal effects. In this paper, we propose IMMIGRATE algorithm by including and training weights for interaction terms. Besides applying the large margin principle, we focus on the robustness of the contributors of margin and consider local and global information simultaneously. Moreover, IMMIGRATE has been shown to enjoy attractive properties, such as robustness and combination with Boosting. We evaluate our proposed method on several tasks, which achieves state-of-the-art results significantly. △ Less

Submitted 3 March, 2020; v1 submitted 5 October, 2018; originally announced October 2018.

Comments: R package ('Immigrate') available on CRAN

Journal ref: Entropy. 2020; 22(3):291

arXiv:1810.00141 [pdf, other]

Neuronized Priors for Bayesian Sparse Linear Regression

Authors: Minsuk Shin, Jun S Liu

Abstract: Although Bayesian variable selection methods have been intensively studied, their routine use in practice has not caught up with their non-Bayesian counterparts such as Lasso, likely due to difficulties in both computations and flexibilities of prior choices. To ease these challenges, we propose the neuronized priors to unify and extend some popular shrinkage priors, such as Laplace, Cauchy, horse… ▽ More Although Bayesian variable selection methods have been intensively studied, their routine use in practice has not caught up with their non-Bayesian counterparts such as Lasso, likely due to difficulties in both computations and flexibilities of prior choices. To ease these challenges, we propose the neuronized priors to unify and extend some popular shrinkage priors, such as Laplace, Cauchy, horseshoe, and spike-and-slab priors. A neuronized prior can be written as the product of a Gaussian weight variable and a scale variable transformed from Gaussian via an activation function. Compared with classic spike-and-slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variables, which results in both more efficient and flexible posterior sampling and more effective posterior modal estimation. Theoretically, we provide specific conditions on the neuronized formulation to achieve the optimal posterior contraction rate, and show that a broadly applicable MCMC algorithm achieves an exponentially fast convergence rate under the neuronized formulation. We also examine various simulated and real data examples and demonstrate that using the neuronization representation is computationally more or comparably efficient than its standard counterpart in all well-known cases. An R package NPrior is provided in the CRAN for using neuronized priors in Bayesian linear regression. △ Less

Submitted 5 July, 2021; v1 submitted 28 September, 2018; originally announced October 2018.

arXiv:1808.06109 [pdf, other]

Bayesian Hidden Markov Tree Models for Clustering Genes with Shared Evolutionary History

Authors: Yang Li, Shaoyang Ning, Sarah E. Calvo, Vamsi K. Mootha, Jun S. Liu

Abstract: Determination of functions for poorly characterized genes is crucial for understanding biological processes and studying human diseases. Functionally associated genes are often gained and lost together through evolution. Therefore identifying co-evolution of genes can predict functional gene-gene associations. We describe here the full statistical model and computational strategies underlying the… ▽ More Determination of functions for poorly characterized genes is crucial for understanding biological processes and studying human diseases. Functionally associated genes are often gained and lost together through evolution. Therefore identifying co-evolution of genes can predict functional gene-gene associations. We describe here the full statistical model and computational strategies underlying the original algorithm, CLustering by Inferred Models of Evolution (CLIME 1.0) recently reported by us [Li et al., 2014]. CLIME 1.0 employs a mixture of tree-structured hidden Markov models for gene evolution process, and a Bayesian model-based clustering algorithm to detect gene modules with shared evolutionary histories (termed evolutionary conserved modules, or ECMs). A Dirichlet process prior was adopted for estimating the number of gene clusters and a Gibbs sampler was developed for posterior sampling. We further developed an extended version, CLIME 1.1, to incorporate the uncertainty on the evolutionary tree structure. By simulation studies and benchmarks on real data sets, we show that CLIME 1.0 and CLIME 1.1 outperform traditional methods that use simple metrics (e.g., the Hamming distance or Pearson correlation) to measure co-evolution between pairs of genes. △ Less

Submitted 18 August, 2018; originally announced August 2018.

Comments: 34 pages, 8 figures

arXiv:1807.01635 [pdf, other]

Randomization Inference for Peer Effects

Authors: Xinran Li, Peng Ding, Qian Lin, Dawei Yang, Jun S. Liu

Abstract: Many previous causal inference studies require no interference, that is, the potential outcomes of a unit do not depend on the treatments of other units. However, this no-interference assumption becomes unreasonable when a unit interacts with other units in the same group or cluster. In a motivating application, a university in China admits students through two channels: the college entrance exam… ▽ More Many previous causal inference studies require no interference, that is, the potential outcomes of a unit do not depend on the treatments of other units. However, this no-interference assumption becomes unreasonable when a unit interacts with other units in the same group or cluster. In a motivating application, a university in China admits students through two channels: the college entrance exam (also known as Gaokao) and recommendation (often based on Olympiads in various subjects). The university randomly assigns students to dorms, each of which hosts four students. Students within the same dorm live together and have extensive interactions. Therefore, it is likely that peer effects exist and the no-interference assumption does not hold. It is important to understand peer effects, because they give useful guidance for future roommate assignment to improve the performance of students. We define peer effects using potential outcomes. We then propose a randomization-based inference framework to study peer effects with arbitrary numbers of peers and peer types. Our inferential procedure does not assume any parametric model on the outcome distribution. Our analysis gives useful practical guidance for policy makers of the university in China. △ Less

Submitted 20 December, 2018; v1 submitted 4 July, 2018; originally announced July 2018.

arXiv:1611.08649 [pdf, other]

Robust Variable and Interaction Selection for Logistic Regression and Multiple Index Models

Authors: Yang Li, Jun S. Liu

Abstract: We propose Stepwise cOnditional likelihood variable selection for Discriminant Analysis (SODA) to detect both main and quadratic interaction effects in logistic regression and quadratic discriminant analysis (QDA) models. In the forward stage, SODA adds in important predictors evaluated based on their overall contributions, whereas in the backward stage SODA removes unimportant terms so as to opti… ▽ More We propose Stepwise cOnditional likelihood variable selection for Discriminant Analysis (SODA) to detect both main and quadratic interaction effects in logistic regression and quadratic discriminant analysis (QDA) models. In the forward stage, SODA adds in important predictors evaluated based on their overall contributions, whereas in the backward stage SODA removes unimportant terms so as to optimize the extended Bayesian Information Criterion (EBIC). Compared with existing methods on QDA variable selections, SODA can deal with high-dimensional data with the number of predictors much larger than the sample size and does not require the joint normality assumption on predictors, leading to much enhanced robustness. We further extend SODA to conduct variable selection and model fitting for multiple index models. Compared with existing variable selection methods based on the Sliced Inverse Regression (SIR) (Li 1991), SODA requires neither the linearity nor the constant variance condition and is much more robust. Our theoretical analyses establish the variable-selection consistency of SODA under high-dimensional settings, and our simulation studies as well as real-data applications demonstrate superior performances of SODA in dealing with non-Gaussian design matrices in both classification problems and multiple index models. △ Less

Submitted 29 May, 2017; v1 submitted 25 November, 2016; originally announced November 2016.

arXiv:1611.06655 [pdf, ps, other]

Sparse Sliced Inverse Regression Via Lasso

Authors: Qian Lin, Zhigen Zhao, Jun S. Liu

Abstract: For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if $ρ=\lim\frac{p}{n}=0$, where $p$ is the dimension and $n$ is the sample size. Thus, when $p$ is of the same or a higher order of $n$, additional assumptions such as sparsity must be imposed in order to ensure consi… ▽ More For multiple index models, it has recently been shown that the sliced inverse regression (SIR) is consistent for estimating the sufficient dimension reduction (SDR) space if and only if $ρ=\lim\frac{p}{n}=0$, where $p$ is the dimension and $n$ is the sample size. Thus, when $p$ is of the same or a higher order of $n$, additional assumptions such as sparsity must be imposed in order to ensure consistency for SIR. By constructing artificial response variables made up from top eigenvectors of the estimated conditional covariance matrix, we introduce a simple Lasso regression method to obtain an estimate of the SDR space. The resulting algorithm, Lasso-SIR, is shown to be consistent and achieve the optimal convergence rate under certain sparsity conditions when $p$ is of order $o(n^2λ^2)$, where $λ$ is the generalized signal-to-noise ratio. We also demonstrate the superior performance of Lasso-SIR compared with existing approaches via extensive numerical studies and several real data examples. △ Less

Submitted 17 June, 2018; v1 submitted 21 November, 2016; originally announced November 2016.

Comments: 41 pages, 2 figures

MSC Class: 62J02 (Primary); 62H25 (Secondary)

arXiv:1607.06051 [pdf, other]

Bayesian Analysis of Rank Data with Covariates and Heterogeneous Rankers

Authors: Xinran Li, Dingdong Yi, Jun S. Liu

Abstract: Data in the form of ranking lists are frequently encountered, and combining ranking results from different sources can potentially generate a better ranking list and help understand behaviors of the rankers. Of interest here are the rank data under the following settings: (i) covariate information available for the ranked entities; (ii) rankers of varying qualities or having different opinions; an… ▽ More Data in the form of ranking lists are frequently encountered, and combining ranking results from different sources can potentially generate a better ranking list and help understand behaviors of the rankers. Of interest here are the rank data under the following settings: (i) covariate information available for the ranked entities; (ii) rankers of varying qualities or having different opinions; and (iii) incomplete ranking lists for non-overlap** subgroups. We review some key ideas built around the Thurstone model family by researchers in the past few decades and provide a unifying approach for Bayesian Analysis of Rank data with Covariates (BARC) and its extensions in handling heterogeneous rankers. With this Bayesian framework, we can study rankers' varying quality, cluster rankers' heterogeneous opinions, and measure the corresponding uncertainties. To enable an efficient Bayesian inference, we advocate a parameter-expanded Gibbs sampler to sample from the target posterior distribution. The posterior samples also result in a Bayesian aggregated ranking list, with credible intervals quantifying its uncertainty. We investigate and compare performances of the proposed methods and other rank aggregation methods in both simulation studies and two real-data examples. △ Less

Submitted 16 July, 2020; v1 submitted 20 July, 2016; originally announced July 2016.

arXiv:1604.02736 [pdf, other]

Generalized R-squared for Detecting Dependence

Authors: Xufei Wang, Bo Jiang, Jun S. Liu

Abstract: Detecting dependence between two random variables is a fundamental problem. Although the Pearson correlation is effective for capturing linear dependency, it can be entirely powerless for detecting nonlinear and/or heteroscedastic patterns. We introduce a new measure, G-squared, to test whether two univariate random variables are independent and to measure the strength of their relationship. The G… ▽ More Detecting dependence between two random variables is a fundamental problem. Although the Pearson correlation is effective for capturing linear dependency, it can be entirely powerless for detecting nonlinear and/or heteroscedastic patterns. We introduce a new measure, G-squared, to test whether two univariate random variables are independent and to measure the strength of their relationship. The G-squared is almost identical to the square of the Pearson correlation coefficient, R-squared, for linear relationships with constant error variance, and has the intuitive meaning of the piecewise R-squared between the variables. It is particularly effective in handling nonlinearity and heteroscedastic errors. We propose two estimators of G-squared and show their consistency. Simulations demonstrate that G-squared estimates are among the most powerful test statistics compared with several state-of-the-art methods. △ Less

Submitted 17 November, 2016; v1 submitted 10 April, 2016; originally announced April 2016.

arXiv:1511.08102 [pdf, other]

L1-Regularized Least Squares for Support Recovery of High Dimensional Single Index Models with Gaussian Designs

Authors: Matey Neykov, Jun S. Liu, Tianxi Cai

Abstract: It is known that for a certain class of single index models (SIMs) $Y = f(\boldsymbol{X}_{p \times 1}^\intercal\boldsymbolβ_0, \varepsilon)$, support recovery is impossible when $\boldsymbol{X} \sim \mathcal{N}(0, \mathbb{I}_{p \times p})$ and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested.… ▽ More It is known that for a certain class of single index models (SIMs) $Y = f(\boldsymbol{X}_{p \times 1}^\intercal\boldsymbolβ_0, \varepsilon)$, support recovery is impossible when $\boldsymbol{X} \sim \mathcal{N}(0, \mathbb{I}_{p \times p})$ and a model complexity adjusted sample size is below a critical threshold. Recently, optimal algorithms based on Sliced Inverse Regression (SIR) were suggested. These algorithms work provably under the assumption that the design $\boldsymbol{X}$ comes from an i.i.d. Gaussian distribution. In the present paper we analyze algorithms based on covariance screening and least squares with $L_1$ penalization (i.e. LASSO) and demonstrate that they can also enjoy optimal (up to a scalar) rescaled sample size in terms of support recovery, albeit under slightly different assumptions on $f$ and $\varepsilon$ compared to the SIR based algorithms. Furthermore, we show more generally, that LASSO succeeds in recovering the signed support of $\boldsymbolβ_0$ if $\boldsymbol{X} \sim \mathcal{N}(0, \boldsymbolΣ)$, and the covariance $\boldsymbolΣ$ satisfies the irrepresentable condition. Our work extends existing results on the support recovery of LASSO for the linear model, to a more general class of SIMs. △ Less

Submitted 22 June, 2016; v1 submitted 25 November, 2015; originally announced November 2015.

Comments: 36 pages; 6 figures; typos corrected; clearer notation introduced

arXiv:1511.02270 [pdf, other]

Signed Support Recovery for Single Index Models in High-Dimensions

Authors: Matey Neykov, Qian Lin, Jun S. Liu

Abstract: In this paper we study the support recovery problem for single index models $Y=f(\boldsymbol{X}^{\intercal} \boldsymbolβ,\varepsilon)$, where $f$ is an unknown link function, $\boldsymbol{X}\sim N_p(0,\mathbb{I}_{p})$ and $\boldsymbolβ$ is an $s$-sparse unit vector such that $\boldsymbolβ_{i}\in \{\pm\frac{1}{\sqrt{s}},0\}$. In particular, we look into the performance of two computationally inexpe… ▽ More In this paper we study the support recovery problem for single index models $Y=f(\boldsymbol{X}^{\intercal} \boldsymbolβ,\varepsilon)$, where $f$ is an unknown link function, $\boldsymbol{X}\sim N_p(0,\mathbb{I}_{p})$ and $\boldsymbolβ$ is an $s$-sparse unit vector such that $\boldsymbolβ_{i}\in \{\pm\frac{1}{\sqrt{s}},0\}$. In particular, we look into the performance of two computationally inexpensive algorithms: (a) the diagonal thresholding sliced inverse regression (DT-SIR) introduced by Lin et al. (2015); and (b) a semi-definite programming (SDP) approach inspired by Amini & Wainwright (2008). When $s=O(p^{1-δ})$ for some $δ>0$, we demonstrate that both procedures can succeed in recovering the support of $\boldsymbolβ$ as long as the rescaled sample size $κ=\frac{n}{s\log(p-s)}$ is larger than a certain critical threshold. On the other hand, when $κ$ is smaller than a critical value, any algorithm fails to recover the support with probability at least $\frac{1}{2}$ asymptotically. In other words, we demonstrate that both DT-SIR and the SDP approach are optimal (up to a scalar) for recovering the support of $\boldsymbolβ$ in terms of sample size. We provide extensive simulations, as well as a real dataset application to help verify our theoretical observations. △ Less

Submitted 22 June, 2016; v1 submitted 6 November, 2015; originally announced November 2015.

Comments: 38 pages, 7 figures; 1 table; data set analysis added; typos corrected

arXiv:1510.08986 [pdf, other]

A Unified Theory of Confidence Regions and Testing for High Dimensional Estimating Equations

Authors: Matey Neykov, Yang Ning, Jun S. Liu, Han Liu

Abstract: We propose a new inferential framework for constructing confidence regions and testing hypotheses in statistical models specified by a system of high dimensional estimating equations. We construct an influence function by projecting the fitted estimating equations to a sparse direction obtained by solving a large-scale linear program. Our main theoretical contribution is to establish a unified Z-e… ▽ More We propose a new inferential framework for constructing confidence regions and testing hypotheses in statistical models specified by a system of high dimensional estimating equations. We construct an influence function by projecting the fitted estimating equations to a sparse direction obtained by solving a large-scale linear program. Our main theoretical contribution is to establish a unified Z-estimation theory of confidence regions for high dimensional problems. Different from existing methods, all of which require the specification of the likelihood or pseudo-likelihood, our framework is likelihood-free. As a result, our approach provides valid inference for a broad class of high dimensional constrained estimating equation problems, which are not covered by existing methods. Such examples include, noisy compressed sensing, instrumental variable regression, undirected graphical models, discriminant analysis and vector autoregressive models. We present detailed theoretical results for all these examples. Finally, we conduct thorough numerical simulations, and a real dataset analysis to back up the developed theoretical results. △ Less

Submitted 22 June, 2016; v1 submitted 30 October, 2015; originally announced October 2015.

Comments: 67 pages, 2 tables, 1 figure

arXiv:1510.07158 [pdf, other]

Fast Parameter Estimation in Loss Tomography for Networks of General Topology

Authors: Ke Deng, Yang Li, Wei** Zhu, Jun S. Liu

Abstract: As a technique to investigate link-level loss rates of a computer network with low operational cost, loss tomography has received considerable attentions in recent years. A number of parameter estimation methods have been proposed for loss tomography of networks with a tree structure as well as a general topological structure. However, these methods suffer from either high computational cost or in… ▽ More As a technique to investigate link-level loss rates of a computer network with low operational cost, loss tomography has received considerable attentions in recent years. A number of parameter estimation methods have been proposed for loss tomography of networks with a tree structure as well as a general topological structure. However, these methods suffer from either high computational cost or insufficient use of information in the data. In this paper, we provide both theoretical results and practical algorithms for parameter estimation in loss tomography. By introducing a group of novel statistics and alternative parameter systems, we find that the likelihood function of the observed data from loss tomography keeps exactly the same mathematical formulation for tree and general topologies, revealing that networks with different topologies share the same mathematical nature for loss tomography. More importantly, we discover that a re-parametrization of the likelihood function belongs to the standard exponential family, which is convex and has a unique mode under regularity conditions. Based on these theoretical results, novel algorithms to find the MLE are developed. Compared to existing methods in the literature, the proposed methods enjoy great computational advantages. △ Less

Submitted 24 October, 2015; originally announced October 2015.

Comments: To appear in Annals of Applied Statistics

arXiv:1506.08852 [pdf, other]

Locally weighted Markov chain Monte Carlo

Authors: Espen Bernton, Shihao Yang, Yang Chen, Neil Shephard, Jun S. Liu

Abstract: We propose a weighting scheme for the proposals within Markov chain Monte Carlo algorithms and show how this can improve statistical efficiency at no extra computational cost. These methods are most powerful when combined with multi-proposal MCMC algorithms such as multiple-try Metropolis, which can efficiently exploit modern computer architectures with large numbers of cores. The locally weighted… ▽ More We propose a weighting scheme for the proposals within Markov chain Monte Carlo algorithms and show how this can improve statistical efficiency at no extra computational cost. These methods are most powerful when combined with multi-proposal MCMC algorithms such as multiple-try Metropolis, which can efficiently exploit modern computer architectures with large numbers of cores. The locally weighted Markov chain Monte Carlo method also improves upon a partial parallelization of the Metropolis-Hastings algorithm via Rao-Blackwellization. We derive the effective sample size of the output of our algorithm and show how to estimate this in practice. Illustrations and examples of the method are given and the algorithm is compared in theory and applications with existing methods. △ Less

Submitted 29 June, 2015; originally announced June 2015.

arXiv:1506.02371 [pdf, other]

Interpretable Selection and Visualization of Features and Interactions Using Bayesian Forests

Authors: Viktoriya Krakovna, Jiong Du, Jun S. Liu

Abstract: It is becoming increasingly important for machine learning methods to make predictions that are interpretable as well as accurate. In many practical applications, it is of interest which features and feature interactions are relevant to the prediction task. We present a novel method, Selective Bayesian Forest Classifier, that strikes a balance between predictive power and interpretability by simul… ▽ More It is becoming increasingly important for machine learning methods to make predictions that are interpretable as well as accurate. In many practical applications, it is of interest which features and feature interactions are relevant to the prediction task. We present a novel method, Selective Bayesian Forest Classifier, that strikes a balance between predictive power and interpretability by simultaneously performing classification, feature selection, feature interaction detection and visualization. It builds parsimonious yet flexible models using tree-structured Bayesian networks, and samples an ensemble of such models using Markov chain Monte Carlo. We build in feature selection by dividing the trees into two groups according to their relevance to the outcome of interest. Our method performs competitively on classification and feature selection benchmarks in low and high dimensions, and includes a visualization tool that provides insight into relevant features and interactions. △ Less

Submitted 7 February, 2016; v1 submitted 8 June, 2015; originally announced June 2015.

Comments: R package: github.com/vkrakovna/sbfc

arXiv:1411.3070 [pdf, other]

Bayesian nonparametric tests via sliced inverse modeling

Authors: Bo Jiang, Chao Ye, Jun S. Liu

Abstract: We study the problem of independence and conditional independence tests between categorical covariates and a continuous response variable, which has an immediate application in genetics. Instead of estimating the conditional distribution of the response given values of covariates, we model the conditional distribution of covariates given the discretized response (aka "slices"). By assigning a prio… ▽ More We study the problem of independence and conditional independence tests between categorical covariates and a continuous response variable, which has an immediate application in genetics. Instead of estimating the conditional distribution of the response given values of covariates, we model the conditional distribution of covariates given the discretized response (aka "slices"). By assigning a prior probability to each possible discretization scheme, we can compute efficiently a Bayes factor (BF)-statistic for the independence (or conditional independence) test using a dynamic programming algorithm. Asymptotic and finite-sample properties such as power and null distribution of the BF statistic are studied, and a stepwise variable selection method based on the BF statistic is further developed. We compare the BF statistic with some existing classical methods and demonstrate its statistical power through extensive simulation studies. We apply the proposed method to a mouse genetics data set aiming to detect quantitative trait loci (QTLs) and obtain promising results. △ Less

Submitted 1 May, 2015; v1 submitted 12 November, 2014; originally announced November 2014.

Comments: 32 pages, 7 figures

arXiv:1304.4056 [pdf, ps, other]

doi 10.1214/14-AOS1233

Variable selection for general index models via sliced inverse regression

Authors: Bo Jiang, Jun S. Liu

Abstract: Variable selection, also known as feature selection in machine learning, plays an important role in modeling high dimensional data and is key to data-driven scientific discoveries. We consider here the problem of detecting influential variables under the general index model, in which the response is dependent of predictors through an unknown function of one or more linear combinations of them. Ins… ▽ More Variable selection, also known as feature selection in machine learning, plays an important role in modeling high dimensional data and is key to data-driven scientific discoveries. We consider here the problem of detecting influential variables under the general index model, in which the response is dependent of predictors through an unknown function of one or more linear combinations of them. Instead of building a predictive model of the response given combinations of predictors, we model the conditional distribution of predictors given the response. This inverse modeling perspective motivates us to propose a stepwise procedure based on likelihood-ratio tests, which is effective and computationally efficient in identifying important variables without specifying a parametric relationship between predictors and the response. For example, the proposed procedure is able to detect variables with pairwise, three-way or even higher-order interactions among $p$ predictors with a computational time of $O(p)$ instead of $O(p^k)$ (with $k$ being the highest order of interactions). Its excellent empirical performance in comparison with existing methods is demonstrated through simulation studies as well as real data examples. Consistency of the variable selection procedure when both the number of predictors and the sample size go to infinity is established. △ Less

Submitted 23 September, 2014; v1 submitted 15 April, 2013; originally announced April 2013.

Comments: Published in at http://dx.doi.org/10.1214/14-AOS1233 the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOS-AOS1233

Journal ref: Annals of Statistics 2014, Vol. 42, No. 5, 1751-1786

arXiv:1302.5206 [pdf, ps, other]

doi 10.1214/12-STS401

Lookahead Strategies for Sequential Monte Carlo

Authors: Ming Lin, Rong Chen, Jun S. Liu

Abstract: Based on the principles of importance sampling and resampling, sequential Monte Carlo (SMC) encompasses a large set of powerful techniques dealing with complex stochastic dynamic systems. Many of these systems possess strong memory, with which future information can help sharpen the inference about the current state. By providing theoretical justification of several existing algorithms and introdu… ▽ More Based on the principles of importance sampling and resampling, sequential Monte Carlo (SMC) encompasses a large set of powerful techniques dealing with complex stochastic dynamic systems. Many of these systems possess strong memory, with which future information can help sharpen the inference about the current state. By providing theoretical justification of several existing algorithms and introducing several new ones, we study systematically how to construct efficient SMC algorithms to take advantage of the "future" information without creating a substantially high computational burden. The main idea is to allow for lookahead in the Monte Carlo process so that future information can be utilized in weighting and generating Monte Carlo samples, or resampling from samples of the current state. △ Less

Submitted 21 February, 2013; originally announced February 2013.

Comments: Published in at http://dx.doi.org/10.1214/12-STS401 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS401

Journal ref: Statistical Science 2013, Vol. 28, No. 1, 69-94

arXiv:1111.5972 [pdf, ps, other]

doi 10.1214/11-AOAS469

Block-based Bayesian epistasis association map** with application to WTCCC type 1 diabetes data

Authors: Yu Zhang, **g Zhang, Jun S. Liu

Abstract: Interactions among multiple genes across the genome may contribute to the risks of many complex human diseases. Whole-genome single nucleotide polymorphisms (SNPs) data collected for many thousands of SNP markers from thousands of individuals under the case--control design promise to shed light on our understanding of such interactions. However, nearby SNPs are highly correlated due to linkage dis… ▽ More Interactions among multiple genes across the genome may contribute to the risks of many complex human diseases. Whole-genome single nucleotide polymorphisms (SNPs) data collected for many thousands of SNP markers from thousands of individuals under the case--control design promise to shed light on our understanding of such interactions. However, nearby SNPs are highly correlated due to linkage disequilibrium (LD) and the number of possible interactions is too large for exhaustive evaluation. We propose a novel Bayesian method for simultaneously partitioning SNPs into LD-blocks and selecting SNPs within blocks that are associated with the disease, either individually or interactively with other SNPs. When applied to homogeneous population data, the method gives posterior probabilities for LD-block boundaries, which not only result in accurate block partitions of SNPs, but also provide measures of partition uncertainty. When applied to case--control data for association map**, the method implicitly filters out SNP associations created merely by LD with disease loci within the same blocks. Simulation study showed that this approach is more powerful in detecting multi-locus associations than other methods we tested, including one of ours. When applied to the WTCCC type 1 diabetes data, the method identified many previously known T1D associated genes, including PTPN22, CTLA4, MHC, and IL2RA. △ Less

Submitted 25 November, 2011; originally announced November 2011.

Comments: Published in at http://dx.doi.org/10.1214/11-AOAS469 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS469

Journal ref: Annals of Applied Statistics 2011, Vol. 5, No. 3, 2052-2077

arXiv:1104.2180 [pdf, ps, other]

doi 10.1214/09-STS312

The EM Algorithm and the Rise of Computational Biology

Authors: Xiaodan Fan, Yuan Yuan, Jun S. Liu

Abstract: In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology prob… ▽ More In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively-minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma"; of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. △ Less

Submitted 12 April, 2011; originally announced April 2011.

Comments: Published in at http://dx.doi.org/10.1214/09-STS312 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-STS-STS312

Journal ref: Statistical Science 2010, Vol. 25, No. 4, 476-491

arXiv:1011.2104 [pdf, ps, other]

doi 10.1214/09-AOAS300

Bayesian meta-analysis for identifying periodically expressed genes in fission yeast cell cycle

Authors: Xiaodan Fan, Saumyadipta Pyne, Jun S. Liu

Abstract: The effort to identify genes with periodic expression during the cell cycle from genome-wide microarray time series data has been ongoing for a decade. However, the lack of rigorous modeling of periodic expression as well as the lack of a comprehensive model for integrating information across genes and experiments has impaired the effort for the accurate identification of periodically expressed ge… ▽ More The effort to identify genes with periodic expression during the cell cycle from genome-wide microarray time series data has been ongoing for a decade. However, the lack of rigorous modeling of periodic expression as well as the lack of a comprehensive model for integrating information across genes and experiments has impaired the effort for the accurate identification of periodically expressed genes. To address the problem, we introduce a Bayesian model to integrate multiple independent microarray data sets from three recent genome-wide cell cycle studies on fission yeast. A hierarchical model was used for data integration. In order to facilitate an efficient Monte Carlo sampling from the joint posterior distribution, we develop a novel Metropolis--Hastings group move. A surprising finding from our integrated analysis is that more than 40% of the genes in fission yeast are significantly periodically expressed, greatly enhancing the reported 10--15% of the genes in the current literature. It calls for a reconsideration of the periodically expressed gene detection problem. △ Less

Submitted 9 November, 2010; originally announced November 2010.

Comments: Published in at http://dx.doi.org/10.1214/09-AOAS300 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS300

Journal ref: Annals of Applied Statistics 2010, Vol. 4, No. 2, 988-1013

arXiv:1005.5483 [pdf, ps, other]

doi 10.1111/rssb.12023

Model Selection Principles in Misspecified Models

Authors: **chi Lv, Jun S. Liu

Abstract: Model selection is of fundamental importance to high dimensional modeling featured in many contemporary applications. Classical principles of model selection include the Kullback-Leibler divergence principle and the Bayesian principle, which lead to the Akaike information criterion and Bayesian information criterion when models are correctly specified. Yet model misspecification is unavoidable whe… ▽ More Model selection is of fundamental importance to high dimensional modeling featured in many contemporary applications. Classical principles of model selection include the Kullback-Leibler divergence principle and the Bayesian principle, which lead to the Akaike information criterion and Bayesian information criterion when models are correctly specified. Yet model misspecification is unavoidable when we have no knowledge of the true model or when we have the correct family of distributions but miss some true predictor. In this paper, we propose a family of semi-Bayesian principles for model selection in misspecified models, which combine the strengths of the two well-known principles. We derive asymptotic expansions of the semi-Bayesian principles in misspecified generalized linear models, which give the new semi-Bayesian information criteria (SIC). A specific form of SIC admits a natural decomposition into the negative maximum quasi-log-likelihood, a penalty on model dimensionality, and a penalty on model misspecification directly. Numerical studies demonstrate the advantage of the newly proposed SIC methodology for model selection in both correctly specified and misspecified models. △ Less

Submitted 11 May, 2016; v1 submitted 29 May, 2010; originally announced May 2010.

Comments: 25 pages, 6 tables

MSC Class: 62J12(Primary); 62B10; 62F07; 62F15; 62J07(Secondary)

Journal ref: Journal of the Royal Statistical Society Series B 76, 141-167 (2014)

arXiv:0910.2090 [pdf, ps, other]

doi 10.1214/09-AOAS248

Doubly stochastic continuous-time hidden Markov approach for analyzing genome tiling arrays

Authors: W. Evan Johnson, X. Shirley Liu, Jun S. Liu

Abstract: Microarrays have been developed that tile the entire nonrepetitive genomes of many different organisms, allowing for the unbiased map** of active transcription regions or protein binding sites across the entire genome. These tiling array experiments produce massive correlated data sets that have many experimental artifacts, presenting many challenges to researchers that require innovative anal… ▽ More Microarrays have been developed that tile the entire nonrepetitive genomes of many different organisms, allowing for the unbiased map** of active transcription regions or protein binding sites across the entire genome. These tiling array experiments produce massive correlated data sets that have many experimental artifacts, presenting many challenges to researchers that require innovative analysis methods and efficient computational algorithms. This paper presents a doubly stochastic latent variable analysis method for transcript discovery and protein binding region localization using tiling array data. This model is unique in that it considers actual genomic distance between probes. Additionally, the model is designed to be robust to cross-hybridized and nonresponsive probes, which can often lead to false-positive results in microarray experiments. We apply our model to a transcript finding data set to illustrate the consistency of our method. Additionally, we apply our method to a spike-in experiment that can be used as a benchmark data set for researchers interested in develo** and comparing future tiling array methods. The results indicate that our method is very powerful, accurate and can be used on a single sample and without control experiments, thus defraying some of the overhead cost of conducting experiments on tiling arrays. △ Less

Submitted 12 October, 2009; originally announced October 2009.

Comments: Published in at http://dx.doi.org/10.1214/09-AOAS248 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

Report number: IMS-AOAS-AOAS248

Journal ref: Annals of Applied Statistics 2009, Vol. 3, No. 3, 1183-1203

arXiv:math/0610655 [pdf, ps, other]

Bayesian Clustering of Transcription Factor Binding Motifs

Authors: Shane T. Jensen, Jun S. Liu

Abstract: Genes are often regulated in living cells by proteins called transcription factors (TFs) that bind directly to short segments of DNA in close proximity to specific genes. These binding sites have a conserved nucleotide appearance, which is called a motif. Several recent studies of transcriptional regulation require the reduction of a large collection of motifs into clusters based on the similari… ▽ More Genes are often regulated in living cells by proteins called transcription factors (TFs) that bind directly to short segments of DNA in close proximity to specific genes. These binding sites have a conserved nucleotide appearance, which is called a motif. Several recent studies of transcriptional regulation require the reduction of a large collection of motifs into clusters based on the similarity of their nucleotide composition. We present a principled approach to this clustering problem based upon a Bayesian hierarchical model that accounts for both within- and between-motif variability. We use a Dirichlet process prior distribution that allows the number of clusters to vary and we also present a novel generalization that allows the core width of each motif to vary. This clustering model is implemented, using a Gibbs sampling strategy, on several collections of transcription factor motif matrices. Our clusters provide a means by which to organize transcription factors based on binding motif similarities, which can be used to reduce motif redundancy within large databases such as JASPAR and TRANSFAC. Finally, our clustering procedure has been used in combination with discovery of evolutionarily-conserved motifs to predict co-regulated genes. An alternative to our Dirichlet process prior distribution is explored but shows no substantive difference in the clustering results for our datasets. Our Bayesian clustering model based on the Dirichlet process has several advantages over traditional clustering methods that could make our procedure appropriate and useful for many clustering applications. △ Less

Submitted 21 October, 2006; originally announced October 2006.

Comments: Submitted to the Journal of the American Statistical Association

Showing 1–46 of 46 results for author: Liu, J S