Causal Customer Churn Analysis with Low-rank Tensor Block Hazard Model

Authors: Chenyin Gao, Zhiming Zhang, Shu Yang

Abstract: This study introduces an innovative method for analyzing the impact of various interventions on customer churn, using the potential outcomes framework. We present a new causal model, the tensorized latent factor block hazard model, which incorporates tensor completion methods for a principled causal analysis of customer churn. A crucial element of our approach is the formulation of a 1-bit tensor completion for the parameter tensor. This captures hidden customer characteristics and temporal elements from churn records, effectively addressing the binary nature of churn data and its time-monotonic trends. Our model also uniquely categorizes interventions by their similar impacts, enhancing the precision and practicality of implementing customer retention strategies. For computational efficiency, we apply a projected gradient descent algorithm combined with spectral clustering. We lay down the theoretical groundwork for our model, including its non-asymptotic properties. The efficacy and superiority of our model are further validated through comprehensive experiments on both simulated and real-world applications.

Submitted 18 May, 2024; originally announced May 2024.

Comments: Accepted for publication in ICML, 2024

arXiv:2402.01143 [pdf, other]

Learning Network Representations with Disentangled Graph Auto-Encoder

Authors: Di Fan, Chuanhou Gao

Abstract: The (variational) graph auto-encoder is extensively employed for learning representations of graph-structured data. However, the formation of real-world graphs is a complex and heterogeneous process influenced by latent factors. Existing encoders are fundamentally holistic, neglecting the entanglement of latent factors. This not only makes graph analysis tasks less effective but also makes it harder to understand and explain the representations. Learning disentangled graph representations with (variational) graph auto-encoder poses significant challenges, and remains largely unexplored in the existing literature. In this article, we introduce the Disentangled Graph Auto-Encoder (DGA) and Disentangled Variational Graph Auto-Encoder (DVGA), approaches that leverage generative models to learn disentangled representations. Specifically, we first design a disentangled graph convolutional network with multi-channel message-passing layers, as the encoder aggregating information related to each disentangled latent factor. Subsequently, a component-wise flow is applied to each channel to enhance the expressive capabilities of disentangled variational graph auto-encoder. Additionally, we design a factor-wise decoder, considering the characteristics of disentangled representations. In order to further enhance the independence among representations, we introduce independence constraints on mapping channels for different latent factors. Empirical experiments on both synthetic and real-world datasets show the superiority of our proposed method compared to several state-of-the-art baselines.

Submitted 1 February, 2024; originally announced February 2024.

Comments: 61 pages, 13 figures

arXiv:2401.06350 [pdf, ps, other]

Optimal estimation of the null distribution in large-scale inference

Authors: Subhodh Kotekal, Chao Gao

Abstract: The advent of large-scale inference has spurred reexamination of conventional statistical thinking. In a Gaussian model for $n$ many $z$-scores with at most $k < \frac{n}{2}$ nonnulls, Efron suggests estimating the location and scale parameters of the null distribution. Placing no assumptions on the nonnull effects, the statistical task can be viewed as a robust estimation problem. However, the best known robust estimators fail to be consistent in the regime $k \asymp n$ which is especially relevant in large-scale inference. The failure of estimators which are minimax rate-optimal with respect to other formulations of robustness (e.g. Huber's contamination model) might suggest the impossibility of consistent estimation in this regime and, consequently, a major weakness of Efron's suggestion. A sound evaluation of Efron's model thus requires a complete understanding of consistency. We sharply characterize the regime of $k$ for which consistent estimation is possible and further establish the minimax estimation rates. It is shown consistent estimation of the location parameter is possible if and only if $\frac{n}{2} - k = \omega(\sqrt{n})$, and consistent estimation of the scale parameter is possible in the entire regime $k < \frac{n}{2}$. Faster rates than those in Huber's contamination model are achievable by exploiting the Gaussian character of the data. The minimax upper bound is obtained by considering estimators based on the empirical characteristic function. The minimax lower bound involves constructing two marginal distributions whose characteristic functions match on a wide interval containing zero. The construction notably differs from those in the literature by sharply capturing a scaling of $n-2k$ in the minimax estimation rate of the location.

Submitted 11 January, 2024; originally announced January 2024.

arXiv:2312.09356 [pdf, other]

Sparsity meets correlation in Gaussian sequence model

Authors: Subhodh Kotekal, Chao Gao

Abstract: We study estimation of an $s$-sparse signal in the $p$-dimensional Gaussian sequence model with equicorrelated observations and derive the minimax rate. A new phenomenon emerges from correlation, namely the rate scales with respect to $p-2s$ and exhibits a phase transition at $p-2s \asymp \sqrt{p}$. Correlation is shown to be a blessing provided it is sufficiently strong, and the critical correlation level exhibits a delicate dependence on the sparsity level. Due to correlation, the minimax rate is driven by two subproblems: estimation of a linear functional (the average of the signal) and estimation of the signal's $(p-1)$-dimensional projection onto the orthogonal subspace. The high-dimensional projection is estimated via sparse regression and the linear functional is cast as a robust location estimation problem. Existing robust estimators turn out to be suboptimal, and we show a kernel mode estimator with a widening bandwidth exploits the Gaussian character of the data to achieve the optimal estimation rate.

Submitted 14 December, 2023; originally announced December 2023.

arXiv:2310.04606 [pdf, ps, other]

Robust Transfer Learning with Unreliable Source Data

Authors: Jianqing Fan, Cheng Gao, Jason M. Klusowski

Abstract: This paper addresses challenges in robust transfer learning stemming from ambiguity in Bayes classifiers and weak transferable signals between the target and source distribution. We introduce a novel quantity called the "ambiguity level" that measures the discrepancy between the target and source regression functions, propose a simple transfer learning procedure, and establish a general theorem that shows how this new quantity is related to the transferability of learning in terms of risk improvements. Our proposed "Transfer Around Boundary" (TAB) model, with a threshold balancing the performance of target and source data, is shown to be both efficient and robust, improving classification while avoiding negative transfer. Moreover, we demonstrate the effectiveness of the TAB model on non-parametric classification and logistic regression tasks, achieving upper bounds which are optimal up to logarithmic factors. Simulation studies lend further support to the effectiveness of TAB. We also provide simple approaches to bound the excess misclassification error without the need for specialized knowledge in transfer learning.

Submitted 6 October, 2023; originally announced October 2023.

Comments: 86 pages, 4 figures

arXiv:2309.07273 [pdf]

Real Effect or Bias? Best Practices for Evaluating the Robustness of Real-World Evidence through Quantitative Sensitivity Analysis for Unmeasured Confounding

Authors: Douglas Faries, Chenyin Gao, Xiang Zhang, Chad Hazlett, James Stamey, Shu Yang, Peng Ding, Mingyang Shan, Kristin Sheffield, Nancy Dreyer

Abstract: The assumption of no unmeasured confounders is a critical but unverifiable assumption required for causal inference, yet quantitative sensitivity analyses to assess robustness of real-world evidence remain underutilized. The lack of use is likely in part due to complexity of implementation and the often specific and restrictive data requirements for application of each method. With the advent of sensitivity analysis methods that are broadly applicable in that they do not require identification of a specific unmeasured confounder, along with publicly available code for implementation, roadblocks toward broader use are decreasing. To spur greater application, here we present best practice guidance to address the potential for unmeasured confounding at both the design and analysis stages, including a set of framing questions and an analytic toolbox for researchers. The questions at the design stage guide the researcher through steps evaluating the potential robustness of the design while encouraging gathering of additional data to reduce uncertainty due to potential confounding. At the analysis stage, the questions guide researchers in quantifying the robustness of the observed result, providing a clearer indication of the robustness of their conclusions. We demonstrate the application of the guidance using simulated data based on a real-world fibromyalgia study, applying multiple methods from our analytic toolbox for illustration purposes.

Submitted 13 September, 2023; originally announced September 2023.

Comments: 16 pages which includes 5 figures

MSC Class: Primary 62

arXiv:2308.15728 [pdf, ps, other]

Computational Lower Bounds for Graphon Estimation via Low-degree Polynomials

Authors: Yuetian Luo, Chao Gao

Abstract: Graphon estimation has been one of the most fundamental problems in network analysis and has received considerable attention in the past decade. From the statistical perspective, the minimax error rate of graphon estimation has been established by Gao et al. (2015) for both stochastic block model (SBM) and nonparametric graphon estimation. The statistically optimal estimators are based on constrained least squares and have computational complexity exponential in the dimension. From the computational perspective, the best-known polynomial-time estimator is based on universal singular value thresholding (USVT), but it can only achieve a much slower estimation error rate than the minimax one. The computational optimality of the USVT or the existence of a computational barrier in graphon estimation has been a long-standing open problem. In this work, we provide rigorous evidence for the computational barrier in graphon estimation via low-degree polynomials. Specifically, in SBM graphon estimation, we show that for low-degree polynomial estimators, their estimation error rates cannot be significantly better than that of the USVT under a wide range of parameter regimes, and in nonparametric graphon estimation, we show low-degree polynomial estimators achieve estimation error rates strictly slower than the minimax rate. Our results are proved based on the recent development of low-degree polynomials by Schramm and Wein (2022), while we overcome a few key challenges in applying it to the general graphon estimation problem. By leveraging our main results, we also provide a computational lower bound on the clustering error for community detection in SBM with a growing number of communities, and this yields a new piece of evidence for the conjectured Kesten-Stigum threshold for efficient community recovery. Finally, we extend our computational lower bounds to sparse graphon estimation and biclustering.

Submitted 20 May, 2024; v1 submitted 29 August, 2023; originally announced August 2023.

Comments: Add low-degree upper bound in v2

arXiv:2307.00227 [pdf, other]

Causal Structure Learning by Using Intersection of Markov Blankets

Authors: Yiran Dong, Chuanhou Gao

Abstract: In this paper, we introduce a novel causal structure learning algorithm called Endogenous and Exogenous Markov Blankets Intersection (EEMBI), which combines the properties of Bayesian networks and Structural Causal Models (SCM). Furthermore, we propose an extended version of EEMBI, namely EEMBI-PC, which integrates the last step of the PC algorithm into EEMBI.

Submitted 1 July, 2023; originally announced July 2023.

Comments: 41 pages, 13 figures

arXiv:2306.16642 [pdf, other]

Integrating Randomized Placebo-Controlled Trial Data with External Controls: A Semiparametric Approach with Selective Borrowing

Authors: Chenyin Gao, Shu Yang, Mingyang Shan, Wenyu Ye, Ilya Lipkovich, Douglas Faries

Abstract: In recent years, real-world external controls (ECs) have grown in popularity as a tool to empower randomized placebo-controlled trials (RPCTs), particularly in rare diseases or cases where balanced randomization is unethical or impractical. However, as ECs are not always comparable to the RPCTs, directly borrowing ECs without scrutiny may heavily bias the treatment effect estimator. Our paper proposes a data-adaptive integrative framework capable of preventing unknown biases of ECs. The adaptive nature is achieved by dynamically sorting out a set of comparable ECs via bias penalization. Our proposed method can simultaneously achieve (a) the semiparametric efficiency bound when the ECs are comparable and (b) selective borrowing that mitigates the impact of the existence of incomparable ECs. Furthermore, we establish statistical guarantees, including consistency, asymptotic distribution, and inference, providing type-I error control and good power. Extensive simulations and two real-data applications show that the proposed method leads to improved performance over the RPCT-only estimator across various bias-generating scenarios.

Submitted 28 June, 2023; originally announced June 2023.

arXiv:2305.17801 [pdf, other]

Pretest estimation in combining probability and non-probability samples

Authors: Chenyin Gao, Shu Yang

Abstract: Multiple heterogeneous data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we develop a unified framework of the test-and-pool approach to general parameter estimation by combining gold-standard probability and non-probability samples. We focus on the case when the study variable is observed in both datasets for estimating the target parameters, and each contains other auxiliary variables. Utilizing the probability design, we conduct a pretest procedure to determine the comparability of the non-probability data with the probability data and decide whether or not to leverage the non-probability data in a pooled analysis. When the probability and non-probability data are comparable, our approach combines both data for efficient estimation. Otherwise, we retain only the probability data for estimation. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure to select the critical tuning parameters that target the smallest mean square error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval that has a good finite-sample coverage property.

Submitted 28 May, 2023; originally announced May 2023.

Comments: Accepted in Electronic Journal of Statistics

arXiv:2304.09398 [pdf, ps, other]

Minimax Signal Detection in Sparse Additive Models

Authors: Subhodh Kotekal, Chao Gao

Abstract: Sparse additive models are an attractive choice in circumstances calling for modelling flexibility in the face of high dimensionality. We study the signal detection problem and establish the minimax separation rate for the detection of a sparse additive signal. Our result is nonasymptotic and applicable to the general case where the univariate component functions belong to a generic reproducing kernel Hilbert space. Unlike the estimation theory, the minimax separation rate reveals a nontrivial interaction between sparsity and the choice of function space. We also investigate adaptation to sparsity and establish an adaptive testing rate for a generic function space; adaptation is possible in some spaces while others impose an unavoidable cost. Finally, adaptation to both sparsity and smoothness is studied in the setting of Sobolev space, and we correct some existing claims in the literature.

Submitted 18 April, 2023; originally announced April 2023.

Comments: 62 pages

arXiv:2304.09010 [pdf, other]

Causal Flow-based Variational Auto-Encoder for Disentangled Causal Representation Learning

Authors: Di Fan, Yannian Kou, Chuanhou Gao

Abstract: Disentangled representation learning aims to learn low-dimensional representations of data, where each dimension corresponds to an underlying generative factor. Currently, Variational Auto-Encoders (VAEs) are widely used for disentangled representation learning, with the majority of methods assuming independence among generative factors. However, in real-world scenarios, generative factors typically exhibit complex causal relationships. We thus design a new VAE-based framework named Disentangled Causal Variational Auto-Encoder (DCVAE), which includes a variant of autoregressive flows known as causal flows, capable of learning effective causal disentangled representations. We provide a theoretical analysis of the disentanglement identifiability of DCVAE, ensuring that our model can effectively learn causal disentangled representations. The performance of DCVAE is evaluated on both synthetic and real-world datasets, demonstrating its outstanding capability in achieving causal disentanglement and performing intervention experiments. Moreover, DCVAE exhibits remarkable performance on downstream tasks and has the potential to learn the true causal structure among factors.

Submitted 8 May, 2024; v1 submitted 18 April, 2023; originally announced April 2023.

Comments: 20 pages, 14 figures

arXiv:2302.04972 [pdf, ps, other]

Differentially Private Optimization for Smooth Nonconvex ERM

Authors: Changyu Gao, Stephen J. Wright

Abstract: We develop simple differentially private optimization algorithms that move along directions of (expected) descent to find an approximate second-order solution for nonconvex ERM. We use line search, mini-batching, and a two-phase strategy to improve the speed and practicality of the algorithm. Numerical experiments demonstrate the effectiveness of these approaches.

Submitted 9 June, 2023; v1 submitted 9 February, 2023; originally announced February 2023.

arXiv:2209.12715 [pdf, other]

Self-supervised Denoising via Low-rank Tensor Approximated Convolutional Neural Network

Authors: Chenyin Gao, Shu Yang, Anru R. Zhang

Abstract: Noise is ubiquitous during image acquisition. Sufficient denoising is often an important first step for image processing. In recent decades, deep neural networks (DNNs) have been widely used for image denoising. Most DNN-based image denoising methods require a large-scale dataset or focus on supervised settings, in which single/pairs of clean images or a set of noisy images are required. This poses a significant burden on the image acquisition process. Moreover, denoisers trained on datasets of limited scale may incur over-fitting. To mitigate these issues, we introduce a new self-supervised framework for image denoising based on the Tucker low-rank tensor approximation. With the proposed design, we are able to characterize our denoiser with fewer parameters and train it based on a single image, which considerably improves the model generalizability and reduces the cost of data acquisition. Extensive experiments on both synthetic and real-world noisy images have been conducted. Empirical results show that our proposed method outperforms existing non-learning-based methods (e.g., low-pass filter, non-local mean) and single-image unsupervised denoisers (e.g., DIP, NN+BM3D) evaluated on both in-sample and out-sample datasets. The proposed method even achieves comparable performance with some supervised methods (e.g., DnCNN).

Submitted 26 September, 2022; originally announced September 2022.

arXiv:2206.01084 [pdf, other]

doi 10.1093/biomet/asad016

Soft calibration for selection bias problems under mixed-effects models

Authors: Chenyin Gao, Shu Yang, Jae Kwang Kim

Abstract: Calibration weighting has been widely used to correct selection biases in non-probability sampling, missing data, and causal inference. The main idea is to calibrate the biased sample to the benchmark by adjusting the subject weights. However, hard calibration can produce enormous weights when an exact calibration is enforced on a large set of extraneous covariates. This article proposes a soft calibration scheme, in which the outcome and the selection indicator follow mixed-effects models. The scheme imposes an exact calibration on the fixed effects and an approximate calibration on the random effects. On the one hand, our soft calibration has an intrinsic connection with best linear unbiased prediction, which results in a more efficient estimation compared to hard calibration. On the other hand, soft calibration weighting estimation can be envisioned as penalized propensity score weight estimation, with the penalty term motivated by the mixed-effects structure. The asymptotic distribution and a valid variance estimator are derived for soft calibration. We demonstrate the superiority of the proposed estimator over other competitors in simulation studies and a real-data application.

Submitted 22 February, 2023; v1 submitted 2 June, 2022; originally announced June 2022.

Comments: Accepted for publication in Biometrika

arXiv:2204.09532 [pdf, other]

Gaussian mixture modeling of nodes in Bayesian network according to maximal parental cliques

Authors: Yiran Dong, Chuanhou Gao

Abstract: This paper uses a Gaussian mixture model instead of a linear Gaussian model to fit the distribution of every node in a Bayesian network. We explain why and how we use Gaussian mixture models in Bayesian networks. We also propose a new method, called the double iteration algorithm, to optimize the mixture model. The double iteration algorithm combines the expectation maximization algorithm and the gradient descent algorithm, and it performs perfectly on the Bayesian network with mixture models. In experiments, we test the Gaussian mixture model and the optimization algorithm on different graphs generated by different structure learning algorithms on real data sets, and give the details of every experiment.

Submitted 16 May, 2022; v1 submitted 20 April, 2022; originally announced April 2022.

Comments: 22 pages 6 figures

arXiv:2202.11276 [pdf, other]

doi 10.1111/rssa.12841

Nearest neighbor ratio imputation with incomplete multi-nomial outcome in survey sampling

Authors: Chenyin Gao, Katherine Jenny Thompson, Shu Yang, Jae Kwang Kim

Abstract: Nonresponse is a common problem in survey sampling. Appropriate treatment can be challenging, especially when dealing with detailed breakdowns of totals. Often, the nearest neighbor imputation method is used to handle such incomplete multinomial data. In this article, we investigate the nearest neighbor ratio imputation estimator, in which auxiliary variables are used to identify the closest donor and the vector of proportions from the donor is applied to the total of the recipient to implement ratio imputation. To estimate the asymptotic variance, we first treat the nearest neighbor ratio imputation as a special case of predictive matching imputation and apply the linearization method of \cite{yang2020asymptotic}. To account for the non-negligible sampling fractions, parametric and generalized additive models are employed to incorporate the smoothness of the imputation estimator, which results in a valid variance estimator. We apply the proposed method to estimate expenditures detail items based on empirical data from the 2018 collection of the Service Annual Survey, conducted by the United States Census Bureau. Our simulation results demonstrate the validity of our proposed estimators and also confirm that the derived variance estimators have good performance even when the sampling fraction is non-negligible.

Submitted 22 February, 2022; originally announced February 2022.

Comments: Accepted for publication in JRSS(A)

arXiv:2111.08493 [pdf, other]

ELBD: Efficient score algorithm for feature selection on latent variables of VAE

Authors: Yiran Dong, Chuanhou Gao

Abstract: In this paper, we develop the notion of evidence lower bound difference (ELBD), based on which an efficient score algorithm is presented to implement feature selection on latent variables of VAE and its variants. Further, we propose weak convergence approximation algorithms to optimize VAE related models through weighing the "more important" latent variables selected and accordingly increasing the evidence lower bound. We discuss two kinds of Gaussian posteriors, mean-field and full-covariance, for latent variables, and make corresponding theoretical analyses to support the effectiveness of the algorithms. A large number of comparative experiments are carried out between our algorithms and 9 other feature selection methods on 7 public datasets to address generative tasks. The results provide experimental evidence of the effectiveness of our algorithms. Finally, we extend ELBD to its generalized version, and apply the latter to classification tasks on 5 new public datasets with satisfactory experimental results.

Submitted 10 October, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

Comments: 16 pages, 7 figures

arXiv:2110.12966 [pdf, ps, other]

Minimax rates for sparse signal detection under correlation

Authors: Subhodh Kotekal, Chao Gao

Abstract: We fully characterize the nonasymptotic minimax separation rate for sparse signal detection in the Gaussian sequence model with $p$ equicorrelated observations, generalizing a result of Collier, Comminges, and Tsybakov. As a consequence of the rate characterization, we find that strong correlation is a blessing, moderate correlation is a curse, and weak correlation is irrelevant. Moreover, the threshold correlation level yielding a blessing exhibits phase transitions at the $\sqrt{p}$ and $p-\sqrt{p}$ sparsity levels. We also establish the emergence of new phase transitions in the minimax separation rate with a subtle dependence on the correlation level. Additionally, we study group structured correlations and derive the minimax separation rate in a model including multiple random effects. The group structure turns out to fundamentally change the detection problem from the equicorrelated case and different phenomena appear in the separation rate.

Submitted 25 October, 2021; originally announced October 2021.

Comments: 74 pages

arXiv:2110.03874 [pdf, other]

Uncertainty quantification in the Bradley-Terry-Luce model

Authors: Chao Gao, Yandi Shen, Anderson Y. Zhang

Abstract: The Bradley-Terry-Luce (BTL) model is a benchmark model for pairwise comparisons between individuals. Despite recent progress on the first-order asymptotics of several popular procedures, the understanding of uncertainty quantification in the BTL model remains largely incomplete, especially when the underlying comparison graph is sparse. In this paper, we fill this gap by focusing on two estimators that have received much recent attention: the maximum likelihood estimator (MLE) and the spectral estimator. Using a unified proof strategy, we derive sharp and uniform non-asymptotic expansions for both estimators in the sparsest possible regime (up to some poly-logarithmic factors) of the underlying comparison graph. These expansions allow us to obtain: (i) finite-dimensional central limit theorems for both estimators; (ii) construction of confidence intervals for individual ranks; (iii) optimal constant of $\ell_2$ estimation, which is achieved by the MLE but not by the spectral estimator. Our proof is based on a self-consistent equation of the second-order remainder vector and a novel leave-two-out analysis.

Submitted 9 August, 2022; v1 submitted 7 October, 2021; originally announced October 2021.

arXiv:2109.13491 [pdf, ps, other]

Optimal Orthogonal Group Synchronization and Rotation Group Synchronization

Authors: Chao Gao, Anderson Y. Zhang

Abstract: We study the statistical estimation problem of orthogonal group synchronization and rotation group synchronization. The model is $Y_{ij} = Z_i^* Z_j^{*T} + σW_{ij}\in\mathbb{R}^{d\times d}$ where $W_{ij}$ is a Gaussian random matrix and $Z_i^*$ is either an orthogonal matrix or a rotation matrix, and each $Y_{ij}$ is observed independently with probability $p$. We analyze an iterative polar decomposition algorithm for the estimation of $Z^*$ and show it has an error of $(1+o(1))\frac{σ^2 d(d-1)}{2np}$ when initialized by spectral methods. A matching minimax lower bound is further established which leads to the optimality of the proposed algorithm as it achieves the exact minimax risk.

Submitted 25 April, 2022; v1 submitted 28 September, 2021; originally announced September 2021.

arXiv:2107.02847 [pdf, other]

Transfer Learning in Information Criteria-based Feature Selection

Authors: Shaohan Chen, Nikolaos V. Sahinidis, Chuanhou Gao

Abstract: This paper investigates the effectiveness of transfer learning based on Mallows' Cp. We propose a procedure that combines transfer learning with Mallows' Cp (TLCp) and prove that it outperforms the conventional Mallows' Cp criterion in terms of accuracy and stability. Our theoretical results indicate that, for any sample size in the target domain, the proposed TLCp estimator performs better than the Cp estimator by the mean squared error (MSE) metric in the case of orthogonal predictors, provided that i) the dissimilarity between the tasks from source domain and target domain is small, and ii) the procedure parameters (complexity penalties) are tuned according to certain explicit rules. Moreover, we show that our transfer learning framework can be extended to other feature selection criteria, such as the Bayesian information criterion. By analyzing the solution of the orthogonalized Cp, we identify an estimator that asymptotically approximates the solution of the Cp criterion in the case of non-orthogonal predictors. Similar results are obtained for the non-orthogonal TLCp. Finally, simulation studies and applications with real data demonstrate the usefulness of the TLCp scheme.

Submitted 29 May, 2022; v1 submitted 6 July, 2021; originally announced July 2021.

Comments: Accepted to the Journal of Machine Learning Research

ACM Class: I.3; I.5

arXiv:2106.15400 [pdf, other]

Online Interaction Detection for Click-Through Rate Prediction

Authors: Qiuqiang Lin, Chuanhou Gao

Abstract: Click-Through Rate prediction aims to predict the ratio of clicks to impressions of a specific link. This is a challenging task since (1) there are usually categorical features, and the inputs will be extremely high-dimensional if one-hot encoding is applied, (2) not only the original features but also their interactions are important, (3) an effective prediction may rely on different features and interactions in different time periods. To overcome these difficulties, we propose a new interaction detection method, named Online Random Intersection Chains. The method, which is based on the idea of frequent itemset mining, detects informative interactions by observing the intersections of randomly chosen samples. The discovered interactions enjoy high interpretability as they can be comprehended as logical expressions. ORIC can be updated every time new data is collected, without being retrained on historical data. What's more, the importance of the historical and latest data can be controlled by a tuning parameter. A framework is designed to deal with the streaming interactions, so almost all existing models for CTR prediction can be applied after interaction detection. Empirical results demonstrate the efficiency and effectiveness of ORIC on three benchmark datasets.

Submitted 27 June, 2021; originally announced June 2021.

Comments: 11 pages, 4 figures, 1 supplement

arXiv:2104.04714 [pdf, other]

Random Intersection Chains

Authors: Qiuqiang Lin, Chuanhou Gao

Abstract: Interactions between several features sometimes play an important role in prediction tasks. But taking all the interactions into consideration will lead to an extremely heavy computational burden. For categorical features, the situation is more complicated since the input will be extremely high-dimensional and sparse if one-hot encoding is applied. Inspired by association rule mining, we propose a method that selects interactions of categorical features, called Random Intersection Chains. It uses random intersections to detect frequent patterns, then selects the most meaningful ones among them. At first a number of chains are generated, in which each node is the intersection of the previous node and a randomly chosen observation. The frequency of patterns in the tail nodes is estimated by maximum likelihood estimation, then the patterns with largest estimated frequency are selected. After that, their confidence is calculated by Bayes' theorem. The most confident patterns are finally returned by Random Intersection Chains. We show that if the number and length of chains are appropriately chosen, the patterns in the tail nodes are indeed the most frequent ones in the data set. We analyze the computational complexity of the proposed algorithm and prove the convergence of the estimators. The results of a series of experiments verify the efficiency and effectiveness of the algorithm.

Submitted 10 April, 2021; originally announced April 2021.

arXiv:2101.08421 [pdf, other]

Optimal Full Ranking from Pairwise Comparisons

Authors: Pinhan Chen, Chao Gao, Anderson Y. Zhang

Abstract: We consider the problem of ranking $n$ players from partial pairwise comparison data under the Bradley-Terry-Luce model. For the first time in the literature, the minimax rate of this ranking problem is derived with respect to the Kendall's tau distance that measures the difference between two rank vectors by counting the number of inversions. The minimax rate of ranking exhibits a transition between an exponential rate and a polynomial rate depending on the magnitude of the signal-to-noise ratio of the problem. To the best of our knowledge, this phenomenon is unique to full ranking and has not been seen in any other statistical estimation problem. To achieve the minimax rate, we propose a divide-and-conquer ranking algorithm that first divides the $n$ players into groups of similar skills and then computes local MLE within each group. The optimality of the proposed algorithm is established by a careful approximate independence argument between the two steps.

Submitted 20 January, 2021; originally announced January 2021.

arXiv:2101.02347 [pdf, other]

SDP Achieves Exact Minimax Optimality in Phase Synchronization

Authors: Chao Gao, Anderson Y. Zhang

Abstract: We study the phase synchronization problem with noisy measurements $Y=z^*z^{*H}+σW\in\mathbb{C}^{n\times n}$, where $z^*$ is an $n$-dimensional complex unit-modulus vector and $W$ is a complex-valued Gaussian random matrix. It is assumed that each entry $Y_{jk}$ is observed with probability $p$. We prove that an SDP relaxation of the MLE achieves the error bound $(1+o(1))\frac{σ^2}{2np}$ under a normalized squared $\ell_2$ loss. This result matches the minimax lower bound of the problem, and even the leading constant is sharp. The analysis of the SDP is based on an equivalent non-convex programming whose solution can be characterized as a fixed point of the generalized power iteration lifted to a higher dimensional space. This viewpoint unifies the proofs of the statistical optimality of three different methods: MLE, SDP, and generalized power method. The technique is also applied to the analysis of the SDP for $\mathbb{Z}_2$ synchronization, and we achieve the minimax optimal error $\exp\left(-(1-o(1))\frac{np}{2σ^2}\right)$ with a sharp constant in the exponent.

Submitted 17 March, 2022; v1 submitted 6 January, 2021; originally announced January 2021.

arXiv:2009.03969 [pdf, ps, other]

Convergence Rates of Empirical Bayes Posterior Distributions: A Variational Perspective

Authors: Fengshuo Zhang, Chao Gao

Abstract: We study the convergence rates of empirical Bayes posterior distributions for nonparametric and high-dimensional inference. We show that as long as the hyperparameter set is discrete, the empirical Bayes posterior distribution induced by the maximum marginal likelihood estimator can be regarded as a variational approximation to a hierarchical Bayes posterior distribution. This connection between empirical Bayes and variational Bayes allows us to leverage the recent results in the variational Bayes literature, and directly obtain the convergence rates of empirical Bayes posterior distributions from a variational perspective. For a more general hyperparameter set that is not necessarily discrete, we introduce a new technique called "prior decomposition" to deal with prior distributions that can be written as convex combinations of probability measures whose supports are low-dimensional subspaces. This leads to generalized versions of the classical "prior mass and testing" conditions for the convergence rates of empirical Bayes. Our theory is applied to a number of statistical estimation problems including nonparametric density estimation and sparse linear regression.

Submitted 8 September, 2020; originally announced September 2020.

arXiv:2009.02528 [pdf, other]

Structured Sparsity Modeling for Improved Multivariate Statistical Analysis based Fault Isolation

Authors: Wei Chen, Jiusun Zeng, Xiaobin Xu, Shihua Luo, Chuanhou Gao

Abstract: In order to improve the fault diagnosis capability of multivariate statistical methods, this article introduces a fault isolation framework based on structured sparsity modeling. The developed method relies on the reconstruction based contribution analysis and the process structure information can be incorporated into the reconstruction objective function in the form of structured sparsity regularization terms. The structured sparsity terms allow selection of fault variables over structures like blocks or networks of process variables, hence more accurate fault isolation can be achieved. Four structured sparsity terms corresponding to different kinds of process information are considered, namely, partially known sparse support, block sparsity, clustered sparsity and tree-structured sparsity. The optimization problems involving the structured sparsity terms can be solved using the Alternating Direction Method of Multipliers (ADMM) algorithm, which is fast and efficient. Through a simulation example and an application study to a coal-fired power plant, it is verified that the proposed method can better isolate faulty variables by incorporating process structure information.

Submitted 21 December, 2020; v1 submitted 5 September, 2020; originally announced September 2020.

Comments: 36 pages, 12 figures

arXiv:2006.16485 [pdf, other]

Partial Recovery for Top-$k$ Ranking: Optimality of MLE and Sub-Optimality of Spectral Method

Authors: Pinhan Chen, Chao Gao, Anderson Y. Zhang

Abstract: Given partially observed pairwise comparison data generated by the Bradley-Terry-Luce (BTL) model, we study the problem of top-$k$ ranking. That is, to optimally identify the set of top-$k$ players. We derive the minimax rate with respect to a normalized Hamming loss. This provides the first result in the literature that characterizes the partial recovery error in terms of the proportion of mistakes for top-$k$ ranking. We also derive the optimal signal to noise ratio condition for the exact recovery of the top-$k$ set. The maximum likelihood estimator (MLE) is shown to achieve both optimal partial recovery and optimal exact recovery. On the other hand, we show another popular algorithm, the spectral method, is in general sub-optimal. Our results complement the recent work by Chen et al. (2019) that shows both the MLE and the spectral method achieve the optimal sample complexity for exact recovery. It turns out the leading constants of the sample complexity are different for the two algorithms. Another contribution that may be of independent interest is the analysis of the MLE without any penalty or regularization for the BTL model. This closes an important gap between theory and practice in the literature of ranking.

Submitted 15 July, 2021; v1 submitted 29 June, 2020; originally announced June 2020.

arXiv:2005.10579 [pdf, other]

Elastic Integrative Analysis of Randomized Trial and Real-World Data for Treatment Heterogeneity Estimation

Authors: Shu Yang, Chenyin Gao, Donglin Zeng, Xiaofei Wang

Abstract: We propose a test-based elastic integrative analysis of the randomized trial and real-world data to estimate treatment effect heterogeneity with a vector of known effect modifiers. When the real-world data are not subject to bias, our approach combines the trial and real-world data for efficient estimation. Utilizing the trial design, we construct a test to decide whether or not to use real-world data. We characterize the asymptotic distribution of the test-based estimator under local alternatives. We provide a data-adaptive procedure to select the test threshold that promises the smallest mean square error and an elastic confidence interval with a good finite-sample coverage property.

Submitted 29 November, 2022; v1 submitted 21 May, 2020; originally announced May 2020.

arXiv:2005.09912 [pdf, other]

Model Repair: Robust Recovery of Over-Parameterized Statistical Models

Authors: Chao Gao, John Lafferty

Abstract: A new type of robust estimation problem is introduced where the goal is to recover a statistical model that has been corrupted after it has been estimated from data. Methods are proposed for "repairing" the model using only the design and not the response values used to fit the model in a supervised learning setting. Theory is developed which reveals that two important ingredients are necessary for model repair---the statistical model must be over-parameterized, and the estimator must incorporate redundancy. In particular, estimators based on stochastic gradient descent are seen to be well suited to model repair, but sparse estimators are not in general repairable. After formulating the problem and establishing a key technical lemma related to robust estimation, a series of results are presented for repair of over-parameterized linear models, random feature models, and artificial neural networks. Simulation studies are presented that corroborate and illustrate the theoretical findings.

Submitted 20 May, 2020; originally announced May 2020.

arXiv:2004.12908 [pdf, other]

A Simple Lifelong Learning Approach

Authors: Joshua T. Vogelstein, Jayanta Dey, Hayden S. Helm, Will LeVine, Ronak D. Mehta, Tyler M. Tomita, Haoyin Xu, Ali Geisa, Qingyang Wang, Gido M. van de Ven, Chenyu Gao, Weiwei Yang, Bryan Tower, Jonathan Larson, Christopher M. White, Carey E. Priebe

Abstract: In lifelong learning, data are used to improve performance not only on the present task, but also on past and future (unencountered) tasks. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance on old tasks given new tasks. But striving to avoid forgetting sets the goal unnecessarily low. The goal of lifelong learning should be to use data to improve performance on both future tasks (forward transfer) and past tasks (backward transfer). In this paper, we show that a simple approach -- representation ensembling -- demonstrates both forward and backward transfer in a variety of simulated and benchmark data scenarios, including tabular, vision (CIFAR-100, 5-dataset, Split Mini-Imagenet, and Food1k), and speech (spoken digit), in contrast to various reference algorithms, which typically failed to transfer either forward or backward, or both. Moreover, our proposed approach can flexibly operate with or without a computational budget.

Submitted 11 June, 2024; v1 submitted 27 April, 2020; originally announced April 2020.

arXiv:2001.08290 [pdf, other]

Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

Authors: Haoran Miao, Gaofeng Cheng, Changfeng Gao, Pengyuan Zhang, Yonghong Yan

Abstract: Recently, Transformer has gained success in the automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the Transformer-based online CTC/attention E2E ASR architecture, which contains the chunk self-attention encoder (chunk-SAE) and the monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Secondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support the online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.

Submitted 11 February, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: Accepted by ICASSP 2020

arXiv:2001.05486 [pdf, other]

doi 10.1088/2632-2153/abab62

i-flow: High-dimensional Integration and Sampling with Normalizing Flows

Authors: Christina Gao, Joshua Isaacson, Claudius Krause

Abstract: In many fields of science, high-dimensional integration is required. Numerical methods have been developed to evaluate these complex integrals. We introduce the code i-flow, a python package that performs high-dimensional numerical integration utilizing normalizing flows. Normalizing flows are machine-learned, bijective mappings between two distributions. i-flow can also be used to sample random points according to complicated distributions in high dimensions. We compare i-flow to other algorithms for high-dimensional numerical integration and show that i-flow outperforms them for high dimensional correlated integrals. The i-flow code is publicly available on gitlab at https://gitlab.com/i-flow/i-flow.

Submitted 17 August, 2020; v1 submitted 15 January, 2020; originally announced January 2020.

Comments: 21 pages, 5 figures, 4 tables; v2: improved presentation and discussion, matches published version. Mach. Learn.: Sci. Technol (2020)

Report number: FERMILAB-PUB-20-010-T

arXiv:1911.05121 [pdf, other]

Detecting Patterns of Physiological Response to Hemodynamic Stress via Unsupervised Deep Learning

Authors: Chufan Gao, Fabian Falck, Mononito Goswami, Anthony Wertz, Michael R. Pinsky, Artur Dubrawski

Abstract: Monitoring physiological responses to hemodynamic stress can help in determining appropriate treatment and ensuring good patient outcomes. Physicians' intuition suggests that the human body has a number of physiological response patterns to hemorrhage which escalate as blood loss continues; however, the exact etiology and phenotypes of such responses are not well known or are understood only at a coarse level. Although previous research has shown that machine learning models can perform well in hemorrhage detection and survival prediction, it is unclear whether machine learning could help to identify and characterize the underlying physiological responses in raw vital sign data. We approach this problem by first transforming the high-dimensional vital sign time series into a tractable, lower-dimensional latent space using a dilated, causal convolutional encoder model trained purely unsupervised. Second, we identify informative clusters in the embeddings. By analyzing the clusters of latent embeddings and visualizing them over time, we hypothesize that the clusters correspond to the physiological response patterns that match physicians' intuition. Furthermore, we attempt to evaluate the latent embeddings using a variety of methods, such as predicting the cluster labels using explainable features.

Submitted 12 November, 2019; originally announced November 2019.

Comments: Machine Learning for Health (ML4H) at NeurIPS 2019 - Extended Abstract

arXiv:1911.01018 [pdf, ps, other]

Iterative Algorithm for Discrete Structure Recovery

Authors: Chao Gao, Anderson Y. Zhang

Abstract: We propose a general modeling and algorithmic framework for discrete structure recovery that can be applied to a wide range of problems. Under this framework, we are able to study the recovery of clustering labels, ranks of players, signs of regression coefficients, cyclic shifts, and even group elements from a unified perspective. A simple iterative algorithm is proposed for discrete structure recovery, which generalizes methods including Lloyd's algorithm and the power method. A linear convergence result for the proposed algorithm is established in this paper under appropriate abstract conditions on stochastic errors and initialization. We illustrate our general theory by applying it on several representative problems: (1) clustering in Gaussian mixture model, (2) approximate ranking, (3) sign recovery in compressed sensing, (4) multireference alignment, and (5) group synchronization, and show that minimax rate is achieved in each case.

Submitted 27 September, 2020; v1 submitted 3 November, 2019; originally announced November 2019.

arXiv:1910.12797 [pdf, other]

Testing Equivalence of Clustering

Authors: Chao Gao, Zongming Ma

Abstract: In this paper, we test whether two datasets share a common clustering structure. As a leading example, we focus on comparing clustering structures in two independent random samples from two mixtures of multivariate normal distributions. Mean parameters of these normal distributions are treated as potentially unknown nuisance parameters and are allowed to differ. Assuming knowledge of mean parameters, we first determine the phase diagram of the testing problem over the entire range of signal-to-noise ratios by providing both lower bounds and tests that achieve them. When nuisance parameters are unknown, we propose tests that achieve the detection boundary adaptively as long as ambient dimensions of the datasets grow at a sub-linear rate with the sample size.

Submitted 17 November, 2022; v1 submitted 28 October, 2019; originally announced October 2019.

arXiv:1909.04817 [pdf, other]

doi 10.3233/JSA-200450

Home Sweet Home: Quantifying Home Court Advantages For NCAA Basketball Statistics

Authors: Matthew van Bommel, Luke Bornn, Peter Chow-White, Chuancong Gao

Abstract: Box score statistics are the baseline measures of performance for National Collegiate Athletic Association (NCAA) basketball. Between the 2011-2012 and 2015-2016 seasons, NCAA teams performed better at home compared to on the road in nearly all box score statistics across both genders and all three divisions. Using box score data from over 100,000 games spanning the three divisions for both women and men, we examine the factors underlying this discrepancy. The prevalence of neutral location games in the NCAA provides an additional angle through which to examine the gaps in box score statistic performance, which we believe has been underutilized in existing literature. We also estimate a regression model to quantify the home court advantages for box score statistics after controlling for other factors such as number of possessions, and team strength. Additionally, we examine the biases of scorekeepers and referees. We present evidence that scorekeepers tend to have greater home team biases when observing men compared to women, higher divisions compared to lower divisions, and stronger teams compared to weaker teams. Finally, we present statistically significant results indicating referee decisions are impacted by attendance, with larger crowds resulting in greater bias in favor of the home team.

Submitted 8 May, 2021; v1 submitted 10 September, 2019; originally announced September 2019.

Comments: 24 pages, 4 figures

Journal ref: Journal of Sports Analytics, vol. 7, no. 1, pp. 25-36, 2021

arXiv:1908.03682 [pdf]

Natural-Logarithm-Rectified Activation Function in Convolutional Neural Networks

Authors: Yang Liu, Jianpeng Zhang, Chao Gao, Jinghua Qu, Lixin Ji

Abstract: Activation functions play a key role in providing remarkable performance in deep neural networks, and the rectified linear unit (ReLU) is one of the most widely used activation functions. Various new activation functions and improvements on ReLU have been proposed, but each carry performance drawbacks. In this paper, we propose an improved activation function, which we name the natural-logarithm-rectified linear unit (NLReLU). This activation function uses the parametric natural logarithmic transform to improve ReLU and is simply defined as. NLReLU not only retains the sparse activation characteristic of ReLU, but it also alleviates the "dying ReLU" and vanishing gradient problems to some extent. It also reduces the bias shift effect and heteroscedasticity of neuron data distributions among network layers in order to accelerate the learning process. The proposed method was verified across ten convolutional neural networks with different depths for two essential datasets. Experiments illustrate that convolutional neural networks with NLReLU exhibit higher accuracy than those with ReLU, and that NLReLU is comparable to other well-known activation functions. NLReLU provides 0.16% and 2.04% higher classification accuracy on average compared to ReLU when used in shallow convolutional neural networks with the MNIST and CIFAR-10 datasets, respectively. The average accuracy of deep convolutional neural networks with NLReLU is 1.35% higher on average with the CIFAR-10 dataset.

Submitted 24 August, 2019; v1 submitted 9 August, 2019; originally announced August 2019.

arXiv:1907.11788 [pdf, other]

On Hard Exploration for Reinforcement Learning: a Case Study in Pommerman

Authors: Chao Gao, Bilal Kartal, Pablo Hernandez-Leal, Matthew E. Taylor

Abstract: How to best explore in domains with sparse, delayed, and deceptive rewards is an important open problem for reinforcement learning (RL). This paper considers one such domain, the recently-proposed multi-agent benchmark of Pommerman. This domain is very challenging for RL --- past work has shown that model-free RL algorithms fail to achieve significant learning without artificially reducing the environment's complexity. In this paper, we illuminate reasons behind this failure by providing a thorough analysis on the hardness of random exploration in Pommerman. While model-free random exploration is typically futile, we develop a model-based automatic reasoning module that can be used for safer exploration by pruning actions that will surely lead the agent to death. We empirically demonstrate that this module can significantly improve learning.

Submitted 26 July, 2019; originally announced July 2019.

Comments: AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE) 2019

arXiv:1907.10012 [pdf, other]

Minimax rates in sparse, high-dimensional changepoint detection

Authors: Haoyang Liu, Chao Gao, Richard J. Samworth

Abstract: We study the detection of a sparse change in a high-dimensional mean vector as a minimax testing problem. Our first main contribution is to derive the exact minimax testing rate across all parameter regimes for $n$ independent, $p$-variate Gaussian observations. This rate exhibits a phase transition when the sparsity level is of order $\sqrt{p \log \log (8n)}$ and has a very delicate dependence on the sample size: in a certain sparsity regime it involves a triple iterated logarithmic factor in $n$. Further, in a dense asymptotic regime, we identify the sharp leading constant, while in the corresponding sparse asymptotic regime, this constant is determined to within a factor of $\sqrt{2}$. Extensions that cover spatial and temporal dependence, primarily in the dense case, are also provided.

Submitted 17 November, 2020; v1 submitted 23 July, 2019; originally announced July 2019.

arXiv:1905.11596 [pdf, other]

doi 10.1145/3292500.3330880

LambdaOpt: Learn to Regularize Recommender Models in Finer Levels

Authors: Yihong Chen, Bei Chen, Xiangnan He, Chen Gao, Yong Li, Jian-Guang Lou, Yue Wang

Abstract: Recommendation models mainly deal with categorical variables, such as user/item ID and attributes. Besides the high-cardinality issue, the interactions among such categorical variables are usually long-tailed, with the head made up of highly frequent values and a long tail of rare ones. This phenomenon results in the data sparsity issue, making it essential to regularize the models to ensure generalization. The common practice is to employ grid search to manually tune regularization hyperparameters based on the validation data. However, it requires non-trivial efforts and large computation resources to search the whole candidate space; even so, it may not lead to the optimal choice, for which different parameters should have different regularization strengths. In this paper, we propose a hyperparameter optimization method, LambdaOpt, which automatically and adaptively enforces regularization during training. Specifically, it updates the regularization coefficients based on the performance of validation data. With LambdaOpt, the notorious tuning of regularization hyperparameters can be avoided; more importantly, it allows fine-grained regularization (i.e. each parameter can have an individualized regularization coefficient), leading to better generalized models. We show how to employ LambdaOpt on matrix factorization, a classical model that is representative of a large family of recommender models. Extensive experiments on two public benchmarks demonstrate the superiority of our method in boosting the performance of top-K recommendation.

Submitted 27 May, 2019; originally announced May 2019.

Comments: Accepted by KDD 2019

arXiv:1904.03779 [pdf, ps, other]

Cluster Developing 1-Bit Matrix Completion

Authors: Chengkun Zhang, Junbin Gao, Stephen Lu

Abstract: Matrix completion has a long-time history of usage as the core technique of recommender systems. In particular, 1-bit matrix completion, which considers the prediction as a ``Recommended'' or ``Not Recommended'' question, has proved its significance and validity in the field. However, while customers and products aggregate into interacted clusters, state-of-the-art model-based 1-bit recommender systems do not take the consideration of grouping bias. To tackle the gap, this paper introduced Group-Specific 1-bit Matrix Completion (GS1MC) by first-time consolidating group-specific effects into 1-bit recommender systems under the low-rank latent variable framework. Additionally, to empower GS1MC even when grouping information is unobtainable, Cluster Developing Matrix Completion (CDMC) was proposed by integrating the sparse subspace clustering technique into GS1MC. Namely, CDMC allows clustering users/items and to leverage their group effects into matrix completion at the same time. Experiments on synthetic and real-world data show that GS1MC outperforms the current 1-bit matrix completion methods. Meanwhile, it is compelling that CDMC can successfully capture items' genre features only based on sparse binary user-item interactive data. Notably, GS1MC provides a new insight to incorporate and evaluate the efficacy of clustering methods while CDMC can be served as a new tool to explore unrevealed social behavior or market phenomenon.

Submitted 7 April, 2019; originally announced April 2019.

Comments: 16 Pages

arXiv:1903.01944 [pdf, other]

Generative Adversarial Nets for Robust Scatter Estimation: A Proper Scoring Rule Perspective

Authors: Chao Gao, Yuan Yao, Weizhi Zhu

Abstract: Robust scatter estimation is a fundamental task in statistics. The recent discovery on the connection between robust estimation and generative adversarial nets (GANs) by Gao et al. (2018) suggests that it is possible to compute depth-like robust estimators using similar techniques that optimize GANs. In this paper, we introduce a general learning via classification framework based on the notion of proper scoring rules. This framework allows us to understand both matrix depth function and various GANs through the lens of variational approximations of $f$-divergences induced by proper scoring rules. We then propose a new class of robust scatter estimators in this framework by carefully constructing discriminators with appropriate neural network structures. These estimators are proved to achieve the minimax rate of scatter estimation under Huber's contamination model. Our numerical results demonstrate its good performance under various settings against competitors in the literature.

Submitted 5 March, 2019; originally announced March 2019.

arXiv:1902.03316 [pdf, other]

Bayesian Model Selection with Graph Structured Sparsity

Authors: Youngseok Kim, Chao Gao

Abstract: We propose a general algorithmic framework for Bayesian model selection. A spike-and-slab Laplacian prior is introduced to model the underlying structural assumption. Using the notion of effective resistance, we derive an EM-type algorithm with closed-form iterations to efficiently explore possible candidates for Bayesian model selection. The deterministic nature of the proposed algorithm makes it more scalable to large-scale and high-dimensional data sets compared with existing stochastic search algorithms. When applied to sparse linear regression, our framework recovers the EMVS algorithm [Rockova and George, 2014] as a special case. We also discuss extensions of our framework using tools from graph algebra to incorporate complex Bayesian models such as biclustering and submatrix localization. Extensive simulation studies and real data applications are conducted to demonstrate the superior performance of our methods over its frequentist competitors such as $\ell_0$ or $\ell_1$ penalization.

Submitted 23 May, 2020; v1 submitted 8 February, 2019; originally announced February 2019.

Journal ref: Journal of Machine Learning Research 21(109):1-61, 2020

arXiv:1811.06055 [pdf, other]

Minimax Rates in Network Analysis: Graphon Estimation, Community Detection and Hypothesis Testing

Authors: Chao Gao, Zongming Ma

Abstract: This paper surveys some recent developments in fundamental limits and optimal algorithms for network analysis. We focus on minimax optimal rates in three fundamental problems of network analysis: graphon estimation, community detection, and hypothesis testing. For each problem, we review state-of-the-art results in the literature followed by general principles behind the optimal procedures that lead to minimax estimation and testing. This allows us to connect problems in network analysis to other statistical inference problems from a general perspective.

Submitted 14 February, 2019; v1 submitted 14 November, 2018; originally announced November 2018.

arXiv:1811.02612 [pdf, other]

Mixing Time of Metropolis-Hastings for Bayesian Community Detection

Authors: Bumeng Zhuo, Chao Gao

Abstract: We study the computational complexity of a Metropolis-Hastings algorithm for Bayesian community detection. We first establish a posterior strong consistency result for a natural prior distribution on stochastic block models under the optimal signal-to-noise ratio condition in the literature. We then give a set of conditions that guarantee rapid mixing of a simple Metropolis-Hastings algorithm. The mixing time analysis is based on a careful study of posterior ratios and a canonical path argument to control the spectral gap of the Markov chain.

Submitted 6 November, 2018; originally announced November 2018.

arXiv:1810.02030 [pdf, other]

Robust Estimation and Generative Adversarial Nets

Authors: Chao Gao, Jiyi Liu, Yuan Yao, Weizhi Zhu

Abstract: Robust estimation under Huber's $ε$-contamination model has become an important topic in statistics and theoretical computer science. Statistically optimal procedures such as Tukey's median and other estimators based on depth functions are impractical because of their computational intractability. In this paper, we establish an intriguing connection between $f$-GANs and various depth functions through the lens of $f$-Learning. Similar to the derivation of $f$-GANs, we show that these depth functions that lead to statistically optimal robust estimators can all be viewed as variational lower bounds of the total variation distance in the framework of $f$-Learning. This connection opens the door of computing robust estimators using tools developed for training GANs. In particular, we show in both theory and experiments that some appropriate structures of discriminator networks with hidden layers in GANs lead to statistically optimal robust location estimators for both Gaussian distribution and general elliptical distributions where first moment may not exist.

Submitted 25 February, 2019; v1 submitted 3 October, 2018; originally announced October 2018.

arXiv:1809.01571 [pdf, ps, other]

Knowledge Integrated Classifier Design Based on Utility Optimization

Authors: Shaohan Chen, Chuanhou Gao

Abstract: This paper proposes a systematic framework to design a classification model that yields a classifier which optimizes a utility function based on prior knowledge. Specifically, as the data size grows, we prove that the produced classifier asymptotically converges to the optimal classifier, an extended version of the Bayes rule, which maximizes the utility function. Therefore, we provide a meaningful theoretical interpretation for modeling with the knowledge incorporated. Our knowledge incorporation method allows domain experts to guide the classifier towards correctly classifying data that they think to be more significant.

Submitted 5 September, 2018; originally announced September 2018.

arXiv:1805.12507 [pdf, other]

Efficacy of regularized multi-task learning based on SVM models

Authors: Shaohan Chen, Zhou Fang, Sijie Lu, Chuanhou Gao

Abstract: This paper investigates the efficacy of a regularized multi-task learning (MTL) framework based on SVM (M-SVM) to answer whether MTL always provides reliable results and how MTL outperforms independent learning. We first find that M-SVM is Bayes risk consistent in the limit of large sample size. This implies that despite the task dissimilarities, M-SVM always produces a reliable decision rule for each task in terms of misclassification error when the data size is large enough. Furthermore, we find that the task-interaction vanishes as the data size goes to infinity, and the convergence rates of M-SVM and its single-task counterpart have the same upper bound. The former suggests that M-SVM cannot improve the limit classifier's performance; based on the latter, we conjecture that the optimal convergence rate is not improved when the task number is fixed. As a novel insight of MTL, our theoretical and experimental results achieved an excellent agreement that the benefit of the MTL methods lies in the improvement of the pre-convergence-rate factor (PCR, to be denoted in Section III) rather than the convergence rate. Moreover, this improvement of PCR factors is more significant when the data size is small.

Submitted 20 February, 2022; v1 submitted 31 May, 2018; originally announced May 2018.

Comments: 12 pages, 4 figures

Showing 1–50 of 74 results for author: Gao, C