Search | arXiv e-print repository

arXiv:2406.20088 [pdf, other]

Minimax And Adaptive Transfer Learning for Nonparametric Classification under Distributed Differential Privacy Constraints

Authors: Arnab Auddy, T. Tony Cai, Abhinav Chakraborty

Abstract: This paper considers minimax and adaptive transfer learning for nonparametric classification under the posterior drift model with distributed differential privacy constraints. Our study is conducted within a heterogeneous framework, encompassing diverse sample sizes, varying privacy parameters, and data heterogeneity across different servers. We first establish the minimax misclassification rate, precisely characterizing the effects of privacy constraints, source samples, and target samples on classification accuracy. The results reveal interesting phase transition phenomena and highlight the intricate trade-offs between preserving privacy and achieving classification accuracy. We then develop a data-driven adaptive classifier that achieves the optimal rate within a logarithmic factor across a large collection of parameter spaces while satisfying the same set of differential privacy constraints. Simulation studies and real-world data applications further elucidate the theoretical analysis with numerical results.

Submitted 28 June, 2024; originally announced June 2024.

MSC Class: 62G08; 62G20

arXiv:2406.06755 [pdf, other]

Optimal Federated Learning for Nonparametric Regression with Heterogeneous Distributed Differential Privacy Constraints

Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

Abstract: This paper studies federated learning for nonparametric regression in the context of distributed samples across different servers, each adhering to distinct differential privacy constraints. The setting we consider is heterogeneous, encompassing both varying sample sizes and differential privacy constraints across servers. Within this framework, both global and pointwise estimation are considered, and optimal rates of convergence over the Besov spaces are established. Distributed privacy-preserving estimators are proposed and their risk properties are investigated. Matching minimax lower bounds, up to a logarithmic factor, are established for both global and pointwise estimation. Together, these findings shed light on the tradeoff between statistical accuracy and privacy preservation. In particular, we characterize the compromise not only in terms of the privacy budget but also concerning the loss incurred by distributing data within the privacy framework as a whole. This insight captures the folklore wisdom that it is easier to retain privacy in larger samples, and explores the differences between pointwise and global estimation under distributed privacy constraints.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 49 pages total, consisting of an article (24 pages) and a supplement (25 pages)

MSC Class: 62G08; 62C20; 68P27; 62F30

arXiv:2406.06749 [pdf, other]

Federated Nonparametric Hypothesis Testing with Differential Privacy Constraints: Optimal Rates and Adaptive Tests

Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

Abstract: Federated learning has attracted significant recent attention due to its applicability across a wide range of settings where data is collected and analyzed across disparate locations. In this paper, we study federated nonparametric goodness-of-fit testing in the white-noise-with-drift model under distributed differential privacy (DP) constraints. We first establish matching lower and upper bounds, up to a logarithmic factor, on the minimax separation rate. This optimal rate serves as a benchmark for the difficulty of the testing problem, factoring in model characteristics such as the number of observations, noise level, and regularity of the signal class, along with the strictness of the $(ε,δ)$-DP requirement. The results demonstrate interesting and novel phase transition phenomena. Furthermore, the results reveal an interesting phenomenon that distributed one-shot protocols with access to shared randomness outperform those without access to shared randomness. We also construct a data-driven testing procedure that possesses the ability to adapt to an unknown regularity parameter over a large collection of function classes with minimal additional cost, all while maintaining adherence to the same set of DP constraints.

Submitted 10 June, 2024; originally announced June 2024.

Comments: 77 pages total; consisting of a main article (28 pages) and supplement (49 pages)

MSC Class: 62G10; 62C20; 68P27; 62F30

arXiv:2405.09493 [pdf, ps, other]

C-Learner: Constrained Learning for Causal Inference and Semiparametric Statistics

Authors: Tiffany Tianhui Cai, Yuri Fonseca, Kaiwen Hou, Hongseok Namkoong

Abstract: Causal estimation (e.g. of the average treatment effect) requires estimating complex nuisance parameters (e.g. outcome models). To adjust for errors in nuisance parameter estimation, we present a novel correction method that solves for the best plug-in estimator under the constraint that the first-order error of the estimator with respect to the nuisance parameter estimate is zero. Our constrained learning framework provides a unifying perspective to prominent first-order correction approaches including one-step estimation (a.k.a. augmented inverse probability weighting) and targeting (a.k.a. targeted maximum likelihood estimation). Our semiparametric inference approach, which we call the "C-Learner", can be implemented with modern machine learning methods such as neural networks and tree ensembles, and enjoys standard guarantees like semiparametric efficiency and double robustness. Empirically, we demonstrate our approach on several datasets, including those with text features that require fine-tuning language models. We observe the C-Learner matches or outperforms other asymptotically optimal estimators, with better performance in settings with less estimated overlap.

Submitted 22 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

arXiv:2401.12331 [pdf, other]

Transfer Learning for Functional Mean Estimation: Phase Transition and Adaptive Algorithms

Authors: T. Tony Cai, Dongwoo Kim, Hongming Pu

Abstract: This paper studies transfer learning for estimating the mean of random functions based on discretely sampled data, where, in addition to observations from the target distribution, auxiliary samples from similar but distinct source distributions are available. The paper considers both common and independent designs and establishes the minimax rates of convergence for both designs. The results reveal an interesting phase transition phenomenon under the two designs and demonstrate the benefits of utilizing the source samples in the low sampling frequency regime. For practical applications, this paper proposes novel data-driven adaptive algorithms that attain the optimal rates of convergence within a logarithmic factor simultaneously over a large collection of parameter spaces. The theoretical findings are complemented by a simulation study that further supports the effectiveness of the proposed algorithms.

Submitted 27 March, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

MSC Class: Primary 62J05; secondary 62G20

arXiv:2401.12272 [pdf, other]

Transfer Learning for Nonparametric Regression: Non-asymptotic Minimax Analysis and Adaptive Procedure

Authors: T. Tony Cai, Hongming Pu

Abstract: Transfer learning for nonparametric regression is considered. We first study the non-asymptotic minimax risk for this problem and develop a novel estimator called the confidence thresholding estimator, which is shown to achieve the minimax optimal risk up to a logarithmic factor. Our results demonstrate two unique phenomena in transfer learning: auto-smoothing and super-acceleration, which differentiate it from nonparametric regression in a traditional setting. We then propose a data-driven algorithm that adaptively achieves the minimax risk up to a logarithmic factor across a wide range of parameter spaces. Simulation studies are conducted to evaluate the numerical performance of the adaptive transfer learning algorithm, and a real-world example is provided to demonstrate the benefits of the proposed method.

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.03820 [pdf, other]

Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices

Authors: T. Tony Cai, Dong Xia, Mengyue Zha

Abstract: Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (PCA) and covariance estimation within the spiked covariance model. We precisely characterize the sensitivity of eigenvalues and eigenvectors under this model and establish the minimax rates of convergence for estimating both the principal components and covariance matrix. These rates hold up to logarithmic factors and encompass general Schatten norms, including spectral norm, Frobenius norm, and nuclear norm as special cases. We introduce computationally efficient differentially private estimators and prove their minimax optimality, up to logarithmic factors. Additionally, matching minimax lower bounds are established. Notably, in comparison with existing literature, our results accommodate a diverging rank, necessitate no eigengap condition between distinct principal components, and remain valid even if the sample size is much smaller than the dimension.

Submitted 8 January, 2024; originally announced January 2024.

arXiv:2305.00164 [pdf, other]

doi 10.1214/24-AOS2355

Estimation and inference for minimizer and minimum of convex functions: optimality, adaptivity and uncertainty principles

Authors: T. Tony Cai, Ran Chen, Yuancheng Zhu

Abstract: Optimal estimation and inference for both the minimizer and minimum of a convex regression function under the white noise and nonparametric regression models are studied in a nonasymptotic local minimax framework, where the performance of a procedure is evaluated at individual functions. Fully adaptive and computationally efficient algorithms are proposed and sharp minimax lower bounds are given for both the estimation accuracy and expected length of confidence intervals for the minimizer and minimum. The nonasymptotic local minimax framework brings out new phenomena in simultaneous estimation and inference for the minimizer and minimum. We establish a novel uncertainty principle that provides a fundamental limit on how well the minimizer and minimum can be estimated simultaneously for any convex regression function. A similar result holds for the expected length of the confidence intervals for the minimizer and minimum.

Submitted 9 March, 2024; v1 submitted 29 April, 2023; originally announced May 2023.

Journal ref: Ann. Statist. 52(1): 392-411 (February 2024)

arXiv:2303.07152 [pdf, ps, other]

Score Attack: A Lower Bound Technique for Optimal Differentially Private Learning

Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

Abstract: Achieving optimal statistical performance while ensuring the privacy of personal data is a challenging yet crucial objective in modern data analysis. However, characterizing the optimality, particularly the minimax lower bound, under privacy constraints is technically difficult. To address this issue, we propose a novel approach called the score attack, which provides a lower bound on the differential-privacy-constrained minimax risk of parameter estimation. The score attack method is based on the tracing attack concept in differential privacy and can be applied to any statistical model with a well-defined score statistic. It can optimally lower bound the minimax risk of estimating unknown model parameters, up to a logarithmic factor, while ensuring differential privacy for a range of statistical problems. We demonstrate the effectiveness and optimality of this general method in various examples, such as the generalized linear model in both classical and high-dimensional sparse settings, the Bradley-Terry-Luce model for pairwise comparisons, and nonparametric regression over the Sobolev class.

Submitted 13 March, 2023; originally announced March 2023.

Comments: arXiv admin note: substantial text overlap with arXiv:2011.03900

MSC Class: 62F30; 62J12; 62G05

arXiv:2303.02011 [pdf, other]

Diagnosing Model Performance Under Distribution Shift

Authors: Tiffany Tianhui Cai, Hongseok Namkoong, Steve Yadlowsky

Abstract: Prediction models can perform poorly when deployed to target distributions different from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), to attribute a drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into terms for 1) an increase in harder but frequently seen examples from training, 2) changes in the relationship between features and outcomes, and 3) poor performance on examples infrequent or unseen during training. These terms are defined by fixing a distribution on $X$ while varying the conditional distribution of $Y \mid X$ between training and target, or by fixing the conditional distribution of $Y \mid X$ while varying the distribution on $X$. In order to do this, we define a hypothetical distribution on $X$ consisting of values common in both training and target, over which it is easy to compare $Y \mid X$ and thus predictive performance. We estimate performance on this hypothetical distribution via reweighting methods. Empirically, we show how our method can 1) inform potential modeling improvements across distribution shifts for employment prediction on tabular census data, and 2) help to explain why certain domain adaptation methods fail to improve model performance for satellite image classification.

Submitted 10 July, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

arXiv:2301.10392 [pdf, other]

Statistical Inference and Large-scale Multiple Testing for High-dimensional Regression Models

Authors: T. Tony Cai, Zijian Guo, Yin Xia

Abstract: This paper presents a selective survey of recent developments in statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. We examine the construction of confidence intervals and hypothesis tests for various low-dimensional objectives such as regression coefficients and linear and quadratic functionals. The key technique is to generate debiased and desparsified estimators for the targeted low-dimensional objectives and estimate their uncertainty. In addition to covering the motivations for and intuitions behind these statistical methods, we also discuss their optimality and adaptivity in the context of high-dimensional inference. In addition, we review the recent development of statistical inference based on multiple regression models and the advancement of large-scale multiple testing for high-dimensional regression. The R package SIHR has implemented some of the high-dimensional inference methods discussed in this paper.

Submitted 24 January, 2023; originally announced January 2023.

arXiv:2301.01381 [pdf, other]

Testing High-dimensional Multinomials with Applications to Text Analysis

Authors: T. Tony Cai, Zheng Tracy Ke, Paxton Turner

Abstract: Motivated by applications in text mining and discrete distribution inference, we investigate the testing for equality of probability mass functions of $K$ groups of high-dimensional multinomial distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null, is proposed. The optimal detection boundary is established, and the proposed test is shown to achieve this optimal detection boundary across the entire parameter space of interest. The proposed method is demonstrated in simulation studies and applied to analyze two real-world datasets to examine variation among consumer reviews of Amazon movies and diversity of statistical paper abstracts.

Submitted 24 November, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

arXiv:2211.12612 [pdf, ps, other]

Transfer Learning for Contextual Multi-armed Bandits

Authors: Changxiao Cai, T. Tony Cai, Hongzhe Li

Abstract: Motivated by a range of applications, we study in this paper the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where we have data collected on source bandits before the start of the target bandit learning. The minimax rate of convergence for the cumulative regret is established and a novel transfer learning algorithm that attains the minimax regret is proposed. The results quantify the contribution of the data from the source domains for learning in the target domain in the context of nonparametric contextual multi-armed bandits. In view of the general impossibility of adaptation to unknown smoothness, we develop a data-driven algorithm that achieves near-optimal statistical guarantees (up to a logarithmic factor) while automatically adapting to the unknown parameters over a large collection of parameter spaces under an additional self-similarity assumption. A simulation study is carried out to illustrate the benefits of utilizing the data from the auxiliary source domains for learning in the target domain.

Submitted 24 January, 2024; v1 submitted 22 November, 2022; originally announced November 2022.

Comments: Accepted to the Annals of Statistics

arXiv:2203.11461 [pdf, other]

Locally Adaptive Algorithms for Multiple Testing with Network Structure, with Application to Genome-Wide Association Studies

Authors: Ziyi Liang, T. Tony Cai, Wenguang Sun, Yin Xia

Abstract: Linkage analysis has provided valuable insights to the GWAS studies, particularly in revealing that SNPs in linkage disequilibrium (LD) can jointly influence disease phenotypes. However, the potential of LD network data has often been overlooked or underutilized in the literature. In this paper, we propose a locally adaptive structure learning algorithm (LASLA) that provides a principled and generic framework for incorporating network data or multiple samples of auxiliary data from related source domains, possibly in different dimensions/structures and from diverse populations. LASLA employs a $p$-value weighting approach, utilizing structural insights to assign data-driven weights to individual test points. Theoretical analysis shows that LASLA can asymptotically control FDR with independent or weakly dependent primary statistics, and achieve higher power when the network data is informative. The efficiency gain of LASLA is illustrated through various synthetic experiments and an application to T2D-associated SNP identification.

Submitted 16 August, 2023; v1 submitted 22 March, 2022; originally announced March 2022.

Comments: 33 pages, 7 figures

arXiv:2202.10007 [pdf, other]

Statistical Inference for Genetic Relatedness Based on High-Dimensional Logistic Regression

Authors: Rong Ma, Zijian Guo, T. Tony Cai, Hongzhe Li

Abstract: This paper studies the problem of statistical inference for genetic relatedness between binary traits based on individual-level genome-wide association data. Specifically, under the high-dimensional logistic regression models, we define parameters characterizing the cross-trait genetic correlation, the genetic covariance and the trait-specific genetic variance. A novel weighted debiasing method is developed for the logistic Lasso estimator and computationally efficient debiased estimators are proposed. The rates of convergence for these estimators are studied and their asymptotic normality is established under mild conditions. Moreover, we construct confidence intervals and statistical tests for these parameters, and provide theoretical justifications for the methods, including the coverage probability and expected length of the confidence intervals, as well as the size and power of the proposed tests. Numerical studies are conducted under both model generated data and simulated genetic data to show the superiority of the proposed methods. By analyzing a real data set on autoimmune diseases, we demonstrate its ability to obtain novel insights about the shared genetic architecture between ten pediatric autoimmune diseases.

Submitted 5 October, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

arXiv:2201.06438 [pdf, other]

Matrix Reordering for Noisy Disordered Matrices: Optimality and Computationally Efficient Algorithms

Authors: T. Tony Cai, Rong Ma

Abstract: Motivated by applications in single-cell biology and metagenomics, we investigate the problem of matrix reordering based on a noisy disordered monotone Toeplitz matrix model. We establish the fundamental statistical limit for this problem in a decision-theoretic framework and demonstrate that a constrained least squares estimator achieves the optimal rate. However, due to its computational complexity, we analyze a popular polynomial-time algorithm, spectral seriation, and show that it is suboptimal. To address this, we propose a novel polynomial-time adaptive sorting algorithm with guaranteed performance improvement. Simulations and analyses of two real single-cell RNA sequencing datasets demonstrate the superiority of our algorithm over existing methods.

Submitted 13 August, 2023; v1 submitted 17 January, 2022; originally announced January 2022.

Comments: accepted by IEEE Transactions on Information Theory

arXiv:2201.03727 [pdf, ps, other]

Estimation and Inference with Proxy Data and its Genetic Applications

Authors: Sai Li, T. Tony Cai, Hongzhe Li

Abstract: Existing high-dimensional statistical methods are largely established for analyzing individual-level data. In this work, we study estimation and inference for high-dimensional linear models where we only observe "proxy data", which include the marginal statistics and sample covariance matrix that are computed based on different sets of individuals. We develop a rate optimal method for estimation and inference for the regression coefficient vector and its linear functionals based on the proxy data. Moreover, we show the intrinsic limitations in the proxy-data based inference: the minimax optimal rate for estimation is slower than that in the conventional case where individual data are observed; the power for testing and multiple testing does not go to one as the signal strength goes to infinity. These interesting findings are illustrated through simulation studies and an analysis of a dataset concerning the genetic associations of hindlimb muscle weight in a mouse population.

Submitted 10 January, 2022; originally announced January 2022.

arXiv:2109.03365 [pdf, other]

SIHR: Statistical Inference in High-Dimensional Linear and Logistic Regression Models

Authors: Prabrisha Rakshit, Zhenyu Wang, T. Tony Cai, Zijian Guo

Abstract: We introduce the R package SIHR for statistical inference in high-dimensional generalized linear models with continuous and binary outcomes. The package provides functionalities for constructing confidence intervals and performing hypothesis tests for low-dimensional objectives in both one-sample and two-sample regression settings. We illustrate the usage of SIHR through numerical examples and present real data applications to demonstrate the package's performance and practicality.

Submitted 1 May, 2023; v1 submitted 7 September, 2021; originally announced September 2021.

arXiv:2107.00179 [pdf]

Distributed Nonparametric Function Estimation: Optimal Rate of Convergence and Cost of Adaptation

Authors: T. Tony Cai, Hongji Wei

Abstract: Distributed minimax estimation and distributed adaptive estimation under communication constraints for Gaussian sequence model and white noise model are studied. The minimax rate of convergence for distributed estimation over a given Besov class, which serves as a benchmark for the cost of adaptation, is established. We then quantify the exact communication cost for adaptation and construct an optimally adaptive procedure for distributed estimation over a range of Besov classes. The results demonstrate significant differences between nonparametric function estimation in the distributed setting and the conventional centralized setting. For global estimation, adaptation in general cannot be achieved for free in the distributed setting. The new technical tools to obtain the exact characterization for the cost of adaptation can be of independent interest.

Submitted 30 June, 2021; originally announced July 2021.

MSC Class: 62F30

arXiv:2105.07536 [pdf, other]

Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data

Authors: T. Tony Cai, Rong Ma

Abstract: This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering, and fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring the intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications.

Submitted 31 October, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

Comments: Accepted by Journal of Machine Learning Research

arXiv:2011.03900 [pdf, other]

The Cost of Privacy in Generalized Linear Models: Algorithms and Minimax Lower Bounds

Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

Abstract: We propose differentially private algorithms for parameter estimation in both low-dimensional and high-dimensional sparse generalized linear models (GLMs) by constructing private versions of projected gradient descent. We show that the proposed algorithms are nearly rate-optimal by characterizing their statistical performance and establishing privacy-constrained minimax lower bounds for GLMs. The lower bounds are obtained via a novel technique, which is based on Stein's Lemma and generalizes the tracing attack technique for privacy-constrained lower bounds. This lower bound argument can be of independent interest as it is applicable to general parametric models. Simulated and real data experiments are conducted to demonstrate the numerical performance of our algorithms.

Submitted 5 December, 2020; v1 submitted 7 November, 2020; originally announced November 2020.

Comments: 56 pages, 6 figures

arXiv:2011.03598 [pdf, other]

Estimation, Confidence Intervals, and Large-Scale Hypotheses Testing for High-Dimensional Mixed Linear Regression

Authors: Linjun Zhang, Rong Ma, T. Tony Cai, Hongzhe Li

Abstract: This paper studies the high-dimensional mixed linear regression (MLR) where the output variable comes from one of the two linear regression models with an unknown mixing proportion and an unknown covariance structure of the random covariates. Building upon a high-dimensional EM algorithm, we propose an iterative procedure for estimating the two regression vectors and establish their rates of convergence. Based on the iterative estimators, we further construct debiased estimators and establish their asymptotic normality. For individual coordinates, confidence intervals centered at the debiased estimators are constructed. Furthermore, a large-scale multiple testing procedure is proposed for testing the regression coefficients and is shown to control the false discovery rate (FDR) asymptotically. Simulation studies are carried out to examine the numerical performance of the proposed methods and their superiority over existing methods. The proposed methods are further illustrated through an analysis of a dataset of multiplex image cytometry, which investigates the interaction networks among the cellular phenotypes that include the expression levels of 20 epitopes or combinations of markers.

Submitted 6 November, 2020; originally announced November 2020.

arXiv:2010.11037 [pdf, ps, other]

Transfer Learning in Large-scale Gaussian Graphical Models with False Discovery Rate Control

Authors: Sai Li, T. Tony Cai, Hongzhe Li

Abstract: Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied with the goal of estimating the target GGM by utilizing the data from similar and related auxiliary studies. The similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster convergence rate than the minimax rate in the single study setting. Furthermore, a debiased Trans-CLIME estimator is introduced and shown to be element-wise asymptotically normal. It is used to construct a multiple testing procedure for edge detection with false discovery rate control. The proposed estimation and multiple testing procedures demonstrate superior numerical performance in simulations and are applied to infer the gene networks in a target brain tissue by leveraging the gene expressions from multiple other brain tissues. A significant decrease in prediction errors and a significant increase in power for link detection are observed.

Submitted 21 October, 2020; originally announced October 2020.

arXiv:2010.06682 [pdf, other]

Are all negatives created equal in contrastive instance discrimination?

Authors: Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, Ari S. Morcos

Abstract: Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is then used on downstream tasks such as image classification. Using methodology from MoCo v2 (Chen et al., 2020), we divided negatives by their difficulty for a given query and studied which difficulty ranges were most important for learning useful representations. We found a minority of negatives -- the hardest 5% -- were both necessary and sufficient for the downstream task to reach nearly full accuracy. Conversely, the easiest 95% of negatives were unnecessary and insufficient. Moreover, the very hardest 0.1% of negatives were unnecessary and sometimes detrimental. Finally, we studied the properties of negatives that affect their hardness, and found that hard negatives were more semantically similar to the query, and that some negatives were more consistently easy or hard than we would expect by chance. Together, our results indicate that negatives vary in importance and that CID may benefit from more intelligent negative treatment.

Submitted 25 October, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: Fixed author name error

arXiv:2008.12434 [pdf, ps, other]

On the Non-Asymptotic Concentration of Heteroskedastic Wishart-type Matrix

Authors: T. Tony Cai, Rungang Han, Anru R. Zhang

Abstract: This paper focuses on the non-asymptotic concentration of the heteroskedastic Wishart-type matrices. Suppose $Z$ is a $p_1$-by-$p_2$ random matrix with $Z_{ij} \sim N(0,σ_{ij}^2)$ independently. We prove that the expected spectral norm of Wishart matrix deviations (i.e., $\mathbb{E} \left\|ZZ^\top - \mathbb{E} ZZ^\top\right\|$) is upper bounded by \begin{equation*} \begin{split} (1+ε)\left\{2σ_Cσ_R + σ_C^2 + Cσ_Rσ_*\sqrt{\log(p_1 \wedge p_2)} + Cσ_*^2\log(p_1 \wedge p_2)\right\}, \end{split} \end{equation*} where $σ_C^2 := \max_j \sum_{i=1}^{p_1}σ_{ij}^2$, $σ_R^2 := \max_i \sum_{j=1}^{p_2}σ_{ij}^2$ and $σ_*^2 := \max_{i,j}σ_{ij}^2$. A minimax lower bound is developed that matches this upper bound. Then, we derive the concentration inequalities, moments, and tail bounds for the heteroskedastic Wishart-type matrix under more general distributions, such as sub-Gaussian and heavy-tailed distributions. Next, we consider the cases where $Z$ has homoskedastic columns or rows (i.e., $σ_{ij} \approx σ_i$ or $σ_{ij} \approx σ_j$) and derive the rate-optimal Wishart-type concentration bounds. Finally, we apply the developed tools to identify the sharp signal-to-noise ratio threshold for consistent clustering in the heteroskedastic clustering problem.

Submitted 16 February, 2022; v1 submitted 27 August, 2020; originally announced August 2020.

Comments: Electronic Journal of Probability, to appear

arXiv:2006.10593 [pdf, ps, other]

Transfer Learning for High-dimensional Linear Regression: Prediction, Estimation, and Minimax Optimality

Authors: Sai Li, T. Tony Cai, Hongzhe Li

Abstract: This paper considers the estimation and prediction of a high-dimensional linear regression in the setting of transfer learning, using samples from the target model as well as auxiliary samples from different but possibly related regression models. When the set of "informative" auxiliary samples is known, an estimator and a predictor are proposed and their optimality is established. The optimal rates of convergence for prediction and estimation are faster than the corresponding rates without using the auxiliary samples. This implies that knowledge from the informative auxiliary samples can be transferred to improve the learning performance of the target problem. In the case that the set of informative auxiliary samples is unknown, we propose a data-driven procedure for transfer learning, called Trans-Lasso, and reveal its robustness to non-informative auxiliary samples and its efficiency in knowledge transfer. The proposed procedures are demonstrated in numerical studies and are applied to a dataset concerning the associations among gene expressions. It is shown that Trans-Lasso leads to improved performance in gene expression prediction in a target tissue by incorporating the data from multiple different tissues as auxiliary samples.

Submitted 18 June, 2020; originally announced June 2020.

arXiv:2006.01393 [pdf, other]

Two Robust Tools for Inference about Causal Effects with Invalid Instruments

Authors: Hyunseung Kang, Youjin Lee, T. Tony Cai, Dylan S. Small

Abstract: Instrumental variables have been widely used to estimate the causal effect of a treatment on an outcome. Existing confidence intervals for causal effects based on instrumental variables assume that all of the putative instrumental variables are valid; a valid instrumental variable is a variable that affects the outcome only by affecting the treatment and is not related to unmeasured confounders. However, in practice, some of the putative instrumental variables are likely to be invalid. This paper presents two tools to conduct valid inference and tests in the presence of invalid instruments. First, we propose a simple and general approach to construct confidence intervals based on taking unions of well-known confidence intervals. Second, we propose a novel test for the null causal effect based on a collider bias. Our two proposals, especially when fused together, outperform traditional instrumental variable confidence intervals when invalid instruments are present, and can also be used as a sensitivity analysis when there is concern that instrumental variables assumptions are violated. The new approach is applied to a Mendelian randomization study on the causal effect of low-density lipoprotein on the incidence of cardiovascular diseases.

Submitted 2 June, 2020; originally announced June 2020.

arXiv:2002.07624 [pdf, other]

Optimal Structured Principal Subspace Estimation: Metric Entropy and Minimax Rates

Authors: T. Tony Cai, Hongzhe Li, Rong Ma

Abstract: Driven by a wide range of applications, many principal subspace estimation problems have been studied individually under different structural constraints. This paper presents a unified framework for the statistical analysis of a general structured principal subspace estimation problem which includes as special cases non-negative PCA/SVD, sparse PCA/SVD, subspace constrained PCA/SVD, and spectral clustering. General minimax lower and upper bounds are established to characterize the interplay between the information-geometric complexity of the structural set for the principal subspaces, the signal-to-noise ratio (SNR), and the dimensionality. The results yield interesting phase transition phenomena concerning the rates of convergence as a function of the SNRs and the fundamental limit for consistent estimation. Applying the general results to the specific settings yields the minimax rates of convergence for those problems, including the previously unknown optimal rates for non-negative PCA/SVD, sparse SVD and subspace constrained PCA/SVD.

Submitted 16 November, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

arXiv:2001.08877 [pdf, other]

Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms

Authors: T. Tony Cai, Hongji Wei

Abstract: We study distributed estimation of a Gaussian mean under communication constraints in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between the communication costs and statistical accuracy, are established in both the univariate and multivariate settings. Communication-efficient and statistically optimal procedures are developed. In the univariate case, the optimal rate depends only on the total communication budget, so long as each local machine has at least one bit. However, in the multivariate case, the minimax rate depends on the specific allocations of the communication budgets among the local machines. Although optimal estimation of a Gaussian mean is relatively simple in the conventional setting, it is quite involved under the communication constraints, both in terms of the optimal procedure design and lower bound argument. The techniques developed in this paper can be of independent interest. An essential step is the decomposition of the minimax estimation problem into two stages, localization and refinement. This critical decomposition provides a framework for both the lower bound analysis and optimal procedure design.

Submitted 23 January, 2020; originally announced January 2020.

arXiv:1912.02872 [pdf, ps, other]

A Convex Optimization Approach to High-Dimensional Sparse Quadratic Discriminant Analysis

Authors: T. Tony Cai, Linjun Zhang

Abstract: In this paper, we study high-dimensional sparse Quadratic Discriminant Analysis (QDA) and aim to establish the optimal convergence rates for the classification error. Minimax lower bounds are established to demonstrate the necessity of structural assumptions such as sparsity conditions on the discriminating direction and differential graph for the possible construction of consistent high-dimensional QDA rules. We then propose a classification algorithm called SDAR using constrained convex optimization under the sparsity assumptions. Both minimax upper and lower bounds are obtained and this classification rule is shown to be simultaneously rate optimal over a collection of parameter spaces, up to a logarithmic factor. Simulation studies demonstrate that SDAR performs well numerically. The algorithm is also illustrated through an analysis of prostate cancer data and colon tissue data. The methodology and theory developed for high-dimensional QDA for two groups in the Gaussian setting are also extended to multi-group classification and to classification under the Gaussian copula model.

Submitted 5 December, 2019; originally announced December 2019.

arXiv:1911.12516 [pdf, other]

doi 10.1093/biomet/asaa082

Optimal Estimation of Bacterial Growth Rates Based on Permuted Monotone Matrix

Authors: Rong Ma, T. Tony Cai, Hongzhe Li

Abstract: Motivated by the problem of estimating the bacterial growth rates for genome assemblies from shotgun metagenomic data, we consider the permuted monotone matrix model $Y=ΘΠ+Z$, where $Y\in \mathbb{R}^{n\times p}$ is observed, $Θ\in \mathbb{R}^{n\times p}$ is an unknown approximately rank-one signal matrix with monotone rows, $Π\in \mathbb{R}^{p\times p}$ is an unknown permutation matrix, and $Z\in \mathbb{R}^{n\times p}$ is the noise matrix. This paper studies the estimation of the extreme values associated to the signal matrix $Θ$, including its first and last columns, as well as their difference. Treating these estimation problems as compound decision problems, minimax rate-optimal estimators are constructed using the spectral column sorting method. Numerical experiments through simulated and synthetic microbiome metagenomic data are presented, showing the superiority of the proposed methods over the alternatives. The methods are illustrated by comparing the growth rates of gut bacteria between inflammatory bowel disease patients and normal controls.

Submitted 26 August, 2020; v1 submitted 27 November, 2019; originally announced November 2019.

Journal ref: Biometrika (2020)

arXiv:1911.11345 [pdf, other]

High Dimensional M-Estimation with Missing Outcomes: A Semi-Parametric Framework

Authors: Abhishek Chakrabortty, Jiarui Lu, T. Tony Cai, Hongzhe Li

Abstract: We consider high dimensional $M$-estimation in settings where the response $Y$ is possibly missing at random and the covariates $\mathbf{X} \in \mathbb{R}^p$ can be high dimensional compared to the sample size $n$. The parameter of interest $\boldsymbol{θ}_0 \in \mathbb{R}^d$ is defined as the minimizer of the risk of a convex loss, under a fully non-parametric model, and $\boldsymbol{θ}_0$ itself is high dimensional which is a key distinction from existing works. Standard high dimensional regression and series estimation with possibly misspecified models and missing $Y$ are included as special cases, as well as their counterparts in causal inference using 'potential outcomes'. Assuming $\boldsymbol{θ}_0$ is $s$-sparse ($s \ll n$), we propose an $L_1$-regularized debiased and doubly robust (DDR) estimator of $\boldsymbol{θ}_0$ based on a high dimensional adaptation of the traditional double robust (DR) estimator's construction. Under mild tail assumptions and arbitrarily chosen (working) models for the propensity score (PS) and the outcome regression (OR) estimators, satisfying only some high-level conditions, we establish finite sample performance bounds for the DDR estimator showing its (optimal) $L_2$ error rate to be $\sqrt{s (\log d)/ n}$ when both models are correct, and its consistency and DR properties when only one of them is correct. Further, when both the models are correct, we propose a desparsified version of our DDR estimator that satisfies an asymptotic linear expansion and facilitates inference on low dimensional components of $\boldsymbol{θ}_0$. Finally, we discuss various choices of high dimensional parametric/semi-parametric working models for the PS and OR estimators. All results are validated via detailed simulations.

Submitted 26 November, 2019; originally announced November 2019.

Comments: 34 pages, 4 tables; (Supplement: 58 pages, 10 tables);

arXiv:1911.10604 [pdf, other]

doi 10.1080/01621459.2020.1713794

Optimal Permutation Recovery in Permuted Monotone Matrix Model

Authors: Rong Ma, T. Tony Cai, Hongzhe Li

Abstract: Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model $Y=ΘΠ+Z$, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, $Θ$ is an unknown mean matrix with monotone entries for each row, $Π$ is a permutation matrix that permutes the columns of $Θ$, and $Z$ is a noise matrix. This paper studies the problem of estimation/recovery of $Π$ given the observed noisy matrix $Y$. We propose an estimator based on the best linear projection, which is shown to be minimax rate-optimal for both exact recovery, as measured by the 0-1 loss, and partial recovery, as quantified by the normalized Kendall's tau distance. Simulation studies demonstrate the superior empirical performance of the proposed estimator over alternative methods. We demonstrate the methods using a synthetic metagenomics dataset of 45 closely related bacterial species and a real metagenomic dataset to compare the bacterial growth dynamics between the responders and the non-responders of the IBD patients after 8 weeks of treatment.

Submitted 13 July, 2020; v1 submitted 24 November, 2019; originally announced November 2019.

Journal ref: Journal of the American Statistical Association, 2020

arXiv:1909.09851 [pdf, other]

Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference

Authors: T. Tony Cai, Anru R. Zhang, Yuchen Zhou

Abstract: We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model -- an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

Submitted 6 May, 2022; v1 submitted 21 September, 2019; originally announced September 2019.

Comments: IEEE Transactions on Information Theory, to appear

arXiv:1909.01503 [pdf, other]

Group Inference in High Dimensions with Applications to Hierarchical Testing

Authors: Zijian Guo, Claude Renaux, Peter Bühlmann, T. Tony Cai

Abstract: High-dimensional group inference is an essential part of statistical methods for analysing complex data sets, including hierarchical testing, tests of interaction, detection of heterogeneous treatment effects and inference for local heritability. Group inference in regression models can be measured with respect to a weighted quadratic functional of the regression sub-vector corresponding to the group. Asymptotically unbiased estimators of these weighted quadratic functionals are constructed and a novel procedure using these estimators for inference is proposed. We derive its asymptotic Gaussian distribution which enables the construction of asymptotically valid confidence intervals and tests which perform well in terms of length or power. The proposed test is computationally efficient even for a large group, statistically valid for any group size, and achieves good power performance for testing large groups with many small regression coefficients. We apply the methodology to several interesting statistical problems and demonstrate its strength and usefulness on simulated and real data.

Submitted 30 November, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

arXiv:1907.06116 [pdf, ps, other]

Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach

Authors: Sai Li, Tony T. Cai, Hongzhe Li

Abstract: Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the dimension of the random effects and the cluster sizes are possibly large. Regarding the fixed effects, we provide rate optimal estimators and valid inference procedures that do not rely on the structural information of the variance components. We also study the estimation of variance components with high-dimensional fixed effects in general settings. The algorithms are easy to implement and computationally fast. The proposed methods are assessed in various simulation settings and are applied to a real study regarding the associations between body mass index and genetic polymorphic markers in a heterogeneous stock mice population.

Submitted 9 March, 2021; v1 submitted 13 July, 2019; originally announced July 2019.

Comments: 32 pages, 3 figures

MSC Class: 62H15; 62J07

arXiv:1906.02903 [pdf, other]

Transfer Learning for Nonparametric Classification: Minimax Rate and Adaptive Classifier

Authors: T. Tony Cai, Hongji Wei

Abstract: Human learners have the natural ability to use knowledge gained in one setting for learning in a different but related setting. This ability to transfer knowledge from one task to another is essential for effective learning. In this paper, we study transfer learning in the context of nonparametric classification based on observations from different distributions under the posterior drift model, which is a general framework and arises in many practical problems. We first establish the minimax rate of convergence and construct a rate-optimal two-sample weighted $K$-NN classifier. The results characterize precisely the contribution of the observations from the source distribution to the classification task under the target distribution. A data-driven adaptive classifier is then proposed and is shown to simultaneously attain within a logarithmic factor of the optimal rate over a large collection of parameter spaces. Simulation studies and real data applications are carried out where the numerical results further illustrate the theoretical analysis. Extensions to the case of multiple source distributions are also considered.

Submitted 7 June, 2019; originally announced June 2019.

arXiv:1905.08757 [pdf, other]

Asymptotic Analysis for Extreme Eigenvalues of Principal Minors of Random Matrices

Authors: T. Tony Cai, Tiefeng Jiang, Xiaoou Li

Abstract: Consider a standard white Wishart matrix with parameters $n$ and $p$. Motivated by applications in high-dimensional statistics and signal processing, we perform asymptotic analysis on the maxima and minima of the eigenvalues of all the $m \times m$ principal minors, under the asymptotic regime that $n,p,m$ go to infinity. Asymptotic results concerning extreme eigenvalues of principal minors of real Wigner matrices are also obtained. In addition, we discuss an application of the theoretical results to the construction of compressed sensing matrices, which provides insights to compressed sensing in signal processing and high dimensional linear regression in statistics.

Submitted 21 May, 2019; originally announced May 2019.

arXiv:1902.04495 [pdf, other]

The Cost of Privacy: Optimal Rates of Convergence for Parameter Estimation with Differential Privacy

Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

Abstract: Privacy-preserving data analysis is a rising challenge in contemporary statistics, as the privacy guarantees of statistical methods are often achieved at the expense of accuracy. In this paper, we investigate the tradeoff between statistical accuracy and privacy in mean estimation and linear regression, under both the classical low-dimensional and modern high-dimensional settings. A primary focus is to establish minimax optimality for statistical estimation with the $(\varepsilon,δ)$-differential privacy constraint. To this end, we find that classical lower bound arguments fail to yield sharp results, and new technical tools are called for. By refining the "tracing adversary" technique for lower bounds in the theoretical computer science literature, we formulate a general lower bound argument for minimax risks with differential privacy constraints, and apply this argument to high-dimensional mean estimation and linear regression problems. We also design computationally efficient algorithms that attain the minimax lower bounds up to a logarithmic factor. In particular, for the high-dimensional linear regression, a novel private iterative hard thresholding pursuit algorithm is proposed, based on a privately truncated version of stochastic gradient descent. The numerical performance of these algorithms is demonstrated by simulation studies and applications to real data containing sensitive information, for which privacy-preserving statistical methods are necessary.

Submitted 10 November, 2020; v1 submitted 12 February, 2019; originally announced February 2019.

Comments: 33 pages, 4 figures

arXiv:1810.08316 [pdf, other]

Heteroskedastic PCA: Algorithm, Optimality, and Applications

Authors: Anru R. Zhang, T. Tony Cai, Yihong Wu

Abstract: A general framework for principal component analysis (PCA) in the presence of heteroskedastic noise is introduced. We propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries of the sample covariance matrix to remove estimation bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covariance model. A key technical step is a deterministic robust perturbation analysis on singular subspaces, which can be of independent interest. The effectiveness of the proposed algorithm is demonstrated in a suite of problems in high-dimensional statistics, including singular value decomposition (SVD) under heteroskedastic noise, Poisson PCA, and SVD for heteroskedastic and incomplete data.

Submitted 1 April, 2021; v1 submitted 18 October, 2018; originally announced October 2018.

arXiv:1806.06179 [pdf, other]

Semi-supervised Inference for Explained Variance in High-dimensional Linear Regression and Its Applications

Authors: T. Tony Cai, Zijian Guo

Abstract: This paper considers statistical inference for the explained variance $β^{\intercal}Σβ$ under the high-dimensional linear model $Y=Xβ+ε$ in the semi-supervised setting, where $β$ is the regression vector and $Σ$ is the design covariance matrix. A calibrated estimator, which efficiently integrates both labelled and unlabelled data, is proposed. It is shown that the estimator achieves the minimax optimal rate of convergence in the general semi-supervised framework. The optimality result characterizes how the unlabelled data contributes to the estimation accuracy. Moreover, the limiting distribution for the proposed estimator is established and the unlabelled data has also proven useful in reducing the length of the confidence interval for the explained variance. The proposed method is extended to the semi-supervised inference for the unweighted quadratic functional, $\|β\|_2^2$. The obtained inference results are then applied to a range of high-dimensional statistical problems, including signal detection and global testing, prediction accuracy evaluation, and confidence ball construction. The numerical improvement of incorporating the unlabelled data is demonstrated through simulation studies and an analysis of estimating heritability for a yeast segregant data set with multiple traits.

Submitted 30 November, 2020; v1 submitted 16 June, 2018; originally announced June 2018.

arXiv:1805.06970 [pdf, other]

doi 10.1080/01621459.2019.1699421

Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models

Authors: Rong Ma, T. Tony Cai, Hongzhe Li

Abstract: High-dimensional logistic regression is widely used in analyzing data with binary outcomes. In this paper, global testing and large-scale multiple testing for the regression coefficients are considered in both single- and two-regression settings. A test statistic for testing the global null hypothesis is constructed using a generalized low-dimensional projection for bias correction and its asymptotic null distribution is derived. A lower bound for the global testing is established, which shows that the proposed test is asymptotically minimax optimal over some sparsity range. For testing the individual coefficients simultaneously, multiple testing procedures are proposed and shown to control the false discovery rate (FDR) and falsely discovered variables (FDV) asymptotically. Simulation studies are carried out to examine the numerical performance of the proposed tests and their superiority over existing methods. The testing procedures are also illustrated by analyzing a data set of a metabolomics study that investigates the association between fecal metabolites and pediatric Crohn's disease and the effects of treatment on such associations.

Submitted 19 November, 2020; v1 submitted 17 May, 2018; originally announced May 2018.

Comments: Typos corrected

Journal ref: Journal of the American Statistical Association (2019)

arXiv:1804.03018 [pdf, other]

High-dimensional Linear Discriminant Analysis: Optimality, Adaptive Algorithm, and Missing Data

Authors: T. Tony Cai, Linjun Zhang

Abstract: This paper aims to develop an optimality theory for linear discriminant analysis in the high-dimensional setting. A data-driven and tuning free classification rule, which is based on an adaptive constrained $\ell_1$ minimization approach, is proposed and analyzed. Minimax lower bounds are obtained and this classification rule is shown to be simultaneously rate optimal over a collection of parameter spaces. In addition, we consider classification with incomplete data under the missing completely at random (MCR) model. An adaptive classifier with theoretical guarantees is introduced and optimal rate of convergence for high-dimensional linear discriminant analysis under the MCR model is established. The technical analysis for the case of missing data is much more challenging than that for the complete data. We establish a large deviation result for the generalized sample covariance matrix, which serves as a key technical tool and can be of independent interest. An application to lung cancer and leukemia studies is also discussed.

Submitted 9 April, 2018; originally announced April 2018.

arXiv:1801.08120 [pdf, other]

doi 10.5705/ss.202019.0445

Optimal Estimation of Simultaneous Signals Using Absolute Inner Product with Applications to Integrative Genomics

Authors: Rong Ma, T. Tony Cai, Hongzhe Li

Abstract: Integrating the summary statistics from genome-wide association study (GWAS) and expression quantitative trait loci (eQTL) data provides a powerful way of identifying the genes whose expression levels are potentially associated with complex diseases. A parameter called $T$-score that quantifies the genetic overlap between a gene and the disease phenotype based on the summary statistics is introduced based on the mean values of two Gaussian sequences. Specifically, given two independent samples $\mathbf{x}_n\sim N(θ, Σ_1)$ and $\mathbf{y}_n\sim N(μ, Σ_2)$, the $T$-score is defined as $\sum_{i=1}^n |θ_iμ_i|$, a non-smooth functional, which characterizes the amount of shared signals between two absolute normal mean vectors $|θ|$ and $|μ|$. Using approximation theory, estimators are constructed and shown to be minimax rate-optimal and adaptive over various parameter spaces. Simulation studies demonstrate the superiority of the proposed estimators over existing methods. The method is applied to an integrative analysis of heart failure genomics datasets and we identify several genes and biological pathways that are potentially causal to human heart failure.

Submitted 4 October, 2020; v1 submitted 24 January, 2018; originally announced January 2018.

Journal ref: Statistica Sinica (2020)

arXiv:1801.00518 [pdf, ps, other]

Statistical and Computational Limits for Sparse Matrix Detection

Authors: T. Tony Cai, Yihong Wu

Abstract: This paper investigates the fundamental limits for detecting a high-dimensional sparse matrix contaminated by white Gaussian noise from both the statistical and computational perspectives. We consider $p\times p$ matrices whose rows and columns are individually $k$-sparse. We provide a tight characterization of the statistical and computational limits for sparse matrix detection, which precisely describe when achieving optimal detection is easy, hard, or impossible, respectively. Although the sparse matrices considered in this paper have no apparent submatrix structure and the corresponding estimation problem has no computational issue at all, the detection problem has a surprising computational barrier when the sparsity level $k$ exceeds the cubic root of the matrix size $p$: attaining the optimal detection boundary is computationally at least as hard as solving the planted clique problem. The same statistical and computational limits also hold in the sparse covariance matrix model, where each variable is correlated with at most $k$ others. A key step in the construction of the statistically optimal test is a structural property for sparse matrices, which can be of independent interest.

Submitted 1 January, 2018; originally announced January 2018.

arXiv:1709.03907 [pdf, other]

Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information

Authors: T. Tony Cai, Tengyuan Liang, Alexander Rakhlin

Abstract: We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum energy flow and the eigenvectors of a certain Markov transition matrix. The general SBM considered in this paper allows for unequal-size communities, degree heterogeneity, and different connection probabilities among blocks. We focus on how to optimally weigh the message passing to improve misclassification.

Submitted 12 September, 2017; originally announced September 2017.

Comments: 31 pages, 1 figure

Journal ref: Journal of Machine Learning Research 21 (2020) 1-34

arXiv:1609.06713 [pdf, other]

Testing Endogeneity with High Dimensional Covariates

Authors: Zijian Guo, Hyunseung Kang, T. Tony Cai, Dylan S. Small

Abstract: Modern, high dimensional data has renewed investigation on instrumental variables (IV) analysis, primarily focusing on estimation of effects of endogenous variables and putting little attention towards specification tests. This paper studies in high dimensions the Durbin-Wu-Hausman (DWH) test, a popular specification test for endogeneity in IV regression. We show, surprisingly, that the DWH test maintains its size in high dimensions, but at an expense of power. We propose a new test that remedies this issue and has better power than the DWH test. Simulation studies reveal that our test achieves near-oracle performance to detect endogeneity.

Submitted 7 March, 2018; v1 submitted 21 September, 2016; originally announced September 2016.

arXiv:1606.07268 [pdf, other]

Semi-supervised Inference: General Theory and Estimation of Means

Authors: Anru Zhang, Lawrence D. Brown, T. Tony Cai

Abstract: We propose a general semi-supervised inference framework focused on the estimation of the population mean. As usual in semi-supervised settings, there exists an unlabeled sample of covariate vectors and a labeled sample consisting of covariate vectors along with real-valued responses ("labels"). Otherwise, the formulation is "assumption-lean" in that no major conditions are imposed on the statistical or functional form of the data. We consider both the ideal semi-supervised setting where infinitely many unlabeled samples are available, as well as the ordinary semi-supervised setting in which only a finite number of unlabeled samples is available. Estimators are proposed along with corresponding confidence intervals for the population mean. Theoretical analysis on both the asymptotic distribution and $\ell_2$-risk for the proposed procedures are given. Surprisingly, the proposed estimators, based on a simple form of the least squares method, outperform the ordinary sample mean. The simple, transparent form of the estimator lends confidence to the perception that its asymptotic improvement over the ordinary sample mean also nearly holds even for moderate size samples. The method is further extended to a nonparametric setting, in which the oracle rate can be achieved asymptotically. The proposed estimators are further illustrated by simulation studies and a real data example involving estimation of the homeless population.

Submitted 13 August, 2018; v1 submitted 23 June, 2016; originally announced June 2016.

arXiv:1605.07244 [pdf, other]

Optimal Estimation of Co-heritability in High-dimensional Linear Models

Authors: Zijian Guo, Wanjie Wang, T. Tony Cai, Hongzhe Li

Abstract: Co-heritability is an important concept that characterizes the genetic associations within pairs of quantitative traits. There has been significant recent interest in estimating the co-heritability based on data from the genome-wide association studies (GWAS). This paper introduces two measures of co-heritability in the high-dimensional linear model framework, including the inner product of the two regression vectors and a normalized inner product by their lengths. Functional de-biased estimators (FDEs) are developed to estimate these two co-heritability measures. In addition, estimators of quadratic functionals of the regression vectors are proposed. Both theoretical and numerical properties of the estimators are investigated. In particular, minimax rates of convergence are established and the proposed estimators of the inner product, the quadratic functionals and the normalized inner product are shown to be rate-optimal. Simulation results show that the FDEs significantly outperform the naive plug-in estimates. The FDEs are also applied to analyze a yeast segregant data set with multiple traits to estimate heritability and co-heritability among the traits.

Submitted 23 May, 2016; originally announced May 2016.

arXiv:1605.04358 [pdf, other]

Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data

Authors: T. Tony Cai, Anru Zhang

Abstract: Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data.

Submitted 13 May, 2016; originally announced May 2016.

Showing 1–50 of 112 results for author: Cai, T T