
Showing 1–50 of 112 results for author: Cai, T T

  1. arXiv:2406.20088  [pdf, other]

    math.ST stat.ME stat.ML

    Minimax And Adaptive Transfer Learning for Nonparametric Classification under Distributed Differential Privacy Constraints

    Authors: Arnab Auddy, T. Tony Cai, Abhinav Chakraborty

    Abstract: This paper considers minimax and adaptive transfer learning for nonparametric classification under the posterior drift model with distributed differential privacy constraints. Our study is conducted within a heterogeneous framework, encompassing diverse sample sizes, varying privacy parameters, and data heterogeneity across different servers. We first establish the minimax misclassification rate,…

    Submitted 28 June, 2024; originally announced June 2024.

    MSC Class: 62G08; 62G20

  2. arXiv:2406.06755  [pdf, other]

    math.ST cs.LG stat.ML

    Optimal Federated Learning for Nonparametric Regression with Heterogeneous Distributed Differential Privacy Constraints

    Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

    Abstract: This paper studies federated learning for nonparametric regression in the context of distributed samples across different servers, each adhering to distinct differential privacy constraints. The setting we consider is heterogeneous, encompassing both varying sample sizes and differential privacy constraints across servers. Within this framework, both global and pointwise estimation are considered,…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 49 pages total, consisting of an article (24 pages) and a supplement (25 pages)

    MSC Class: 62G08; 62C20; 68P27; 62F30

  3. arXiv:2406.06749  [pdf, other]

    math.ST cs.LG stat.ML

    Federated Nonparametric Hypothesis Testing with Differential Privacy Constraints: Optimal Rates and Adaptive Tests

    Authors: T. Tony Cai, Abhinav Chakraborty, Lasse Vuursteen

    Abstract: Federated learning has attracted significant recent attention due to its applicability across a wide range of settings where data is collected and analyzed across disparate locations. In this paper, we study federated nonparametric goodness-of-fit testing in the white-noise-with-drift model under distributed differential privacy (DP) constraints. We first establish matching lower and upper bound…

    Submitted 10 June, 2024; originally announced June 2024.

    Comments: 77 pages total; consisting of a main article (28 pages) and supplement (49 pages)

    MSC Class: 62G10; 62C20; 68P27; 62F30

  4. arXiv:2405.09493  [pdf, ps, other]

    stat.ML cs.LG

    C-Learner: Constrained Learning for Causal Inference and Semiparametric Statistics

    Authors: Tiffany Tianhui Cai, Yuri Fonseca, Kaiwen Hou, Hongseok Namkoong

    Abstract: Causal estimation (e.g. of the average treatment effect) requires estimating complex nuisance parameters (e.g. outcome models). To adjust for errors in nuisance parameter estimation, we present a novel correction method that solves for the best plug-in estimator under the constraint that the first-order error of the estimator with respect to the nuisance parameter estimate is zero. Our constrained…

    Submitted 22 May, 2024; v1 submitted 15 May, 2024; originally announced May 2024.

  5. arXiv:2401.12331  [pdf, other]

    math.ST

    Transfer Learning for Functional Mean Estimation: Phase Transition and Adaptive Algorithms

    Authors: T. Tony Cai, Dongwoo Kim, Hongming Pu

    Abstract: This paper studies transfer learning for estimating the mean of random functions based on discretely sampled data, where, in addition to observations from the target distribution, auxiliary samples from similar but distinct source distributions are available. The paper considers both common and independent designs and establishes the minimax rates of convergence for both designs. The results revea…

    Submitted 27 March, 2024; v1 submitted 22 January, 2024; originally announced January 2024.

    MSC Class: Primary 62J05; secondary 62G20

  6. arXiv:2401.12272  [pdf, other]

    stat.ML cs.LG

    Transfer Learning for Nonparametric Regression: Non-asymptotic Minimax Analysis and Adaptive Procedure

    Authors: T. Tony Cai, Hongming Pu

    Abstract: Transfer learning for nonparametric regression is considered. We first study the non-asymptotic minimax risk for this problem and develop a novel estimator called the confidence thresholding estimator, which is shown to achieve the minimax optimal risk up to a logarithmic factor. Our results demonstrate two unique phenomena in transfer learning: auto-smoothing and super-acceleration, which differe…

    Submitted 22 January, 2024; originally announced January 2024.

  7. arXiv:2401.03820  [pdf, other]

    math.ST cs.IT stat.ME stat.ML

    Optimal Differentially Private PCA and Estimation for Spiked Covariance Matrices

    Authors: T. Tony Cai, Dong Xia, Mengyue Zha

    Abstract: Estimating a covariance matrix and its associated principal components is a fundamental problem in contemporary statistics. While optimal estimation procedures have been developed with well-understood properties, the increasing demand for privacy preservation introduces new complexities to this classical problem. In this paper, we study optimal differentially private Principal Component Analysis (…

    Submitted 8 January, 2024; originally announced January 2024.

  8. arXiv:2305.00164  [pdf, other]

    math.ST stat.ME

    Estimation and inference for minimizer and minimum of convex functions: optimality, adaptivity and uncertainty principles

    Authors: T. Tony Cai, Ran Chen, Yuancheng Zhu

    Abstract: Optimal estimation and inference for both the minimizer and minimum of a convex regression function under the white noise and nonparametric regression models are studied in a nonasymptotic local minimax framework, where the performance of a procedure is evaluated at individual functions. Fully adaptive and computationally efficient algorithms are proposed and sharp minimax lower bounds are given f…

    Submitted 9 March, 2024; v1 submitted 29 April, 2023; originally announced May 2023.

    Journal ref: Ann. Statist. 52(1): 392-411 (February 2024)

  9. arXiv:2303.07152  [pdf, ps, other]

    math.ST cs.CR cs.LG stat.ME stat.ML

    Score Attack: A Lower Bound Technique for Optimal Differentially Private Learning

    Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

    Abstract: Achieving optimal statistical performance while ensuring the privacy of personal data is a challenging yet crucial objective in modern data analysis. However, characterizing the optimality, particularly the minimax lower bound, under privacy constraints is technically difficult. To address this issue, we propose a novel approach called the score attack, which provides a lower bound on the differ…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2011.03900

    MSC Class: 62F30; 62J12; 62G05

  10. arXiv:2303.02011  [pdf, other]

    stat.ML cs.LG

    Diagnosing Model Performance Under Distribution Shift

    Authors: Tiffany Tianhui Cai, Hongseok Namkoong, Steve Yadlowsky

    Abstract: Prediction models can perform poorly when deployed to target distributions different from the training distribution. To understand these operational failure modes, we develop a method, called DIstribution Shift DEcomposition (DISDE), to attribute a drop in performance to different types of distribution shifts. Our approach decomposes the performance drop into terms for 1) an increase in harder but…

    Submitted 10 July, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

  11. arXiv:2301.10392  [pdf, other]

    stat.ME math.ST

    Statistical Inference and Large-scale Multiple Testing for High-dimensional Regression Models

    Authors: T. Tony Cai, Zijian Guo, Yin Xia

    Abstract: This paper presents a selective survey of recent developments in statistical inference and multiple testing for high-dimensional regression models, including linear and logistic regression. We examine the construction of confidence intervals and hypothesis tests for various low-dimensional objectives such as regression coefficients and linear and quadratic functionals. The key technique is to gene…

    Submitted 24 January, 2023; originally announced January 2023.

  12. arXiv:2301.01381  [pdf, other]

    stat.ME math.ST stat.ML

    Testing High-dimensional Multinomials with Applications to Text Analysis

    Authors: T. Tony Cai, Zheng Tracy Ke, Paxton Turner

    Abstract: Motivated by applications in text mining and discrete distribution inference, we investigate the testing for equality of probability mass functions of $K$ groups of high-dimensional multinomial distributions. A test statistic, which is shown to have an asymptotic standard normal distribution under the null, is proposed. The optimal detection boundary is established, and the proposed test is shown…

    Submitted 24 November, 2023; v1 submitted 3 January, 2023; originally announced January 2023.

  13. arXiv:2211.12612  [pdf, ps, other]

    stat.ML cs.LG math.ST

    Transfer Learning for Contextual Multi-armed Bandits

    Authors: Changxiao Cai, T. Tony Cai, Hongzhe Li

    Abstract: Motivated by a range of applications, we study in this paper the problem of transfer learning for nonparametric contextual multi-armed bandits under the covariate shift model, where we have data collected on source bandits before the start of the target bandit learning. The minimax rate of convergence for the cumulative regret is established and a novel transfer learning algorithm that attains the…

    Submitted 24 January, 2024; v1 submitted 22 November, 2022; originally announced November 2022.

    Comments: Accepted to the Annals of Statistics

  14. arXiv:2203.11461  [pdf, other]

    stat.ME stat.ML

    Locally Adaptive Algorithms for Multiple Testing with Network Structure, with Application to Genome-Wide Association Studies

    Authors: Ziyi Liang, T. Tony Cai, Wenguang Sun, Yin Xia

    Abstract: Linkage analysis has provided valuable insights to the GWAS studies, particularly in revealing that SNPs in linkage disequilibrium (LD) can jointly influence disease phenotypes. However, the potential of LD network data has often been overlooked or underutilized in the literature. In this paper, we propose a locally adaptive structure learning algorithm (LASLA) that provides a principled and gener…

    Submitted 16 August, 2023; v1 submitted 22 March, 2022; originally announced March 2022.

    Comments: 33 pages, 7 figures

  15. arXiv:2202.10007  [pdf, other]

    stat.ME math.ST stat.AP

    Statistical Inference for Genetic Relatedness Based on High-Dimensional Logistic Regression

    Authors: Rong Ma, Zijian Guo, T. Tony Cai, Hongzhe Li

    Abstract: This paper studies the problem of statistical inference for genetic relatedness between binary traits based on individual-level genome-wide association data. Specifically, under the high-dimensional logistic regression models, we define parameters characterizing the cross-trait genetic correlation, the genetic covariance and the trait-specific genetic variance. A novel weighted debiasing method is…

    Submitted 5 October, 2022; v1 submitted 21 February, 2022; originally announced February 2022.

  16. arXiv:2201.06438  [pdf, other]

    math.ST stat.ME stat.ML

    Matrix Reordering for Noisy Disordered Matrices: Optimality and Computationally Efficient Algorithms

    Authors: T. Tony Cai, Rong Ma

    Abstract: Motivated by applications in single-cell biology and metagenomics, we investigate the problem of matrix reordering based on a noisy disordered monotone Toeplitz matrix model. We establish the fundamental statistical limit for this problem in a decision-theoretic framework and demonstrate that a constrained least squares estimator achieves the optimal rate. However, due to its computational complex…

    Submitted 13 August, 2023; v1 submitted 17 January, 2022; originally announced January 2022.

    Comments: accepted by IEEE Transactions on Information Theory

  17. arXiv:2201.03727  [pdf, ps, other]

    stat.ME math.ST

    Estimation and Inference with Proxy Data and its Genetic Applications

    Authors: Sai Li, T. Tony Cai, Hongzhe Li

    Abstract: Existing high-dimensional statistical methods are largely established for analyzing individual-level data. In this work, we study estimation and inference for high-dimensional linear models where we only observe "proxy data", which include the marginal statistics and sample covariance matrix that are computed based on different sets of individuals. We develop a rate optimal method for estimation a…

    Submitted 10 January, 2022; originally announced January 2022.

  18. arXiv:2109.03365  [pdf, other]

    stat.CO stat.ME stat.OT

    SIHR: Statistical Inference in High-Dimensional Linear and Logistic Regression Models

    Authors: Prabrisha Rakshit, Zhenyu Wang, T. Tony Cai, Zijian Guo

    Abstract: We introduce the R package SIHR for statistical inference in high-dimensional generalized linear models with continuous and binary outcomes. The package provides functionalities for constructing confidence intervals and performing hypothesis tests for low-dimensional objectives in both one-sample and two-sample regression settings. We illustrate the usage of SIHR through numeri…

    Submitted 1 May, 2023; v1 submitted 7 September, 2021; originally announced September 2021.

  19. arXiv:2107.00179  [pdf]

    math.ST cs.DC cs.LG stat.ML

    Distributed Nonparametric Function Estimation: Optimal Rate of Convergence and Cost of Adaptation

    Authors: T. Tony Cai, Hongji Wei

    Abstract: Distributed minimax estimation and distributed adaptive estimation under communication constraints for Gaussian sequence model and white noise model are studied. The minimax rate of convergence for distributed estimation over a given Besov class, which serves as a benchmark for the cost of adaptation, is established. We then quantify the exact communication cost for adaptation and construct an opt…

    Submitted 30 June, 2021; originally announced July 2021.

    MSC Class: 62F30

  20. arXiv:2105.07536  [pdf, other]

    stat.ML cs.LG math.ST

    Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data

    Authors: T. Tony Cai, Rong Ma

    Abstract: This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations…

    Submitted 31 October, 2022; v1 submitted 16 May, 2021; originally announced May 2021.

    Comments: Accepted by Journal of Machine Learning Research

  21. arXiv:2011.03900  [pdf, other]

    stat.ML cs.CR cs.LG math.ST stat.ME

    The Cost of Privacy in Generalized Linear Models: Algorithms and Minimax Lower Bounds

    Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

    Abstract: We propose differentially private algorithms for parameter estimation in both low-dimensional and high-dimensional sparse generalized linear models (GLMs) by constructing private versions of projected gradient descent. We show that the proposed algorithms are nearly rate-optimal by characterizing their statistical performance and establishing privacy-constrained minimax lower bounds for GLMs. The…

    Submitted 5 December, 2020; v1 submitted 7 November, 2020; originally announced November 2020.

    Comments: 56 pages, 6 figures

  22. arXiv:2011.03598  [pdf, other]

    stat.ME stat.ML

    Estimation, Confidence Intervals, and Large-Scale Hypotheses Testing for High-Dimensional Mixed Linear Regression

    Authors: Linjun Zhang, Rong Ma, T. Tony Cai, Hongzhe Li

    Abstract: This paper studies the high-dimensional mixed linear regression (MLR) where the output variable comes from one of the two linear regression models with an unknown mixing proportion and an unknown covariance structure of the random covariates. Building upon a high-dimensional EM algorithm, we propose an iterative procedure for estimating the two regression vectors and establish their rates of conve…

    Submitted 6 November, 2020; originally announced November 2020.

  23. arXiv:2010.11037  [pdf, ps, other]

    stat.ME stat.ML

    Transfer Learning in Large-scale Gaussian Graphical Models with False Discovery Rate Control

    Authors: Sai Li, T. Tony Cai, Hongzhe Li

    Abstract: Transfer learning for high-dimensional Gaussian graphical models (GGMs) is studied with the goal of estimating the target GGM by utilizing the data from similar and related auxiliary studies. The similarity between the target graph and each auxiliary graph is characterized by the sparsity of a divergence matrix. An estimation algorithm, Trans-CLIME, is proposed and shown to attain a faster converg…

    Submitted 21 October, 2020; originally announced October 2020.

  24. arXiv:2010.06682  [pdf, other]

    cs.CV cs.LG eess.IV

    Are all negatives created equal in contrastive instance discrimination?

    Authors: Tiffany Tianhui Cai, Jonathan Frankle, David J. Schwab, Ari S. Morcos

    Abstract: Self-supervised learning has recently begun to rival supervised learning on computer vision tasks. Many of the recent approaches have been based on contrastive instance discrimination (CID), in which the network is trained to recognize two augmented versions of the same instance (a query and positive) while discriminating against a pool of other instances (negatives). The learned representation is…

    Submitted 25 October, 2020; v1 submitted 13 October, 2020; originally announced October 2020.

    Comments: Fixed author name error

  25. arXiv:2008.12434  [pdf, ps, other]

    math.ST math.PR

    On the Non-Asymptotic Concentration of Heteroskedastic Wishart-type Matrix

    Authors: T. Tony Cai, Rungang Han, Anru R. Zhang

    Abstract: This paper focuses on the non-asymptotic concentration of the heteroskedastic Wishart-type matrices. Suppose $Z$ is a $p_1$-by-$p_2$ random matrix and $Z_{ij} \sim N(0,σ_{ij}^2)$ independently, we prove the expected spectral norm of Wishart matrix deviations (i.e., $\mathbb{E} \left\|ZZ^\top - \mathbb{E} ZZ^\top\right\|$) is upper bounded by \begin{equation*} \begin{split} (1+ε)\left\{2σ_Cσ_R…

    Submitted 16 February, 2022; v1 submitted 27 August, 2020; originally announced August 2020.

    Comments: Electronic Journal of Probability, to appear

  26. arXiv:2006.10593  [pdf, ps, other]

    stat.ME stat.ML

    Transfer Learning for High-dimensional Linear Regression: Prediction, Estimation, and Minimax Optimality

    Authors: Sai Li, T. Tony Cai, Hongzhe Li

    Abstract: This paper considers the estimation and prediction of a high-dimensional linear regression in the setting of transfer learning, using samples from the target model as well as auxiliary samples from different but possibly related regression models. When the set of "informative" auxiliary samples is known, an estimator and a predictor are proposed and their optimality is established. The optimal rat…

    Submitted 18 June, 2020; originally announced June 2020.

  27. arXiv:2006.01393  [pdf, other]

    stat.ME stat.AP

    Two Robust Tools for Inference about Causal Effects with Invalid Instruments

    Authors: Hyunseung Kang, Youjin Lee, T. Tony Cai, Dylan S. Small

    Abstract: Instrumental variables have been widely used to estimate the causal effect of a treatment on an outcome. Existing confidence intervals for causal effects based on instrumental variables assume that all of the putative instrumental variables are valid; a valid instrumental variable is a variable that affects the outcome only by affecting the treatment and is not related to unmeasured confounders. H…

    Submitted 2 June, 2020; originally announced June 2020.

  28. arXiv:2002.07624  [pdf, other]

    math.ST stat.ML

    Optimal Structured Principal Subspace Estimation: Metric Entropy and Minimax Rates

    Authors: T. Tony Cai, Hongzhe Li, Rong Ma

    Abstract: Driven by a wide range of applications, many principal subspace estimation problems have been studied individually under different structural constraints. This paper presents a unified framework for the statistical analysis of a general structured principal subspace estimation problem which includes as special cases non-negative PCA/SVD, sparse PCA/SVD, subspace constrained PCA/SVD, and spectral c…

    Submitted 16 November, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

  29. arXiv:2001.08877  [pdf, other]

    math.ST cs.DC cs.IT cs.LG stat.ML

    Distributed Gaussian Mean Estimation under Communication Constraints: Optimal Rates and Communication-Efficient Algorithms

    Authors: T. Tony Cai, Hongji Wei

    Abstract: We study distributed estimation of a Gaussian mean under communication constraints in a decision theoretical framework. Minimax rates of convergence, which characterize the tradeoff between the communication costs and statistical accuracy, are established in both the univariate and multivariate settings. Communication-efficient and statistically optimal procedures are developed. In the univariate…

    Submitted 23 January, 2020; originally announced January 2020.

  30. arXiv:1912.02872  [pdf, ps, other]

    stat.ME

    A Convex Optimization Approach to High-Dimensional Sparse Quadratic Discriminant Analysis

    Authors: T. Tony Cai, Linjun Zhang

    Abstract: In this paper, we study high-dimensional sparse Quadratic Discriminant Analysis (QDA) and aim to establish the optimal convergence rates for the classification error. Minimax lower bounds are established to demonstrate the necessity of structural assumptions such as sparsity conditions on the discriminating direction and differential graph for the possible construction of consistent high-dimension…

    Submitted 5 December, 2019; originally announced December 2019.

  31. Optimal Estimation of Bacterial Growth Rates Based on Permuted Monotone Matrix

    Authors: Rong Ma, T. Tony Cai, Hongzhe Li

    Abstract: Motivated by the problem of estimating the bacterial growth rates for genome assemblies from shotgun metagenomic data, we consider the permuted monotone matrix model $Y=ΘΠ+Z$, where $Y\in \mathbb{R}^{n\times p}$ is observed, $Θ\in \mathbb{R}^{n\times p}$ is an unknown approximately rank-one signal matrix with monotone rows, $Π\in \mathbb{R}^{p\times p}$ is an unknown permutation matrix, and…

    Submitted 26 August, 2020; v1 submitted 27 November, 2019; originally announced November 2019.

    Journal ref: Biometrika (2020)

  32. arXiv:1911.11345  [pdf, other]

    stat.ME math.ST stat.ML

    High Dimensional M-Estimation with Missing Outcomes: A Semi-Parametric Framework

    Authors: Abhishek Chakrabortty, Jiarui Lu, T. Tony Cai, Hongzhe Li

    Abstract: We consider high dimensional $M$-estimation in settings where the response $Y$ is possibly missing at random and the covariates $\mathbf{X} \in \mathbb{R}^p$ can be high dimensional compared to the sample size $n$. The parameter of interest $\boldsymbolθ_0 \in \mathbb{R}^d$ is defined as the minimizer of the risk of a convex loss, under a fully non-parametric model, and $\boldsymbolθ_0$ itself is…

    Submitted 26 November, 2019; originally announced November 2019.

    Comments: 34 pages, 4 tables (Supplement: 58 pages, 10 tables)

  33. Optimal Permutation Recovery in Permuted Monotone Matrix Model

    Authors: Rong Ma, T. Tony Cai, Hongzhe Li

    Abstract: Motivated by recent research on quantifying bacterial growth dynamics based on genome assemblies, we consider a permuted monotone matrix model $Y=ΘΠ+Z$, where the rows represent different samples, the columns represent contigs in genome assemblies and the elements represent log-read counts after preprocessing steps and Guanine-Cytosine (GC) adjustment. In this model, $Θ$ is an unknown mean matrix…

    Submitted 13 July, 2020; v1 submitted 24 November, 2019; originally announced November 2019.

    Journal ref: Journal of the American Statistical Association, 2020

  34. arXiv:1909.09851  [pdf, other]

    math.ST cs.LG stat.ML

    Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference

    Authors: T. Tony Cai, Anru R. Zhang, Yuchen Zhou

    Abstract: We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model -- an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are establishe…

    Submitted 6 May, 2022; v1 submitted 21 September, 2019; originally announced September 2019.

    Comments: IEEE Transactions on Information Theory, to appear

  35. arXiv:1909.01503  [pdf, other]

    stat.ME

    Group Inference in High Dimensions with Applications to Hierarchical Testing

    Authors: Zijian Guo, Claude Renaux, Peter Bühlmann, T. Tony Cai

    Abstract: High-dimensional group inference is an essential part of statistical methods for analysing complex data sets, including hierarchical testing, tests of interaction, detection of heterogeneous treatment effects and inference for local heritability. Group inference in regression models can be measured with respect to a weighted quadratic functional of the regression sub-vector corresponding to the gr…

    Submitted 30 November, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

  36. arXiv:1907.06116  [pdf, ps, other]

    stat.ME

    Inference for high-dimensional linear mixed-effects models: A quasi-likelihood approach

    Authors: Sai Li, Tony T. Cai, Hongzhe Li

    Abstract: Linear mixed-effects models are widely used in analyzing clustered or repeated measures data. We propose a quasi-likelihood approach for estimation and inference of the unknown parameters in linear mixed-effects models with high-dimensional fixed effects. The proposed method is applicable to general settings where the dimension of the random effects and the cluster sizes are possibly large. Regard…

    Submitted 9 March, 2021; v1 submitted 13 July, 2019; originally announced July 2019.

    Comments: 32 pages, 3 figures

    MSC Class: 62H15; 62J07

  37. arXiv:1906.02903  [pdf, other]

    math.ST cs.LG stat.ME stat.ML

    Transfer Learning for Nonparametric Classification: Minimax Rate and Adaptive Classifier

    Authors: T. Tony Cai, Hongji Wei

    Abstract: Human learners have the natural ability to use knowledge gained in one setting for learning in a different but related setting. This ability to transfer knowledge from one task to another is essential for effective learning. In this paper, we study transfer learning in the context of nonparametric classification based on observations from different distributions under the posterior drift model, wh…

    Submitted 7 June, 2019; originally announced June 2019.

  38. arXiv:1905.08757  [pdf, other]

    math.ST math.PR

    Asymptotic Analysis for Extreme Eigenvalues of Principal Minors of Random Matrices

    Authors: T. Tony Cai, Tiefeng Jiang, Xiaoou Li

    Abstract: Consider a standard white Wishart matrix with parameters $n$ and $p$. Motivated by applications in high-dimensional statistics and signal processing, we perform asymptotic analysis on the maxima and minima of the eigenvalues of all the $m \times m$ principal minors, under the asymptotic regime that $n,p,m$ go to infinity. Asymptotic results concerning extreme eigenvalues of principal minors of rea…

    Submitted 21 May, 2019; originally announced May 2019.

  39. arXiv:1902.04495  [pdf, other]

    stat.ML cs.CR cs.DS cs.LG

    The Cost of Privacy: Optimal Rates of Convergence for Parameter Estimation with Differential Privacy

    Authors: T. Tony Cai, Yichen Wang, Linjun Zhang

    Abstract: Privacy-preserving data analysis is a rising challenge in contemporary statistics, as the privacy guarantees of statistical methods are often achieved at the expense of accuracy. In this paper, we investigate the tradeoff between statistical accuracy and privacy in mean estimation and linear regression, under both the classical low-dimensional and modern high-dimensional settings. A primary focus…

    Submitted 10 November, 2020; v1 submitted 12 February, 2019; originally announced February 2019.

    Comments: 33 pages, 4 figures

  40. arXiv:1810.08316  [pdf, other]

    math.ST stat.CO stat.ME stat.ML

    Heteroskedastic PCA: Algorithm, Optimality, and Applications

    Authors: Anru R. Zhang, T. Tony Cai, Yihong Wu

    Abstract: A general framework for principal component analysis (PCA) in the presence of heteroskedastic noise is introduced. We propose an algorithm called HeteroPCA, which involves iteratively imputing the diagonal entries of the sample covariance matrix to remove estimation bias due to heteroskedasticity. This procedure is computationally efficient and provably optimal under the generalized spiked covaria…

    Submitted 1 April, 2021; v1 submitted 18 October, 2018; originally announced October 2018.

  41. arXiv:1806.06179  [pdf, other]

    stat.ME math.ST

    Semi-supervised Inference for Explained Variance in High-dimensional Linear Regression and Its Applications

    Authors: T. Tony Cai, Zijian Guo

    Abstract: This paper considers statistical inference for the explained variance $β^{\intercal}Σβ$ under the high-dimensional linear model $Y=Xβ+ε$ in the semi-supervised setting, where $β$ is the regression vector and $Σ$ is the design covariance matrix. A calibrated estimator, which efficiently integrates both labelled and unlabelled data, is proposed. It is shown that the estimator achieves the minimax op…

    Submitted 30 November, 2020; v1 submitted 16 June, 2018; originally announced June 2018.

  42. Global and Simultaneous Hypothesis Testing for High-Dimensional Logistic Regression Models

    Authors: Rong Ma, T. Tony Cai, Hongzhe Li

    Abstract: High-dimensional logistic regression is widely used in analyzing data with binary outcomes. In this paper, global testing and large-scale multiple testing for the regression coefficients are considered in both single- and two-regression settings. A test statistic for testing the global null hypothesis is constructed using a generalized low-dimensional projection for bias correction and its asympto…

    Submitted 19 November, 2020; v1 submitted 17 May, 2018; originally announced May 2018.

    Comments: Typos corrected

    Journal ref: Journal of the American Statistical Association (2019)

  43. arXiv:1804.03018  [pdf, other]

    stat.ME

    High-dimensional Linear Discriminant Analysis: Optimality, Adaptive Algorithm, and Missing Data

    Authors: T. Tony Cai, Linjun Zhang

    Abstract: This paper aims to develop an optimality theory for linear discriminant analysis in the high-dimensional setting. A data-driven and tuning free classification rule, which is based on an adaptive constrained $\ell_1$ minimization approach, is proposed and analyzed. Minimax lower bounds are obtained and this classification rule is shown to be simultaneously rate optimal over a collection of paramete…

    Submitted 9 April, 2018; originally announced April 2018.

  44. Optimal Estimation of Simultaneous Signals Using Absolute Inner Product with Applications to Integrative Genomics

    Authors: Rong Ma, T. Tony Cai, Hongzhe Li

    Abstract: Integrating the summary statistics from genome-wide association study (GWAS) and expression quantitative trait loci (eQTL) data provides a powerful way of identifying the genes whose expression levels are potentially associated with complex diseases. A parameter called $T$-score that quantifies the genetic overlap between a gene and the disease phenotype based on the summary stat…

    Submitted 4 October, 2020; v1 submitted 24 January, 2018; originally announced January 2018.

    Journal ref: Statistica Sinica (2020)

  45. arXiv:1801.00518  [pdf, ps, other]

    math.ST cs.IT

    Statistical and Computational Limits for Sparse Matrix Detection

    Authors: T. Tony Cai, Yihong Wu

    Abstract: This paper investigates the fundamental limits for detecting a high-dimensional sparse matrix contaminated by white Gaussian noise from both the statistical and computational perspectives. We consider $p\times p$ matrices whose rows and columns are individually $k$-sparse. We provide a tight characterization of the statistical and computational limits for sparse matrix detection, which precisely d…

    Submitted 1 January, 2018; originally announced January 2018.

  46. arXiv:1709.03907  [pdf, other]

    math.ST stat.ML

    Weighted Message Passing and Minimum Energy Flow for Heterogeneous Stochastic Block Models with Side Information

    Authors: T. Tony Cai, Tengyuan Liang, Alexander Rakhlin

    Abstract: We study the misclassification error for community detection in general heterogeneous stochastic block models (SBM) with noisy or partial label information. We establish a connection between the misclassification rate and the notion of minimum energy on the local neighborhood of the SBM. We develop an optimally weighted message passing algorithm to reconstruct labels for SBM based on the minimum e…

    Submitted 12 September, 2017; originally announced September 2017.

    Comments: 31 pages, 1 figure

    Journal ref: Journal of Machine Learning Research 21 (2020) 1-34

  47. arXiv:1609.06713  [pdf, other]

    math.ST

    Testing Endogeneity with High Dimensional Covariates

    Authors: Zijian Guo, Hyunseung Kang, T. Tony Cai, Dylan S. Small

    Abstract: Modern, high dimensional data has renewed investigation on instrumental variables (IV) analysis, primarily focusing on estimation of effects of endogenous variables and putting little attention towards specification tests. This paper studies in high dimensions the Durbin-Wu-Hausman (DWH) test, a popular specification test for endogeneity in IV regression. We show, surprisingly, that the DWH test m…

    Submitted 7 March, 2018; v1 submitted 21 September, 2016; originally announced September 2016.

  48. arXiv:1606.07268  [pdf, other]

    stat.ME math.ST stat.ML

    Semi-supervised Inference: General Theory and Estimation of Means

    Authors: Anru Zhang, Lawrence D. Brown, T. Tony Cai

    Abstract: We propose a general semi-supervised inference framework focused on the estimation of the population mean. As usual in semi-supervised settings, there exists an unlabeled sample of covariate vectors and a labeled sample consisting of covariate vectors along with real-valued responses ("labels"). Otherwise, the formulation is "assumption-lean" in that no major conditions are imposed on the statisti…

    Submitted 13 August, 2018; v1 submitted 23 June, 2016; originally announced June 2016.

  49. arXiv:1605.07244  [pdf, other]

    stat.ME math.ST

    Optimal Estimation of Co-heritability in High-dimensional Linear Models

    Authors: Zijian Guo, Wanjie Wang, T. Tony Cai, Hongzhe Li

    Abstract: Co-heritability is an important concept that characterizes the genetic associations within pairs of quantitative traits. There has been significant recent interest in estimating the co-heritability based on data from the genome-wide association studies (GWAS). This paper introduces two measures of co-heritability in the high-dimensional linear model framework, including the inner product of the tw…

    Submitted 23 May, 2016; originally announced May 2016.

  50. arXiv:1605.04358  [pdf, other]

    stat.ME math.ST

    Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data

    Authors: T. Tony Cai, Anru Zhang

    Abstract: Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are propo…

    Submitted 13 May, 2016; originally announced May 2016.