-
Sketched Gaussian Model Linear Discriminant Analysis via the Randomized Kaczmarz Method
Authors:
Jocelyn T. Chi,
Deanna Needell
Abstract:
We present sketched linear discriminant analysis, an iterative randomized approach to binary-class Gaussian model linear discriminant analysis (LDA) for very large data. We harness a least squares formulation and mobilize the stochastic gradient descent framework. Therefore, we obtain a randomized classifier with performance that is very comparable to that of full data LDA while requiring access t…
▽ More
We present sketched linear discriminant analysis, an iterative randomized approach to binary-class Gaussian model linear discriminant analysis (LDA) for very large data. We harness a least squares formulation and mobilize the stochastic gradient descent framework. Therefore, we obtain a randomized classifier with performance that is very comparable to that of full data LDA while requiring access to only one row of the training data at a time. We present convergence guarantees for the sketched predictions on new data within a fixed number of iterations. These guarantees account for both the Gaussian modeling assumptions on the data and algorithmic randomness from the sketching procedure. Finally, we demonstrate performance with varying step-sizes and numbers of iterations. Our numerical experiments demonstrate that sketched LDA can offer a very viable alternative to full data LDA when the data may be too large for full data analysis.
△ Less
Submitted 10 November, 2022;
originally announced November 2022.
-
Population-Based Hierarchical Non-negative Matrix Factorization for Survey Data
Authors:
Xiaofu Ding,
Xinyu Dong,
Olivia McGough,
Chenxin Shen,
Annie Ulichney,
Ruiyao Xu,
William Swartworth,
Jocelyn T. Chi,
Deanna Needell
Abstract:
Motivated by the problem of identifying potential hierarchical population structure on modern survey data containing a wide range of complex data types, we introduce population-based hierarchical non-negative matrix factorization (PHNMF). PHNMF is a variant of hierarchical non-negative matrix factorization based on feature similarity. As such, it enables an automatic and interpretable approach for…
▽ More
Motivated by the problem of identifying potential hierarchical population structure on modern survey data containing a wide range of complex data types, we introduce population-based hierarchical non-negative matrix factorization (PHNMF). PHNMF is a variant of hierarchical non-negative matrix factorization based on feature similarity. As such, it enables an automatic and interpretable approach for identifying and understanding hierarchical structure in a data matrix constructed from a wide range of data types. Our numerical experiments on synthetic and real survey data demonstrate that PHNMF can recover latent hierarchical population structure in complex data with high accuracy. Moreover, the recovered subpopulation structure is meaningful and can be useful for improving downstream inference.
△ Less
Submitted 11 September, 2022;
originally announced September 2022.
-
SEAGLE: A Scalable Exact Algorithm for Large-Scale Set-Based GxE Tests in Biobank Data
Authors:
Jocelyn T. Chi,
Ilse C. F. Ipsen,
Tzu-Hung Hsiao,
Ching-Heng Lin,
Li-San Wang,
Wan-** Lee,
Tzu-Pin Lu,
Jung-Ying Tzeng
Abstract:
The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection in genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, which are…
▽ More
The explosion of biobank data offers immediate opportunities for gene-environment (GxE) interaction studies of complex diseases because of the large sample sizes and the rich collection in genetic and non-genetic information. However, the extremely large sample size also introduces new computational challenges in GxE assessment, especially for set-based GxE variance component (VC) tests, which are a widely used strategy to boost overall GxE signals and to evaluate the joint GxE effect of multiple variants from a biologically meaningful unit (e.g., gene). In this work, we focus on continuous traits and present SEAGLE, a Scalable Exact AlGorithm for Large-scale set-based GxE tests, to permit GxE VC tests for biobank-scale data. SEAGLE employs modern matrix computations to achieve the same "exact" results as the original GxE VC tests without imposing additional assumptions or relying on approximations. SEAGLE can easily accommodate sample sizes in the order of $10^5$, is implementable on standard laptops, and does not require specialized computing equipment. We demonstrate SEAGLE's performance through extensive simulations. We illustrate its utility by conducting genome-wide gene-based GxE analysis on the Taiwan Biobank data to explore the interaction of gene and physical activity status on body mass index.
△ Less
Submitted 14 May, 2021; v1 submitted 7 May, 2021;
originally announced May 2021.
-
A User-Friendly Computational Framework for Robust Structured Regression with the L$_2$ Criterion
Authors:
Jocelyn T. Chi,
Eric C. Chi
Abstract:
We introduce a user-friendly computational framework for implementing robust versions of a wide variety of structured regression methods with the L$_{2}$ criterion. In addition to introducing an algorithm for performing L$_{2}$E regression, our framework enables robust regression with the L$_{2}$ criterion for additional structural constraints, works without requiring complex tuning procedures on…
▽ More
We introduce a user-friendly computational framework for implementing robust versions of a wide variety of structured regression methods with the L$_{2}$ criterion. In addition to introducing an algorithm for performing L$_{2}$E regression, our framework enables robust regression with the L$_{2}$ criterion for additional structural constraints, works without requiring complex tuning procedures on the precision parameter, can be used to identify heterogeneous subpopulations, and can incorporate readily available non-robust structured regression solvers. We provide convergence guarantees for the framework and demonstrate its flexibility with some examples. Supplementary materials for this article are available online.
△ Less
Submitted 13 September, 2021; v1 submitted 8 October, 2020;
originally announced October 2020.
-
Multiplicative Perturbation Bounds for Multivariate Multiple Linear Regression in Schatten $p$-Norms
Authors:
Jocelyn T. Chi,
Ilse C. F. Ipsen
Abstract:
Multivariate multiple linear regression (MMLR), which occurs in a number of practical applications, generalizes traditional least squares (multivariate linear regression) to multiple right-hand sides. We extend recent MLR analyses to sketched MMLR in general Schatten $p$-norms by interpreting the sketched problem as a multiplicative perturbation. Our work represents an extension of Maher's results…
▽ More
Multivariate multiple linear regression (MMLR), which occurs in a number of practical applications, generalizes traditional least squares (multivariate linear regression) to multiple right-hand sides. We extend recent MLR analyses to sketched MMLR in general Schatten $p$-norms by interpreting the sketched problem as a multiplicative perturbation. Our work represents an extension of Maher's results on Schatten $p$-norms. We derive expressions for the exact and perturbed solutions in terms of projectors for easy geometric interpretation. We also present a geometric interpretation of the action of the sketching matrix in terms of relevant subspaces. We show that a key term in assessing the accuracy of the sketched MMLR solution can be viewed as a tangent of a largest principal angle between subspaces under some assumptions. Our results enable additional interpretation of the difference between an orthogonal and oblique projector with the same range.
△ Less
Submitted 12 July, 2020;
originally announced July 2020.
-
A Projector-Based Approach to Quantifying Total and Excess Uncertainties for Sketched Linear Regression
Authors:
Jocelyn T. Chi,
Ilse C. F. Ipsen
Abstract:
Linear regression is a classic method of data analysis. In recent years, sketching -- a method of dimension reduction using random sampling, random projections, or both -- has gained popularity as an effective computational approximation when the number of observations greatly exceeds the number of variables. In this paper, we address the following question: How does sketching affect the statistic…
▽ More
Linear regression is a classic method of data analysis. In recent years, sketching -- a method of dimension reduction using random sampling, random projections, or both -- has gained popularity as an effective computational approximation when the number of observations greatly exceeds the number of variables. In this paper, we address the following question: How does sketching affect the statistical properties of the solution and key quantities derived from it?
To answer this question, we present a projector-based approach to sketched linear regression that is exact and that requires minimal assumptions on the sketching matrix. Therefore, downstream analyses hold exactly and generally for all sketching schemes. Additionally, a projector-based approach enables derivation of key quantities from classic linear regression that account for the combined model- and algorithm-induced uncertainties. We demonstrate the usefulness of a projector-based approach in quantifying and enabling insight on excess uncertainties and bias-variance decompositions for sketched linear regression. Finally, we demonstrate how the insights from our projector-based analyses can be used to produce practical sketching diagnostics to aid the design of judicious sketching schemes.
△ Less
Submitted 3 August, 2020; v1 submitted 17 August, 2018;
originally announced August 2018.
-
$k$-POD: A Method for $k$-Means Clustering of Missing Data
Authors:
Jocelyn T. Chi,
Eric C. Chi,
Richard G. Baraniuk
Abstract:
The $k$-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, is common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our $k$-POD method presents a simp…
▽ More
The $k$-means algorithm is often used in clustering applications but its usage requires a complete data matrix. Missing data, however, is common in many applications. Mainstream approaches to clustering missing data reduce the missing data problem to a complete data formulation through either deletion or imputation but these solutions may incur significant costs. Our $k$-POD method presents a simple extension of $k$-means clustering for missing data that works even when the missingness mechanism is unknown, when external information is unavailable, and when there is significant missingness in the data.
△ Less
Submitted 27 January, 2016; v1 submitted 25 November, 2014;
originally announced November 2014.