High-dimensional covariance regression with application to co-expression QTL detection
Rakheon Kim
Department of Statistical Science, Baylor University
and
Jingfei Zhang
Goizueta Business School, Emory University
Abstract
While covariance matrices have been widely studied in many scientific fields, relatively limited progress has been made on estimating conditional covariances that permit a large covariance matrix to vary with high-dimensional subject-level covariates. In this paper, we present a new sparse multivariate regression framework that models the covariance matrix as a function of subject-level covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can be used to determine if and how gene co-expressions vary with genetic variations. To accommodate high-dimensional responses and covariates, we stipulate a combined sparsity structure that encourages covariates with non-zero effects and edges that are modulated by these covariates to be simultaneously sparse. We approach parameter estimation with a blockwise coordinate descent algorithm, and investigate the convergence rate of the estimated parameters. In addition, we propose a computationally efficient debiased inference procedure for uncertainty quantification. The efficacy of the proposed method is demonstrated through numerical experiments and an application to a gene co-expression network study with brain cancer patients.
Keywords: Covariance regression; subject-specific covariance matrix; multivariate linear regression; sparse group lasso; co-expression QTL.
1 Introduction
A covariance matrix measures the associations amongst a set of variables, and its estimation and analysis play an important role in a wide range of applications, such as genetics (Butte et al., 2000; Su et al., 2023), neuroscience (Zhang et al., 2020, 2023), finance (El Karoui et al., 2010; Xue et al., 2012) and climatology (Bickel et al., 2008a).
For example, in genetics, the covariance matrix estimated from gene expressions across different biological samples, often referred to as a co-expression network, is routinely used in identifying functional gene modules and dysregulated pathways in disease (Langfelder and Horvath, 2008; Su et al., 2023).
Although most co-expression analyses to date assume a common covariance matrix for different subjects, the structure and degree of covariance may depend on an individual’s characteristics such as age, sex and genotype, which are referred to as individual-level covariates or covariates in this paper when there is no ambiguity. For example, it is known that co-expressions among genes can be affected by individual genetic variants, clinical and environmental factors (Van Der Wijst et al., 2018). In particular, a genetic variant that affects co-expressions between a pair of genes is termed a co-expression quantitative trait locus (QTL). Identifying co-expression QTLs is of great scientific interest and can be crucial in developing gene therapies that target specific gene or pathway disruptions (Van Der Wijst et al., 2018; Zhang and Zhao, 2023).
Although the literature on estimating large covariance matrices is steadily increasing (Wu and Pourahmadi, 2003; Huang et al., 2006; Bickel et al., 2008a, b; Rothman et al., 2009; Lam and Fan, 2009; Bien and Tibshirani, 2011, and others), the majority of existing methods assume a homogeneous population obeying a common covariance model.
Some others have considered modeling covariate-dependent covariance matrices.
For example, Anderson (1973) modeled the covariance matrix as a linear combination of a given set of symmetric matrices; Chiu et al. (1996) modeled elements in the logarithm of , denoted as , as a linear function of covariates .
As noted by the authors, parameter interpretation for this model can be difficult, as a submatrix of is not generally the matrix exponential of the same submatrix of , and so the entries in do not directly relate to the corresponding entries in .
Pourahmadi (1999) modeled elements of the Cholesky decomposition of as linear functions of , though this model is not invariant to the reorderings of response variables.
Zou et al. (2017) related to a linear combination of similarity matrices of covariates. However, the covariates considered in this work are variable-specific and not individual-specific (e.g., covariates of genes but not of individuals). As such, the estimated covariance could not account for individual-level heterogeneity due to clinical covariates and genotypes.
Notably, Hoff and Niu (2012) proposed to model as a quadratic function of covariates written as , , which also admits a nice random-effects model representation; model estimation is carried out using the expectation–maximization (EM) algorithm or a Markov chain Monte Carlo (MCMC) algorithm via Gibbs sampling. Their modeling framework is further extended in Fox and Dunson (2015) by considering non-linear effects, in Franks (2021) by considering high dimensional response variables and in Alakus et al. (2022) by considering random forests.
The computational costs of the above extended methods can be prohibitive when dimensions of the response variables and covariates are both high. Moreover, due to the quadratic form of , sparsity in parameter does not directly translate to sparse effects of covariates, possibly limiting model interpretability.
Zhao et al. (2021); Park (2023) studied a principal regression approach that models , where is an unknown rotation vector, as a generalized linear model of . Parameter interpretation for this model may not be straightforward, as elements in are not directly modeled as a function of . It is also challenging to further extend this approach to the high dimensional setting.
To flexibly model large covariance matrices modulated by individual-level covariates, we propose a covariance regression model that allows the structure and degree of covariance to vary with discrete and continuous covariates of high dimensions. Specifically, the covariance matrix is modeled as a linear function of covariates with matrix-valued coefficients, subject to constraints that ensure positive semi-definiteness. Our model does not need to make specific assumptions on the distribution of response variables, such as the Gaussian assumption imposed in Hoff and Niu (2012). Using the method of moments, we formulate coefficient estimation as a multivariate linear regression problem and impose a sparse group lasso penalty that simultaneously encourages effective covariates and their effects on the covariance matrix to be sparse. This combined sparsity assumption facilitates model estimability and interpretability, and is closely connected with multi-task learning (Argyriou et al., 2008). However, it brings substantial challenges to our theoretical analysis. In theory, we investigate the convergence rate of the proposed estimator, allowing both the response variables and covariates to be high-dimensional. Under our modeling framework, we further formulate a debiased inferential procedure inspired by the recent literature on debiasing lasso (Javanmard and Montanari, 2014; Zhang and Zhang, 2014; Cai et al., 2022) that can also handle non-Gaussian and heteroskedastic errors. We show that the coefficient matrices associated with covariates (plus intercept) can be debiased separately, a result that significantly reduces the computational cost. Although motivated by a biological application, our method provides a general framework for modeling covariance matrices with covariates and is broadly applicable to other scientific fields that involve covariance estimation.
The rest of the paper is organized as follows. Section 2 introduces the covariance regression model and Section 3 discusses the sparse multivariate regression estimation. Section 4 investigates theoretically the convergence rate of the proposed estimator and also proposes a debiased inferential procedure. Section 5 carries out comprehensive simulation studies and Section 6 conducts a co-expression QTL analysis using a brain cancer genomics data set. A short discussion section concludes the paper.
2 Covariance Regression Models
We start with some notation. Write . Given a vector , we use , and to denote the vector , and norms, respectively.
For a matrix , we let , and denote the matrix element-wise norm, Frobenius norm and element-wise max norm, respectively, and let represent the vectorization of the upper triangular part of and represent the concatenation of columns in . We use and to denote the smallest and largest eigenvalues of a matrix, respectively.
Given a vector of response variables denoted as , and a vector of covariates denoted as satisfying for , we assume that , where , , and
(1)
where is a symmetric and positive definite (PD) matrix of dimension and are symmetric matrices of dimension . Here, specifies the covariance at the population level and represents the effect of covariate on the covariance matrix.
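The display for model (1) is not reproduced above, so the following is a sketch of the structure the text describes: a covariance matrix that is linear in the covariates, with matrix-valued coefficients. The symbols $\Sigma(x)$, $B_0$, $B_h$, $x_h$, $p$ and $q$ are notation assumed here for illustration and may differ from the authors' own.

```latex
% Assumed notation: y is the p-vector of responses, x = (x_1, ..., x_q) the covariates.
\operatorname{Cov}(y \mid x) = \Sigma(x), \qquad
\Sigma(x) = B_0 + \sum_{h=1}^{q} x_h B_h ,
% B_0: p x p symmetric positive definite matrix (population-level covariance),
% B_h: p x p symmetric matrix (effect of covariate x_h on the covariance).
```

Under this reading, the $(j,k)$ entry of $\Sigma(x)$ is a linear function of $x$ with intercept $(B_0)_{jk}$ and slopes $(B_h)_{jk}$.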
Let be the eigendecomposition of and define matrices and such that and , leading to .
We assume that
(2)
where . This is a sufficient condition for a PD as
where the last inequality holds by the condition that .
When , we have and condition (2) reduces to , ensuring the variance to be positive regardless of the values of ’s.
This is a reasonable assumption. For instance, in our motivating data example, subjects with a specific genetic variant mutation may have a lower variance in gene expression than others. In this case, condition (2) implies that these subjects should still have a positive gene expression variance as is dominant.
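To make the role of such a condition concrete, the sketch below checks one sufficient condition for positive definiteness that is in the spirit of (2) but is not claimed to be its exact form: the smallest eigenvalue of the population-level matrix must exceed the summed magnitudes of the most negative eigenvalues of the covariate effects. The names `B0`, `Bs`, `covariance_at` and `sufficient_pd_check` are hypothetical, and covariates are assumed to lie in [0, 1].

```python
import numpy as np

def covariance_at(x, B0, Bs):
    """Evaluate Sigma(x) = B0 + sum_h x[h] * Bs[h] (assumed notation).

    x: (q,) covariates in [0, 1]; B0: (p, p); Bs: (q, p, p).
    """
    return B0 + np.tensordot(x, Bs, axes=1)

def sufficient_pd_check(B0, Bs):
    """Illustrative sufficient condition (an assumption, not the paper's (2)).

    If lambda_min(B0) > sum_h max(0, -lambda_min(B_h)), then by Weyl's
    inequality Sigma(x) is positive definite for every x in [0, 1]^q.
    """
    lam0 = np.linalg.eigvalsh(B0)[0]  # eigvalsh returns ascending eigenvalues
    negative_mass = sum(max(0.0, -np.linalg.eigvalsh(B)[0]) for B in Bs)
    return lam0 > negative_mass

# Usage: three covariate effects with spectral norm 0.5 and B0 = 2 * I.
rng = np.random.default_rng(0)
p, q = 4, 3
effects = []
for _ in range(q):
    A = rng.normal(size=(p, p))
    A = (A + A.T) / 2
    A *= 0.5 / np.abs(np.linalg.eigvalsh(A)).max()
    effects.append(A)
Bs_demo = np.stack(effects)
B0_demo = 2.0 * np.eye(p)
print(sufficient_pd_check(B0_demo, Bs_demo))
```

The check is conservative: a model can yield positive definite covariances for all admissible covariates without passing it.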
We note that requiring the covariates to be in is not restrictive. In our data example, the covariates are genetic variants, which are often coded as . If a covariate takes negative values, one can add a shift such that the shifted covariate is nonnegative; if a covariate can be greater than one, one can rescale the covariate such that the scaled covariate is less than one. Such a transformation of does not affect the interpretability of ; see details in Section 7.
To expose key ideas, we assume and are known in the ensuing development, and focus on the estimation of . Extensions with estimated and are straightforward, but with more involved notation.
With independent observations denoted as , we aim to estimate via (1). This is a challenging task, as even in the simple Gaussian case, the log likelihood function is
where denotes the trace of a matrix and .
Due to the sums involved in the matrix trace and inverse calculations, this log-likelihood is not convex or biconvex with respect to , and cannot be directly optimized using iterative algorithmic solutions such as the EM and coordinate descent algorithms. To overcome this challenge, we consider a moment-based approach that is highly efficient to implement and need not make distributional assumptions on ’s.
where and denotes the th entry of .
Letting be an matrix whose elements are
and
be the design matrix (including the intercept term), the observation in (3) facilitates the estimation of via the following multivariate linear regression,
(4)
where is the coefficient matrix whose elements are and is the matrix whose elements are .
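As a rough illustration of how the regression in (4) can be assembled, the sketch below builds a response matrix of pairwise products and a design matrix with an intercept column. It assumes the responses have mean zero (or have been centered), and the function names and the ordering of the upper-triangular pairs are choices made here, not taken from the paper.

```python
import numpy as np

def build_moment_regression(Y, X):
    """Assemble the multivariate regression implied by the moment identity.

    Y: (n, p) mean-zero responses; X: (n, q) covariates.
    Returns S of shape (n, p(p+1)/2) with entries y_ij * y_ik for j <= k,
    the design matrix W of shape (n, q+1) with a leading intercept column,
    and the (j, k) index arrays that label the columns of S.
    """
    n, p = Y.shape
    j_idx, k_idx = np.triu_indices(p)
    S = Y[:, j_idx] * Y[:, k_idx]
    W = np.column_stack([np.ones(n), X])
    return S, W, (j_idx, k_idx)

def dense_fit(S, W):
    """Unpenalized least squares, one regression task per (j, k) pair."""
    coef, *_ = np.linalg.lstsq(W, S, rcond=None)
    return coef  # (q+1, p(p+1)/2): row 0 is the intercept, row h is covariate h

# Usage on synthetic data.
rng = np.random.default_rng(0)
Y_demo = rng.normal(size=(20, 3))
X_demo = rng.uniform(size=(20, 2))
S_demo, W_demo, pairs_demo = build_moment_regression(Y_demo, X_demo)
B_demo = dense_fit(S_demo, W_demo)
```

With no covariates, the intercept row reduces to the column means of the products, that is, the usual sample second-moment matrix in vectorized form.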
When both and are large, to ensure the estimability and facilitate the interpretability of , we impose to be sparse. In particular, we assume is group sparse, corresponding to sparse effective covariates, that is, only a subset of the covariates may impact edges (termed effective covariates).
We further assume is element-wise sparse. That is, effective covariates may influence only a subset of the edges. These simultaneous sparsity assumptions are well supported by genetic studies (Gardner et al., 2003; Vierstra et al., 2020), and improve model interpretability when compared to using the group sparsity or element-wise sparsity alone. To encourage simultaneous sparsity, we consider the following penalty
(5)
where are as defined in the model (1) and are tuning parameters.
The term is a lasso penalty that encourages the effect of effective covariates to be sparse.
We exclude from the group sparse penalty (but not the element-wise sparse penalty), as it determines the population level network.
Also, the diagonal elements of are excluded from the element-wise sparse penalty to ensure the response variables have non-zero variances at the population level.
The term is a group lasso penalty (Yuan and Lin, 2006) that encourages the effective covariates to be sparse, achieved by regularizing across regression tasks from (3) simultaneously. Correspondingly, this penalty term facilitates a multi-task learning approach (Argyriou et al., 2008).
The penalty term in (5) is similar to the sparse group lasso considered in Simon et al. (2013); Li et al. (2015), though it is not exactly the same as some parameters are included in the element-wise sparsity penalty but not the group sparsity penalty. This adds additional complexity to the estimation procedure and theoretical analysis.
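The exclusions described above can be made explicit in a short sketch of the penalty. It assumes the coefficients are stored as a matrix with one row per covariate (row 0 being the population-level term) and one column per upper-triangular entry, and that the group norm is an unweighted Euclidean norm; any weights used in (5) are not reproduced. The names `sparse_group_penalty`, `diag_mask`, `lam` and `lam_g` are hypothetical.

```python
import numpy as np

def sparse_group_penalty(B, diag_mask, lam, lam_g):
    """Sparse group lasso penalty with the two exclusions described in the text.

    B: (q+1, m) coefficient matrix; row 0 holds the population-level effects.
    diag_mask: boolean (m,), True for columns that are diagonal (j == k) entries.
    lam, lam_g: element-wise and group tuning parameters (assumed names).
    """
    l1_mask = np.ones_like(B, dtype=bool)
    l1_mask[0, diag_mask] = False      # population-level variances are not penalized
    l1_term = np.abs(B[l1_mask]).sum()
    # Row 0 is left out of the group penalty; each remaining row is one group.
    group_term = np.linalg.norm(B[1:], axis=1).sum()
    return lam * l1_term + lam_g * group_term

# Usage with p = 2, so the columns are the pairs (0,0), (0,1), (1,1).
B_example = np.array([[1.0, -2.0, 3.0],
                      [0.0, 0.0, 0.0],
                      [3.0, 0.0, -4.0]])
diag_example = np.array([True, False, True])
print(sparse_group_penalty(B_example, diag_example, lam=0.5, lam_g=2.0))  # 14.5
```

In the usage example the element-wise term is 2 + 3 + 4 = 9 and the group term is 0 + 5 = 5, giving 0.5 * 9 + 2 * 5 = 14.5.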
3 Estimation
Given , and to estimate , we consider the following optimization problem,
(6)
where is specified as in (5). When there are no covariates, (6) reduces to the standard sparse covariance estimation problem (Rothman et al., 2009), written as
which is minimized at for and is the soft-thresholding operator at (Bickel et al., 2008a; Rothman et al., 2009).
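For the covariate-free case, the soft-thresholding estimator can be sketched as follows. The sketch assumes mean-zero responses, thresholds only the off-diagonal entries (consistent with the unpenalized diagonal described earlier), and uses function names chosen here for illustration.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft-thresholding: sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_sample_covariance(Y, lam):
    """Soft-threshold the off-diagonal entries of the sample second-moment matrix.

    Y: (n, p) mean-zero responses. Diagonal entries are left untouched,
    which is an assumption made in this sketch.
    """
    n = Y.shape[0]
    S = Y.T @ Y / n
    T = soft_threshold(S, lam)
    np.fill_diagonal(T, np.diag(S))
    return T

# Usage on a tiny example.
Y_small = np.array([[1.0, 2.0],
                    [3.0, -1.0]])
print(sparse_sample_covariance(Y_small, lam=0.2))
```

Larger values of the threshold set more off-diagonal entries exactly to zero, which is what produces a sparse co-expression network.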
The optimization problem in (6) is nontrivial, as the constraint set is nonconvex. To tackle this challenge, we propose an efficient two-step estimation procedure that first solves the unconstrained optimization problem and then finds a projection of the estimator to the constrained set.
This sequential estimation procedure greatly simplifies the computation and gives asymptotically consistent estimators. Similar sequential procedures have been commonly employed in statistical learning and optimization problems; see, e.g., Li et al. (2010); Zhang et al. (2020).
Algorithm 1 Sparse covariance regression with multivariate sparse group lasso
Input: Tuning parameters , , and as the initial estimator of in (4).
repeat
Step 1: For , compute as:
Step 2: For , update by
Step 3: For , check the condition below
If the condition above is satisfied, set .
If not, update by
To facilitate estimation, we reorder the columns of such that , where is a matrix collecting diagonals of , and is a matrix collecting off-diagonals of .
Similarly, we reorder the columns of in (4) such that .
For , let , and be the th row of , and , respectively, and let
. Write as the partial residual matrix.
For optimization, we adopt the blockwise coordinate descent algorithm summarized in Algorithm 1, and describe each step in detail below.
For , the solution to is obtained by the least squares estimator, as the diagonal elements of are not penalized, and the solution to is obtained by the lasso estimator, as the off-diagonal elements of are not penalized by the group lasso penalty. For , the solution to is obtained by the sparse group lasso estimator.
In particular, it can be shown that the Karush-Kuhn-Tucker condition for the sparse group lasso (Simon et al., 2013)
is satisfied with if
where is the th column of in (4) and is the element-wise soft-thresholding operator at , that is, .
When , the solution for is determined by
Steps 1-3 in Algorithm 1 solve the unconstrained problem in (6). In Step 4 and given the estimators from Steps 1-3, we set
where .
These estimators ensure is positive semi-definite, and give a Ledoit-Wolf type shrinkage estimator (Ledoit and Wolf, 2004),
(7)
As increases, it follows from Theorem 2 that estimated from Steps 1-3 are consistent and satisfy the constraint (6) with high probability. That is, as increases, in (7) converges to zero. See more discussions after Theorem 2.
Two parameters and in (6) require tuning. In our procedure, they are jointly selected via 5-fold cross validation in both our simulation studies and real data analysis.
4 Theoretical Properties
In this section, we investigate the convergence rate of the estimator from the multivariate sparse group lasso regression in (6).
Theoretical properties of the sparse group lasso estimator in multivariate linear regressions have been investigated in Li et al. (2015). However, they assumed a Gaussian distribution, which can be restrictive when modeling , the product of expressions from genes and in sample . Moreover, the penalty term (5) is more involved than the sparse group lasso penalty in Li et al. (2015), as is excluded from the group sparsity penalty and the diagonal elements of are excluded from the element-wise sparsity.
We first discuss notation. After rearranging the columns of such that as in Section 3, define an index set such that
(8)
containing indices in that are included in the penalty in (5).
For a matrix , let be the index set of non-zero elements in , be the index set of non-zero rows excluding the first row and
.
For an index set , define such that the th element of equals to the th element of if and is 0 otherwise.
For a given index set , define such that the th row of equals to the th row of if and is otherwise.
The unconstrained objective function in (6) can be rewritten as
(9)
In our theoretical analysis, we will focus on the minimizer of (9), which is not subject to the PD constraint in (6). Theorem 2 below shows that there exists a sufficiently large such that the constrained estimator from (6) reduces to the unconstrained estimator from (9) (see remark after Theorem 2).
4.1 Convergence rate
Let be the true coefficient matrices in (1), and let be the coefficient matrix in (4) calculated from . Let and denote the number of nonzero entries and groups in , respectively. Let denote the cardinality of a set. We first state regularity conditions.
Assumption 1
Assume and , for .
Assumption 2
For any non-zero matrix that satisfies
there exists a constant such that
Assumption 2 is a restricted eigenvalue condition that ensures the convexity of is bounded in the space of as specified in Assumption 2, and is commonly employed in the literature (Hastie et al., 2015).
We first derive a deterministic upper bound for the prediction error of obtained from minimizing (9).
Theorem 1
Suppose Assumptions 1-2 hold, and for some constant . Then, the solution that minimizes (9) satisfies
Theorem 1 shows that the mean square prediction error is bounded by a factor of . The bound also depends on the restricted eigenvalue constant and the sparsity of the true model.
Theorem 2
Suppose Assumptions 1-2 hold. Assume ’s are zero-mean sub-exponential random variables, and and are independent for . Let be the sub-exponential norm of and . For constants and , let
Then, with probability at least , it holds that
Theorem 2 shows that the Frobenius norm of the estimation error is bounded by a factor of order .
Here, we do not assume and are independent, as and from gene pairs and can be correlated.
Hence, under our setting, the error terms across element-wise regression tasks can be correlated. Comparable convergence rates, up to a logarithm factor, have been derived for sparse group lasso estimators in univariate regressions (Cai et al., 2022; Zhang and Li, 2023).
Theorem 2 also suggests that estimators from minimizing (9) satisfy the PD constraint in (6) as increases.
Specifically, it holds by Weyl’s inequality that
Since converges to zero as increases and by Assumption 1, we have, for a sufficiently large sample size ,
That is, the unconstrained optimizer of (9) satisfies the PD constraint in (6) for any .
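The Weyl step used in this argument can be written out in generic form. The notation below ($\hat B_h$ for the estimates, $B_h$ for the true coefficient matrices, $x_0 = 1$ for the intercept) is assumed for illustration, and the chain is a sketch of the standard argument rather than the paper's exact display.

```latex
% Weyl's inequality for symmetric A and symmetric perturbation E:
\lambda_{\min}(A + E) \;\ge\; \lambda_{\min}(A) - \|E\|_2
                      \;\ge\; \lambda_{\min}(A) - \|E\|_F .
% Apply it with A = \Sigma(x) = \sum_{h=0}^{q} x_h B_h and
% E = \hat\Sigma(x) - \Sigma(x), where x_0 = 1 and x_h \in [0,1]:
\lambda_{\min}\bigl(\hat\Sigma(x)\bigr)
  \;\ge\; \lambda_{\min}\bigl(\Sigma(x)\bigr)
          - \sum_{h=0}^{q} x_h \,\|\hat B_h - B_h\|_F
  \;\ge\; \lambda_{\min}\bigl(\Sigma(x)\bigr)
          - \sum_{h=0}^{q} \|\hat B_h - B_h\|_F .
```

Once the total estimation error falls below a lower bound on $\lambda_{\min}(\Sigma(x))$, the right-hand side is positive uniformly in $x$, which is the sense in which the unconstrained estimate satisfies the constraint for large samples.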
4.2 Statistical Inference via Debiasing
In this section, we consider the inference for under the proposed multivariate sparse group lasso regression. Inspired by recent advances on debiasing in high dimensional linear regressions (Javanmard and Montanari, 2014; Zhang and Zhang, 2014; Cai et al., 2022), we design a debiased lasso estimator based on (9) which performs debiasing on each column of the proposed estimator separately and makes inference on the true parameter matrix . Since is of dimension , carrying out the debiasing in a column-wise fashion, each of dimension , significantly reduces the computational cost. The price of this computational gain is a potential loss of efficiency, compared to debiasing all columns in simultaneously. We also note that debiasing all columns in simultaneously may not be feasible under our framework as one needs to estimate , the analytical form of which is difficult to derive without distributional assumptions on .
Besides non-Gaussian errors, our procedure also faces the challenge of heteroskedasticity, as may depend on . Next, we detail our approach.
Denote .
Consider the following debiased estimator :
(10)
where , and , is a solution to
(11)
where is to be specified later and is the th vector in the canonical basis of . From the above calculations, is only a function of the design matrix , and can be seen as an approximation to the inverse of .
Assuming a random design where ’s are i.i.d. sub-Gaussian, and have bounded eigenvalues and diagonal elements, Javanmard and Montanari (2014) showed the optimizations in (11) are feasible with high probability.
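The column-wise correction can be sketched with the standard one-step debiasing formula from the debiased lasso literature, which (10) is assumed here to follow. The approximate inverse `M` is taken as given: in the paper it comes from the optimization in (11), which is not implemented in this sketch. All names are hypothetical.

```python
import numpy as np

def debias(B_hat, W, S, M):
    """One-step debiasing applied to every column with a shared matrix M.

    B_hat: (d, m) initial (penalized) estimate; W: (n, d) design matrix;
    S: (n, m) response matrix; M: (d, d) approximate inverse of W^T W / n.
    Because M depends only on the design, it is computed once and reused
    across all m regression tasks.
    """
    n = W.shape[0]
    residual = S - W @ B_hat
    return B_hat + M @ W.T @ residual / n

# Usage: in low dimensions M can be the exact inverse Gram matrix, in which
# case the correction recovers ordinary least squares from any starting point.
rng = np.random.default_rng(1)
n_obs, d, m = 50, 3, 4
W_demo = np.column_stack([np.ones(n_obs), rng.uniform(size=(n_obs, d - 1))])
S_demo = rng.normal(size=(n_obs, m))
B_init = rng.normal(size=(d, m))
M_exact = np.linalg.inv(W_demo.T @ W_demo / n_obs)
B_debiased = debias(B_init, W_demo, S_demo, M_exact)
```

The low-dimensional check is only a sanity test of the formula; in the high-dimensional setting the Gram matrix is not invertible, which is why an approximate inverse is needed.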
Define an index mapping such that, for and a symmetric matrix , if . Next, we state our result on debiasing for inference.
Theorem 3
Suppose the conditions in Theorem 2 are satisfied and the optimizations in (11) are feasible. Assume and for some and .
(1)
With probability at least , can be decomposed as
where such that
and such that with is asymptotically normal with mean zero and variance .
(2)
When and ,
an asymptotic two-sided confidence interval for , where , is
In Theorem 3, the errors are not assumed to be Gaussian and they can be heteroskedastic. Calculating the above confidence interval requires estimating the variance of , where . This can be consistently estimated by the empirical variance (Bühlmann and Van de Geer, 2015)
(12)
where with .
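A generic heteroskedasticity-robust interval of this kind can be sketched as below. The exact form of (12) is not reproduced here, so the variance formula is an assumption: it averages squared per-sample score contributions, which is the usual sandwich-type estimate for a debiased coefficient. The names `robust_ci`, `m_j` and `resid` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def robust_ci(b_debiased, m_j, W, resid, level=0.95):
    """Two-sided normal confidence interval with a robust variance estimate.

    b_debiased: debiased estimate of one coefficient (scalar);
    m_j: (d,) row of the approximate inverse for that coefficient;
    W: (n, d) design matrix; resid: (n,) residuals of that regression task.
    """
    n = W.shape[0]
    scores = (W @ m_j) * resid        # per-sample contribution m_j^T w_i * e_i
    var_hat = np.mean(scores ** 2)    # allows the error variance to vary with w_i
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * np.sqrt(var_hat / n)
    return b_debiased - half_width, b_debiased + half_width

# Usage on a toy case: intercept-only design with unit residuals.
W_toy = np.ones((4, 1))
lo, hi = robust_ci(1.0, np.array([2.0]), W_toy, np.ones(4))
```

Because the residuals enter sample by sample, no common error variance is assumed, which matches the heteroskedastic setting described in the text.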
5 Simulation Studies
In this section, we investigate the finite sample performance of our proposed method, referred to as SparseCovReg, and compare it with four alternative methods, including:
DenseSample: standard sample covariance estimator ,
SparseSample: soft-thresholding sample covariance estimator where is the element-wise soft-thresholding operator at (Rothman et al., 2009),
CovReg: quadratic covariance regression estimator in Hoff and Niu (2012),
DenseCovReg: multivariate regression estimator without sparsity, i.e., .
The tuning parameters in SparseCovReg and SparseSample are selected using 5-fold cross validation.
We simulate samples , where the response is of dimension (e.g., genes) and covariate is of dimension (e.g., genetic variants).
For ’s, we consider two simulation settings. In Setting 1, we consider continuous covariates drawn independently from and in Setting 2, we consider discrete covariates drawn independently from .
Given , we simulate from , where
(13)
and is the th element in . The above MA(1) covariance structure has been commonly considered by others (Rothman et al., 2009; Qiu and Liyanage, 2019).
We consider , and . For each simulation configuration, we generate 100 independent data sets.
Let denote the true covariance matrix for the th observation and denote the estimated from a given method.
For illustration, Figure 1 plots the estimated against for from 5 data replicates.
We did not include the scatter plot from SparseSample as it is very similar to that of DenseSample. As DenseSample does not account for the effect of covariates, is estimated to be constant across all subjects, as shown by five horizontal lines from 5 data replicates.
The CovReg method by Hoff and Niu (2012) cannot estimate the covariance well as the true covariance is not a quadratic function of the covariates.
The DenseCovReg gives a reasonable agreement between the estimated and true covariances (slopes are all roughly 1), though the variability of the estimates is very high.
On the other hand, the proposed SparseCovReg estimates the covariance well (slopes are all roughly 1) and enjoys a much reduced variability.
Figure 1: Comparison of the true non-zero covariance (x-axis) and estimated covariance (y-axis) for five simulated datasets under Setting 1 (continuous covariates) with the number of covariates and the sample size .
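| n | q | Method | Setting 1 | Setting 2 |
|---|---|---|---|---|
| 200 | 30 | DenseSample | 3.22 (0.08) | 4.15 (0.07) |
| 200 | 30 | SparseSample | 2.01 (0.06) | 3.23 (0.06) |
| 200 | 30 | CovReg | 7.75 (0.31) | 8.60 (0.43) |
| 200 | 30 | DenseCovReg | 15.09 (0.34) | 15.64 (0.46) |
| 200 | 30 | SparseCovReg | 1.84 (0.06) | 2.17 (0.08) |
| 200 | 100 | DenseSample | 3.21 (0.08) | 4.13 (0.08) |
| 200 | 100 | SparseSample | 2.01 (0.06) | 3.22 (0.06) |
| 200 | 100 | CovReg | 14.64 (1.74) | 15.97 (2.20) |
| 200 | 100 | DenseCovReg | 26.92 (0.60) | 27.64 (0.79) |
| 200 | 100 | SparseCovReg | 1.84 (0.08) | 2.18 (0.10) |
| 500 | 30 | DenseSample | 2.41 (0.04) | 3.53 (0.03) |
| 500 | 30 | SparseSample | 1.75 (0.04) | 3.12 (0.03) |
| 500 | 30 | CovReg | 4.35 (0.14) | 5.18 (0.12) |
| 500 | 30 | DenseCovReg | 9.55 (0.14) | 9.93 (0.18) |
| 500 | 30 | SparseCovReg | 1.28 (0.05) | 1.42 (0.07) |
| 500 | 100 | DenseSample | 2.39 (0.04) | 3.52 (0.03) |
| 500 | 100 | SparseSample | 1.75 (0.04) | 3.11 (0.03) |
| 500 | 100 | CovReg | 11.06 (0.34) | 12.21 (0.43) |
| 500 | 100 | DenseCovReg | 17.26 (0.28) | 17.86 (0.36) |
| 500 | 100 | SparseCovReg | 1.29 (0.06) | 1.45 (0.07) |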
Table 1: Average estimation errors measured via over 100 simulations with standard errors shown in parentheses.
Next, we compare the estimation errors calculated as . Table 1 reports the average estimation errors with standard errors in the parentheses.
The proposed SparseCovReg outperforms the alternative methods for all and . It is seen that the estimation errors of SparseCovReg increase with and decrease with , confirming the results of Theorem 2. In Table 2, we also report the selection accuracy of SparseCovReg by measuring the true positive rate and the false positive rate as
True positive rate
False positive rate
Note that the selection accuracy cannot be fairly evaluated from other methods, as DenseSample, CovReg and DenseCovReg are all dense estimators, and SparseSample does not estimate for .
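The two selection rates can be computed from the supports of the true and estimated coefficients, as in the sketch below. The displays defining the rates are not reproduced above, so the set of entries they range over is an assumption here (all entries of the matrices passed in), and the function name is hypothetical.

```python
import numpy as np

def selection_rates(B_true, B_hat, tol=0.0):
    """True and false positive rates of support recovery.

    B_true, B_hat: arrays of the same shape. An estimated entry counts as
    selected when its magnitude exceeds tol.
    """
    true_nonzero = np.abs(B_true) > 0
    est_nonzero = np.abs(B_hat) > tol
    tpr = (true_nonzero & est_nonzero).sum() / true_nonzero.sum()
    fpr = (~true_nonzero & est_nonzero).sum() / (~true_nonzero).sum()
    return tpr, fpr

# Usage: two true signals, one recovered; four true zeros, one falsely selected.
B_true_demo = np.array([[1.0, 0.0, 0.0],
                        [0.0, 2.0, 0.0]])
B_hat_demo = np.array([[0.5, 0.1, 0.0],
                       [0.0, 0.0, 0.0]])
print(selection_rates(B_true_demo, B_hat_demo))  # (0.5, 0.25)
```

Exact zeros are meaningful for penalized estimators such as SparseCovReg; for dense estimators every entry would be selected, which is why the comparison is not informative for them.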
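| n | q | Selection rate | Setting 1 | Setting 2 |
|---|---|---|---|---|
| 200 | 30 | TPR | 0.8880 | 0.9960 |
| 200 | 30 | FPR | 0.0145 | 0.0147 |
| 200 | 100 | TPR | 0.8749 | 0.9960 |
| 200 | 100 | FPR | 0.0042 | 0.0042 |
| 500 | 30 | TPR | 0.9919 | 0.9999 |
| 500 | 30 | FPR | 0.0155 | 0.0155 |
| 500 | 100 | TPR | 0.9926 | 1.0000 |
| 500 | 100 | FPR | 0.0044 | 0.0042 |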
Table 2: True positive rate (TPR) and false positive rate (FPR) of SparseCovReg.
Figure 2: 95% confidence intervals for elements in from one data replicate under Setting 2 with , . True parameter values are shown in and estimated parameter values, after debiasing, are shown in . (x-axis: Index; y-axis: Coefficient.)
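| n | q | Variance used | Setting 1: all | Setting 1: non-zero | Setting 1: zero | Setting 2: all | Setting 2: non-zero | Setting 2: zero |
|---|---|---|---|---|---|---|---|---|
| 200 | 30 | estimated by (12) | 0.938 | 0.928 | 0.938 | 0.927 | 0.910 | 0.927 |
| 200 | 30 | true | 0.950 | 0.948 | 0.950 | 0.950 | 0.949 | 0.950 |
| 200 | 100 | estimated by (12) | 0.875 | 0.853 | 0.875 | 0.863 | 0.828 | 0.863 |
| 200 | 100 | true | 0.951 | 0.950 | 0.951 | 0.951 | 0.935 | 0.951 |
| 500 | 30 | estimated by (12) | 0.943 | 0.937 | 0.943 | 0.942 | 0.939 | 0.942 |
| 500 | 30 | true | 0.951 | 0.948 | 0.951 | 0.951 | 0.953 | 0.951 |
| 500 | 100 | estimated by (12) | 0.925 | 0.912 | 0.925 | 0.921 | 0.909 | 0.921 |
| 500 | 100 | true | 0.951 | 0.947 | 0.951 | 0.951 | 0.951 | 0.951 |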
Table 3: Average coverage probabilities of the 95% confidence intervals with the variance of estimated by (12), referred to as , and with the true variance of , referred to as .
Columns , and show average coverage probabilities of all off-diagonal parameters, non-zero off-diagonal parameters and zero off-diagonal parameters in , respectively.
Next, we evaluate the efficacy of the statistical inference procedure from Section 4.2. In Figure 2, we plot 95% confidence intervals for entries in from one data replicate under Setting 2 with , .
Specifically, the first 49 confidence intervals in Figure 2 are shown for for . From (13), we have for .
The last 48 confidence intervals in Figure 2 are shown for . From (13), we have for . It is seen that the 95% confidence intervals show a good coverage.
Finally, we evaluate the coverage probabilities for all off-diagonal parameters in , and , respectively, in Table 3.
Under each setting, we report the average coverage probabilities when the confidence intervals are calculated using the true variances of ’s and empirical variances estimated using (12).
It is seen that the intervals calculated with empirical variances achieve satisfactory coverage, which approaches 95% as the sample size increases.
6 Real Data Analysis
We apply our proposed method SparseCovReg to the REMBRANDT study (GSE108476) that collects data on 178 patients with glioblastoma multiforme (GBM), the most common malignant form of brain tumor in adults and one of the most lethal of all cancers (Akhavan et al., 2010).
These 178 patients had undergone microarray and single-nucleotide polymorphism (SNP) chip profiling, with both gene expression and SNP data available for analysis.
The raw data were pre-processed and normalized using standard pipelines; see Gusev et al. (2018) for more details.
The main objectives of our analysis are to identify co-expression QTLs and recover both the population-level and individual-level covariance matrices of gene expressions.
For response variables, we consider the expression levels of 73 genes that belong to the human glioma pathway in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa and Goto, 2000).
As covariates, we consider local SNPs (i.e., SNPs that fall within 2kb upstream and 0.5kb downstream of the gene) residing near those 73 genes, resulting in a total of 118 SNPs. SNPs are coded with “0” indicating homozygous in the major allele and “1” otherwise. Our analysis also includes age (continuous) and sex as covariates, bringing the total to 120 covariates and 326,821 parameters in the model (4).
Tuning parameters have been selected by 5-fold cross validation.
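The parameter count quoted for this analysis can be checked with a short calculation, assuming each symmetric coefficient matrix contributes its upper-triangular entries (including the diagonal) and that there is one such matrix for the intercept plus one per covariate. The variable names are chosen here for illustration.

```python
# Dimensions stated in the text.
n_genes = 73
n_covariates = 118 + 2            # 118 SNPs plus age and sex

# Free entries of one symmetric 73 x 73 matrix (upper triangle with diagonal).
entries_per_matrix = n_genes * (n_genes + 1) // 2

# One coefficient matrix for the intercept and one for each covariate.
n_parameters = (n_covariates + 1) * entries_per_matrix
print(entries_per_matrix, n_parameters)  # 2701 326821
```

With 178 patients and more than three hundred thousand parameters, the sparsity assumptions are what make estimation feasible in this data set.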
Figure 3: Heatmaps of the population-level covariance estimates from SparseSample (left) and SparseCovReg (right). Positive values are shown in red and negative values are shown in blue.
We first investigate the population-level co-expression matrix.
In Figure 3, we compare the soft-thresholding covariance estimator (Rothman et al., 2009) with the population level covariance obtained from SparseCovReg.
It is seen that the soft-thresholding estimator shares some common patterns with SparseCovReg but is considerably noisier.
Using the population covariance matrix from SparseCovReg, we can identify high correlations between PIK3CA and genes in the calcium signaling pathway including CALML5, CALM1, CAMK1D and CAMK2B. This is reasonable as mutations in PIK3CA have been reported in multiple tumor types and PIK3CA is part of the PI3K/AKT/MTOR signaling pathway, one of the core pathways in human GBM (Network et al., 2008). The calcium signaling pathway also plays diverse roles in the progression of brain cancers (Maklad et al., 2019).
Figure 3 shows negative correlations between PIK3CA and genes in the p53 signaling pathway, another core pathway in human GBM (Network et al., 2008) and a potential target for inhibition in GBM treatments (Schröder and McDonald, 2015; Yin et al., 2021).
Figure 4: Heatmaps of identified nonzero covariate effects. Positive values are shown in red and negative values are shown in blue.
Next, we examine the covariate effects on the covariance matrix. Non-zero effects have been identified for six SNPs: rs6701524, rs10509346, rs10519201, rs1347069, rs503314, and rs306098.
The non-zero effects of rs10509346 and rs6701524 are shown in Figure 4 and their network effects after the debiased inference procedure are shown in Figure 5. Interestingly, these covariate effects are not easily observable from the soft-thresholding estimator in Figure 3, suggesting that, by fitting model (1), we may find some covariate-modulated co-expression patterns that can otherwise be overlooked.
From the left plot of Figure 5, it is seen that rs10509346, residing in CAMK2G, notably affects co-expressions among genes in the Ras-Raf-MEK-ERK signaling pathway including EGF, SHC4, RAF1 and MAPK3. Also, their co-expressions with CAMK2D and CALML5 in the calcium signaling pathway are affected by rs10509346. This agrees with the findings that the Ras-Raf-MEK-ERK pathway is modulated by Ca2+ and calmodulin (Agell et al., 2002; Zhang and Li, 2023). Furthermore, this SNP is found to affect the co-expressions of MTOR, part of the PI3K/AKT/MTOR pathway, with genes in the Ras-Raf-MEK-ERK pathway. This result is interesting because MTOR is a key mediator of PI3K/AKT/MTOR signaling, and is known to cooperate with alterations in other signaling pathways that are also commonly activated in GBM patients, such as the Ras-Raf-MEK-ERK pathway (Akhavan et al., 2010).
The right plot of Figure 5 shows that rs6701524, residing in MTOR, affects co-expressions of genes in the PI3K/MTOR pathway. In particular, co-expressions of PIK3CD (and PIK3CB) with other genes are affected by this SNP.
This is an interesting finding as PI3K/MTOR is a key pathway in the development and progression of GBM, and the inhibition of PI3K/MTOR signaling was found to be effective in increasing survival with GBM tumors (Batsios et al., 2019).
Co-expressions affected by other SNPs are also worth noting. For example, rs306098 has been found to affect co-expressions of SHC2 with CDK4/6 and MTOR, which is interesting because the combination of CDK4/6 and MTOR inhibition has been investigated as a potential therapeutic strategy in GBM (Olmez et al., 2017).
Figure 5: Network effects of rs10509346 (left) and rs6701524 (right) identified using the debiased inference procedure in Section 4.2 with significance level and Bonferroni correction . Positive effects are shown in red and negative effects are shown in blue. Edge weights in the network graphs are proportional to the effect size.
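To make the selection rule behind Figure 5 concrete, the sketch below applies a Bonferroni correction to a set of test statistics. It assumes each debiased estimate has been standardized into a z-statistic that is approximately standard normal under the null of no effect. The function names, the placeholder edge labels, the z-values, and the level `alpha=0.05` are all hypothetical and are not taken from the analysis.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def bonferroni_select(z_stats, alpha):
    """Return the keys whose p-value is below alpha / (number of tests)."""
    m = len(z_stats)
    cutoff = alpha / m  # Bonferroni-adjusted per-test level
    return [key for key, z in z_stats.items() if two_sided_p_value(z) < cutoff]


# Hypothetical z-statistics for three edges (made-up values).
z_stats = {
    ("gene_a", "gene_b"): 4.2,
    ("gene_a", "gene_c"): 1.1,
    ("gene_b", "gene_c"): -3.0,
}
selected = bonferroni_select(z_stats, alpha=0.05)
```

With three tests the per-test cutoff is 0.05/3, so the edges with statistics 4.2 and -3.0 are retained and the edge with statistic 1.1 is not. In a real application the number of tests would be the number of edge-covariate effects being tested.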
7 Discussion
In our current approach, we did not consider the hierarchy between the population-level effect and the covariate-specific effects. However, in some applications, it might be reasonable to assume that a covariate-specific effect is non-zero only when the population-level effect is non-zero.
Our multivariate regression framework can be extended to accommodate such a hierarchical structure by modifying the penalty term (Tibshirani and Friedman, 2020; Kim et al., 2021).
Next, we comment on the model interpretability after scaling the covariates to be in . Given bounded covariates for all , denote and as the lower and upper bounds of , respectively, and consider the covariance regression model
The above equation can be rewritten as
where and . Note that and only differ by a positive scalar, and they share the same sparsity pattern. Hence, parameter estimates can be interpreted with covariates before the transformation.
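The rescaling argument can be written out explicitly under assumed notation. The symbols below ($x_h$ for the $h$-th covariate, $l_h < u_h$ for its bounds, $q$ for the number of covariates, and $B_0$, $B_h$ for the coefficient matrices) are our own illustrative choices. The sketch further assumes that the covariance is linear in covariates rescaled to lie in $[0,1]$.

```latex
\begin{align*}
\Sigma(x)
  &= B_0 + \sum_{h=1}^{q} \frac{x_h - l_h}{u_h - l_h}\, B_h \\
  &= \Bigl( B_0 - \sum_{h=1}^{q} \frac{l_h}{u_h - l_h}\, B_h \Bigr)
     + \sum_{h=1}^{q} x_h\, \frac{B_h}{u_h - l_h} \\
  &= \tilde{B}_0 + \sum_{h=1}^{q} x_h\, \tilde{B}_h,
\qquad
\tilde{B}_0 = B_0 - \sum_{h=1}^{q} \frac{l_h}{u_h - l_h}\, B_h,
\quad
\tilde{B}_h = \frac{B_h}{u_h - l_h}.
\end{align*}
```

Because $u_h - l_h > 0$, each $\tilde{B}_h$ is a positive multiple of $B_h$, so the two matrices have zeros in the same positions and entries of the same sign. Only the intercept matrix $\tilde{B}_0$ absorbs the shift by the lower bounds.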
References
Agell et al. (2002)
Agell, N., Bachs, O., Rocamora, N., and Villalonga, P. (2002),
“Modulation of the Ras/Raf/MEK/ERK pathway by Ca2+, and calmodulin,”
Cellular signalling, 14, 649–654.
Akhavan et al. (2010)
Akhavan, D., Cloughesy, T. F., and Mischel, P. S. (2010), “mTOR
signaling in glioblastoma: lessons learned from bench to bedside,”
Neuro-oncology, 12, 882–889.
Alakus et al. (2022)
Alakus, C., Larocque, D., and Labbe, A. (2022), “Covariance regression
with random forests,” arXiv preprint arXiv:2209.08173.
Anderson (1973)
Anderson, T. W. (1973), “Asymptotically efficient estimation of
covariance matrices with linear structure,” The Annals of
Statistics, 1, 135–141.
Argyriou et al. (2008)
Argyriou, A., Evgeniou, T., and Pontil, M. (2008), “Convex multi-task
feature learning,” Machine learning, 73, 243–272.
Batsios et al. (2019)
Batsios, G., Viswanath, P., Subramani, E., Najac, C., Gillespie, A. M., Santos,
R. D., Molloy, A. R., Pieper, R. O., and Ronen, S. M. (2019),
“PI3K/mTOR inhibition of IDH1 mutant glioma leads to reduced 2HG
production that is associated with increased survival,” Scientific
reports, 9, 10521.
Bickel et al. (2008a)
Bickel, P. J., Levina, E., et al. (2008a), “Covariance
regularization by thresholding,” The Annals of Statistics, 36,
2577–2604.
Bickel et al. (2008b)
— (2008b), “Regularized estimation of large covariance
matrices,” The Annals of Statistics, 36, 199–227.
Bien and Tibshirani (2011)
Bien, J. and Tibshirani, R. J. (2011), “Sparse estimation of a
covariance matrix,” Biometrika, 98, 807–820.
Bühlmann and Van de Geer (2015)
Bühlmann, P. and Van de Geer, S. (2015), “High-dimensional
inference in misspecified linear models,” Electronic Journal of
Statistics, 9, 1449–1473.
Butte et al. (2000)
Butte, A. J., Tamayo, P., Slonim, D., Golub, T. R., and Kohane, I. S. (2000),
“Discovering functional relationships between RNA expression and
chemotherapeutic susceptibility using relevance networks,”
Proceedings of the National Academy of Sciences, 97, 12182–12186.
Cai et al. (2022)
Cai, T. T., Zhang, A. R., and Zhou, Y. (2022), “Sparse group lasso:
Optimal sample complexity, convergence rate, and statistical inference,”
IEEE Transactions on Information Theory, 68, 5975–6002.
Chiu et al. (1996)
Chiu, T. Y., Leonard, T., and Tsui, K.-W. (1996), “The
matrix-logarithmic covariance model,” Journal of the American
Statistical Association, 91, 198–210.
El Karoui et al. (2010)
El Karoui, N. et al. (2010), “High-dimensionality effects in the
Markowitz problem and other quadratic programs with linear constraints: Risk
underestimation,” The Annals of Statistics, 38, 3487–3566.
Fox and Dunson (2015)
Fox, E. B. and Dunson, D. B. (2015), “Bayesian nonparametric covariance
regression,” The Journal of Machine Learning Research, 16,
2501–2542.
Franks (2021)
Franks, A. M. (2021), “Reducing subspace models for large-scale
covariance regression,” Biometrics.
Gardner et al. (2003)
Gardner, T. S., Di Bernardo, D., Lorenz, D., and Collins, J. J. (2003),
“Inferring genetic networks and identifying compound mode of action
via expression profiling,” Science, 301, 102–105.
Gusev et al. (2018)
Gusev, Y., Bhuvaneshwar, K., Song, L., Zenklusen, J.-C., Fine, H., and
Madhavan, S. (2018), “The REMBRANDT study, a large collection of
genomic data from brain cancer patients,” Scientific data, 5, 1–9.
Hastie et al. (2015)
Hastie, T., Tibshirani, R., and Wainwright, M. (2015), “Statistical
learning with sparsity,” Monographs on statistics and applied
probability, 143, 143.
Hoff and Niu (2012)
Hoff, P. D. and Niu, X. (2012), “A covariance regression model,”
Statistica Sinica, 729–753.
Huang et al. (2006)
Huang, J. Z., Liu, N., Pourahmadi, M., and Liu, L. (2006), “Covariance
matrix selection and estimation via penalised normal likelihood,”
Biometrika, 93, 85–98.
Javanmard and Montanari (2014)
Javanmard, A. and Montanari, A. (2014), “Confidence intervals and
hypothesis testing for high-dimensional regression,” The Journal of
Machine Learning Research, 15, 2869–2909.
Kanehisa and Goto (2000)
Kanehisa, M. and Goto, S. (2000), “KEGG: kyoto encyclopedia of genes
and genomes,” Nucleic acids research, 28, 27–30.
Kim et al. (2021)
Kim, R., Mueller, S., and Garcia, T. P. (2021), “svReg: Structural
varying-coefficient regression to differentiate how regional brain atrophy
affects motor impairment for Huntington disease severity groups,”
Biometrical Journal, 63, 1254–1271.
Lam and Fan (2009)
Lam, C. and Fan, J. (2009), “Sparsistency and rates of convergence in
large covariance matrix estimation,” Annals of statistics, 37, 4254.
Langfelder and Horvath (2008)
Langfelder, P. and Horvath, S. (2008), “WGCNA: an R package for
weighted correlation network analysis,” BMC bioinformatics, 9,
1–13.
Ledoit and Wolf (2004)
Ledoit, O. and Wolf, M. (2004), “A well-conditioned estimator for
large-dimensional covariance matrices,” Journal of multivariate
analysis, 88, 365–411.
Li et al. (2015)
Li, Y., Nan, B., and Zhu, J. (2015), “Multivariate sparse group lasso
for the multivariate multiple linear regression with an arbitrary group
structure,” Biometrics, 71, 354–363.
Li et al. (2010)
Li, Y., Wang, N., and Carroll, R. J. (2010), “Generalized functional
linear models with semiparametric single-index interactions,” Journal
of the American Statistical Association, 105, 621–633.
Maklad et al. (2019)
Maklad, A., Sharma, A., and Azimi, I. (2019), “Calcium signaling in
brain cancers: roles and therapeutic targeting,” Cancers, 11, 145.
Network et al. (2008)
The Cancer Genome Atlas Research Network (2008), “Comprehensive genomic characterization defines
human glioblastoma genes and core pathways,” Nature, 455,
1061–1068.
Olmez et al. (2017)
Olmez, I., Brenneman, B., Xiao, A., Serbulea, V., Benamar, M., Zhang, Y.,
Manigat, L., Abbas, T., Lee, J., Nakano, I., et al. (2017), “Combined
CDK4/6 and mTOR inhibition is synergistic against glioblastoma via multiple
mechanisms,” Clinical Cancer Research, 23, 6958–6968.
Park (2023)
Park, H. G. (2023), “Bayesian estimation of covariate assisted
principal regression for brain functional connectivity,” arXiv
preprint arXiv:2306.07181.
Pourahmadi (1999)
Pourahmadi, M. (1999), “Joint mean-covariance models with applications
to longitudinal data: Unconstrained parameterisation,” Biometrika,
86, 677–690.
Qiu and Liyanage (2019)
Qiu, Y. and Liyanage, J. S. (2019), “Threshold selection for covariance
estimation,” Biometrics, 75, 895–905.
Rothman et al. (2009)
Rothman, A. J., Levina, E., and Zhu, J. (2009), “Generalized
thresholding of large covariance matrices,” Journal of the American
Statistical Association, 104, 177–186.
Schröder and McDonald (2015)
Schröder, L. B. and McDonald, K. L. (2015), “CDK4/6 inhibitor
PD0332991 in glioblastoma treatment: does it have a future?”
Frontiers in oncology, 5, 259.
Simon et al. (2013)
Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013), “A
sparse-group lasso,” Journal of Computational and Graphical
Statistics, 22, 231–245.
Su et al. (2023)
Su, C., Xu, Z., Shan, X., Cai, B., Zhao, H., and Zhang, J. (2023),
“Cell-type-specific co-expression inference from single cell
RNA-sequencing data,” Nature Communications, 14, 4846.
Tibshirani and Friedman (2020)
Tibshirani, R. and Friedman, J. (2020), “A pliable lasso,”
Journal of Computational and Graphical Statistics, 29, 215–225.
Van Der Wijst et al. (2018)
Van Der Wijst, M. G., de Vries, D. H., Brugge, H., Westra, H.-J., and Franke,
L. (2018), “An integrative approach for building personalized gene
regulatory networks for precision medicine,” Genome medicine, 10,
1–15.
Vierstra et al. (2020)
Vierstra, J., Lazar, J., Sandstrom, R., Halow, J., Lee, K., Bates, D., Diegel,
M., Dunn, D., Neri, F., Haugen, E., et al. (2020), “Global reference
mapping of human transcription factor footprints,” Nature, 583,
729–736.
Wu and Pourahmadi (2003)
Wu, W. B. and Pourahmadi, M. (2003), “Nonparametric estimation of large
covariance matrices of longitudinal data,” Biometrika, 90, 831–844.
Xue et al. (2012)
Xue, L., Ma, S., and Zou, H. (2012), “Positive-definite
ℓ1-penalized estimation of large covariance matrices,” Journal
of the American Statistical Association, 107, 1480–1491.
Yin et al. (2021)
Yin, L., Yao, Z., Wang, Y., Huang, J., Mazuranic, M., and Yin, A. (2021),
“Preclinical evaluation of novel CDK4/6 inhibitor GLR2007 in
glioblastoma models,” J. Clin. Oncol, 39, e14023.
Yuan and Lin (2006)
Yuan, M. and Lin, Y. (2006), “Model selection and estimation in
regression with grouped variables,” Journal of the Royal Statistical
Society: Series B (Statistical Methodology), 68, 49–67.
Zhang and Zhang (2014)
Zhang, C.-H. and Zhang, S. S. (2014), “Confidence intervals for low
dimensional parameters in high dimensional linear models,” Journal of
the Royal Statistical Society: Series B: Statistical Methodology, 217–242.
Zhang and Li (2023)
Zhang, J. and Li, Y. (2023), “High-dimensional Gaussian graphical
regression models with covariates,” Journal of the American
Statistical Association, 118, 2088–2100.
Zhang et al. (2020)
Zhang, J., Sun, W. W., and Li, L. (2020), “Mixed-effect time-varying
network model and application in brain connectivity analysis,”
Journal of the American Statistical Association, 115, 2022–2036.
Zhang et al. (2023)
Zhang, J., Sun, W. W., and Li, L. (2023), “Generalized connectivity
matrix response regression with applications in brain connectivity studies,”
Journal of Computational and Graphical Statistics, 32, 252–262.
Zhang and Zhao (2023)
Zhang, J. and Zhao, H. (2023), “eQTL studies: from bulk tissues to
single cells,” Journal of Genetics and Genomics, 50, 925–933.
Zhao et al. (2021)
Zhao, Y., Wang, B., Mostofsky, S. H., Caffo, B. S., and Luo, X. (2021),
“Covariate assisted principal regression for covariance matrix
outcomes,” Biostatistics, 22, 629–645.
Zou et al. (2017)
Zou, T., Lan, W., Wang, H., and Tsai, C.-L. (2017), “Covariance
regression analysis,” Journal of the American Statistical
Association, 112, 266–281.