High-dimensional covariance regression with application to co-expression QTL detection

Rakheon Kim
Department of Statistical Science, Baylor University
and
Jingfei Zhang
Goizueta Business School, Emory University

Abstract

While covariance matrices have been widely studied in many scientific fields, relatively limited progress has been made on estimating conditional covariances that permit a large covariance matrix to vary with high-dimensional subject-level covariates. In this paper, we present a new sparse multivariate regression framework that models the covariance matrix as a function of subject-level covariates. In the context of co-expression quantitative trait locus (QTL) studies, our method can be used to determine if and how gene co-expressions vary with genetic variations. To accommodate high-dimensional responses and covariates, we stipulate a combined sparsity structure that encourages covariates with non-zero effects and edges that are modulated by these covariates to be simultaneously sparse. We approach parameter estimation with a blockwise coordinate descent algorithm, and investigate the $\ell_{2}$ convergence rate of the estimated parameters. In addition, we propose a computationally efficient debiased inference procedure for uncertainty quantification. The efficacy of the proposed method is demonstrated through numerical experiments and an application to a gene co-expression network study with brain cancer patients.

Keywords: Covariance regression; subject-specific covariance matrix; multivariate linear regression; sparse group lasso; co-expression QTL.

1 Introduction

A covariance matrix measures the associations amongst a set of variables and its estimation and analysis play an important role in a wide range of applications, such as genetics (Butte et al., 2000; Su et al., 2023), neuroscience (Zhang et al., 2020, 2023), finance (El Karoui et al., 2010; Xue et al., 2012) and climatology (Bickel et al., 2008a). For example, in genetics, the covariance matrix estimated from gene expressions across different biological samples, often referred to as a co-expression network, is routinely used in identifying functional gene modules and dysregulated pathways in disease (Langfelder and Horvath, 2008; Su et al., 2023). Although most co-expression analyses to date assume a common covariance matrix for different subjects, the structure and degree of covariance may depend on an individual’s characteristics such as age, sex and genotype, which are referred to as individual-level covariates or covariates in this paper when there is no ambiguity. For example, it is known that co-expressions among genes can be affected by individual genetic variants, clinical and environmental factors (Van Der Wijst et al., 2018). In particular, a genetic variant that affects co-expressions between a pair of genes is termed a co-expression quantitative trait locus (QTL). Identifying co-expression QTLs is of great scientific interest and can be crucial in developing gene therapies that target specific gene or pathway disruptions (Van Der Wijst et al., 2018; Zhang and Zhao, 2023).

Although the literature on estimating large covariance matrices is steadily increasing (Wu and Pourahmadi, 2003; Huang et al., 2006; Bickel et al., 2008a, b; Rothman et al., 2009; Lam and Fan, 2009; Bien and Tibshirani, 2011, and others), the majority of existing methods assume a homogeneous population obeying a common covariance model. Some others have considered modeling covariate-dependent covariance matrices. For example, Anderson (1973) modeled the covariance matrix $\mbox{\boldmath$\Sigma$}\in\mathbb{R}^{p\times p}$ as a linear combination of a given set of symmetric matrices; Chiu et al. (1996) modeled elements in the logarithm of $\mbox{\boldmath$\Sigma$}$, denoted as $\log\mbox{\boldmath$\Sigma$}$, as a linear function of covariates ${\mathbf{x}}\in\mathbb{R}^{q}$. As noted by the authors, parameter interpretation for this model can be difficult, as a submatrix of $\mbox{\boldmath$\Sigma$}$ is not generally the matrix exponential of the same submatrix of $\log\mbox{\boldmath$\Sigma$}$, and so the entries in $\log\mbox{\boldmath$\Sigma$}$ do not directly relate to the corresponding entries in $\mbox{\boldmath$\Sigma$}$. Pourahmadi (1999) modeled elements of the Cholesky decomposition of $\mbox{\boldmath$\Sigma$}^{-1}$ as linear functions of ${\mathbf{x}}$, though this model is not invariant to reorderings of the response variables. Zou et al. (2017) related $\mbox{\boldmath$\Sigma$}$ to a linear combination of similarity matrices of covariates. However, the covariates considered in this work are variable-specific and not individual-specific (e.g., covariates of genes but not of individuals). As such, the estimated covariance could not account for individual-level heterogeneity due to clinical covariates and genotypes.

Notably, Hoff and Niu (2012) proposed to model $\mbox{\boldmath$\Sigma$}$ as a quadratic function of covariates ${\mathbf{x}}$ written as ${\mathbf{B}}{\mathbf{x}}{\mathbf{x}}^{\top}{\mathbf{B}}^{\top}$, ${\mathbf{B}}\in\mathbb{R}^{p\times q}$, which also admits a nice random-effects model representation; model estimation is carried out using the expectation–maximization (EM) algorithm or a Markov chain Monte Carlo (MCMC) algorithm via Gibbs sampling. Their modeling framework is further extended in Fox and Dunson (2015) by considering non-linear effects, in Franks (2021) by considering high dimensional response variables and in Alakus et al. (2022) by considering random forests. The computational costs of the above extended methods can be prohibitive when dimensions of the response variables and covariates are both high. Moreover, due to the quadratic form of ${\mathbf{B}}{\mathbf{x}}{\mathbf{x}}^{\top}{\mathbf{B}}^{\top}$, sparsity in parameter ${\mathbf{B}}$ does not directly translate to sparse effects of covariates, possibly limiting model interpretability. Zhao et al. (2021) and Park (2023) studied a principal regression approach that models $\bm{\gamma}^{\top}\mbox{\boldmath$\Sigma$}\bm{\gamma}$, where $\bm{\gamma}$ is an unknown rotation vector, as a generalized linear model of ${\mathbf{x}}$. Parameter interpretation for this model may not be straightforward, as elements in $\mbox{\boldmath$\Sigma$}$ are not directly modeled as a function of ${\mathbf{x}}$. It is also challenging to further extend this approach to the high dimensional setting.

To flexibly model large covariance matrices modulated by individual-level covariates, we propose a covariance regression model that allows the structure and degree of covariance to vary with discrete and continuous covariates of high dimensions. Specifically, the covariance matrix is modeled as a linear function of covariates with matrix-valued coefficients, subject to constraints that ensure positive semi-definiteness. Our model need not make specific assumptions on the distribution of response variables, such as the Gaussian assumption imposed in Hoff and Niu (2012). Using the method of moments, we formulate coefficient estimation as a multivariate linear regression problem and impose a sparse group lasso penalty that simultaneously encourages effective covariates and their effects on the covariance matrix to be sparse. This combined sparsity assumption facilitates model estimability and interpretability, and is closely connected with multi-task learning (Argyriou et al., 2008). However, it brings substantial challenges to our theoretical analysis. In theory, we investigate the convergence rate of the proposed estimator, allowing both the response variables and covariates to be high-dimensional. Under our modeling framework, we further formulate a debiased inferential procedure inspired by the recent literature on debiasing lasso (Javanmard and Montanari, 2014; Zhang and Zhang, 2014; Cai et al., 2022) that can also handle non-Gaussian and heteroskedastic errors. We show that the $q+1$ coefficient matrices associated with $q$ covariates (plus intercept) can be debiased separately, a result that significantly reduces the computational cost. Although motivated by a biological application, our method provides a general framework for modeling covariance matrices with covariates and is broadly applicable to other scientific fields that involve covariance estimation.

The rest of the paper is organized as follows. Section 2 introduces the covariance regression model and Section 3 discusses the sparse multivariate regression estimation. Section 4 investigates theoretically the convergence rate of the proposed estimator and also proposes a debiased inferential procedure. Section 5 carries out comprehensive simulation studies and Section 6 conducts a co-expression QTL analysis using a brain cancer genomics data set. A short discussion section concludes the paper.

2 Covariance Regression Models

We start with some notation. Write $[d]=\{1,2,\ldots,d\}$. Given a vector ${\mathbf{x}}=(x_{1},\ldots,x_{d})^{\top}$, we use $\|{\mathbf{x}}\|_{1}$, $\|{\mathbf{x}}\|_{2}$ and $\|{\mathbf{x}}\|_{\infty}$ to denote the vector $\ell_{1}$, $\ell_{2}$ and $\ell_{\infty}$ norms, respectively. For a matrix ${\mathbf{X}}\in\mathbb{R}^{d_{1}\times d_{2}}$, we let $\|{\mathbf{X}}\|_{1}=\sum_{ij}|X_{ij}|$, $\|{\mathbf{X}}\|_{F}=(\sum_{ij}X_{ij}^{2})^{1/2}$ and $\|{\mathbf{X}}\|_{\infty}=\max_{ij}|X_{ij}|$ denote the matrix element-wise $\ell_{1}$ norm, Frobenius norm and element-wise max norm, respectively, and let $\text{vech}({\mathbf{X}})=(X_{11},X_{12},\ldots,X_{1,d_{1}},\ldots,X_{d_{1}d_{1}})$ represent the vectorization of the upper triangular part of ${\mathbf{X}}$ and $\text{vec}({\mathbf{X}})$ represent the concatenation of columns in ${\mathbf{X}}$. We use $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ to denote the smallest and largest eigenvalues of a matrix, respectively.

Given a vector of $p$ response variables denoted as ${\mathbf{y}}=(y_{1},\ldots,y_{p})^{\top}$ , and a vector of $q$ covariates denoted as ${\mathbf{x}}=(x_{1},\ldots,x_{q})^{\top}$ satisfying $x_{h}\in[0,1]$ for $h\in[q]$ , we assume that $\mathbb{E}({\mathbf{y}}|{\mathbf{x}})={\bm{\beta}}_{0}+\bm{\Gamma}{\mathbf{x}}$ , where ${\bm{\beta}}_{0}\in\mathbb{R}^{p}$ , $\bm{\Gamma}\in\mathbb{R}^{p\times q}$ , and

$$\text{Cov}({\mathbf{y}}|{\mathbf{x}})=\mbox{\boldmath$\Sigma$}({\mathbf{x}})={\mathbf{B}}_{0}+\sum_{h=1}^{q}x_{h}{\mathbf{B}}_{h}, \qquad (1)$$

where ${\mathbf{B}}_{0}$ is a symmetric and positive definite (PD) matrix of dimension $p\times p$ and ${\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$ are symmetric matrices of dimension $p\times p$ . Here, ${\mathbf{B}}_{0}$ specifies the covariance at the population level and ${\mathbf{B}}_{h}$ represents the effect of covariate $x_{h}$ on the covariance matrix. Let ${\mathbf{B}}_{h}={\mathbf{Q}}_{h}{\bm{\Lambda}}_{h}{\mathbf{Q}}_{h}^{\top}$ be the eigendecomposition of ${\mathbf{B}}_{h}$ and define $p\times p$ matrices ${\bm{\Lambda}}_{h}^{+}$ and ${\bm{\Lambda}}_{h}^{-}$ such that $\Lambda_{h,jj}^{+}=\max(0,\Lambda_{h,jj})$ and $\Lambda_{h,jj}^{-}=\min(0,\Lambda_{h,jj})$ , leading to ${\bm{\Lambda}}_{h}={\bm{\Lambda}}_{h}^{+}+{\bm{\Lambda}}_{h}^{-}$ . We assume that

$$\lambda_{\text{min}}\bigg({\mathbf{B}}_{0}+\sum_{h=1}^{q}{\mathbf{B}}_{h}^{-}\bigg)>0, \qquad (2)$$

where ${\mathbf{B}}_{h}^{-}={\mathbf{Q}}_{h}{\bm{\Lambda}}_{h}^{-}{\mathbf{Q}}_{h}^{\top}$ . This is a sufficient condition for a PD $\mbox{\boldmath$\Sigma$}({\mathbf{x}})$ as

$$\lambda_{\text{min}}\{\mbox{\boldmath$\Sigma$}({\mathbf{x}})\}=\lambda_{\text{min}}\bigg({\mathbf{B}}_{0}+\sum_{h=1}^{q}x_{h}{\mathbf{B}}_{h}\bigg)\geq\lambda_{\text{min}}\bigg({\mathbf{B}}_{0}+\sum_{h=1}^{q}{\mathbf{B}}_{h}^{-}\bigg),$$

where the last inequality holds by the condition that $x_{h}\in[0,1]$ . When $p=1$ , we have $B_{h}=b_{h}$ and condition (2) reduces to $b_{0}+\sum_{h=1}^{q}\min\{b_{h},0\}>0$ , ensuring the variance to be positive regardless of the values of $x_{h}$ ’s. This is a reasonable assumption. For instance, in our motivating data example, subjects with a specific genetic variant mutation may have a lower variance in gene expression than others. In this case, condition (2) implies that these subjects should still have a positive gene expression variance as $b_{0}$ is dominant. We note that requiring the covariates to be in $[0,1]$ is not restrictive. In our data example, the covariates are genetic variants, which are often coded as $\{0,1\}$ . If a covariate takes negative values, one can add a shift such that the shifted covariate is nonnegative; if a covariate can be greater than one, one can rescale the covariate such that the scaled covariate is less than one. Such a transformation of $x_{h}$ does not affect the interpretability of ${\mathbf{B}}_{h}$ ; see details in Section 7. To expose key ideas, we assume ${\bm{\beta}}_{0}$ and $\bm{\Gamma}$ are known in the ensuing development, and focus on the estimation of ${\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$ . Extensions with estimated ${\bm{\beta}}_{0}$ and $\bm{\Gamma}$ are straightforward, but with more involved notation.
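Condition (2) can be checked numerically. The bound behind it holds because $x_{h}{\mathbf{B}}_{h}\succeq x_{h}{\mathbf{B}}_{h}^{-}\succeq{\mathbf{B}}_{h}^{-}$ whenever $0\leq x_{h}\leq 1$, since ${\mathbf{B}}_{h}^{-}$ is negative semi-definite. The following is a minimal Python sketch using NumPy; the helper names `negative_part`, `condition_2_margin` and `sigma`, as well as the toy matrices, are hypothetical and chosen only for illustration.

```python
import numpy as np

def negative_part(B):
    """Return B^- = Q diag(min(0, eigenvalues)) Q^T for a symmetric matrix B."""
    eigvals, Q = np.linalg.eigh(B)
    return (Q * np.minimum(eigvals, 0.0)) @ Q.T

def condition_2_margin(B0, B_list):
    """Smallest eigenvalue of B0 + sum_h B_h^-; condition (2) asks for > 0."""
    M = B0 + sum(negative_part(B) for B in B_list)
    return np.linalg.eigvalsh(M)[0]

def sigma(x, B0, B_list):
    """Covariance Sigma(x) = B0 + sum_h x_h B_h from model (1)."""
    return B0 + sum(xh * B for xh, B in zip(x, B_list))

# Toy example with p = 3 responses and q = 2 covariates (illustrative values).
B0 = np.eye(3)
B1 = np.array([[0.0, 0.3, 0.0], [0.3, 0.0, 0.0], [0.0, 0.0, 0.0]])
B2 = np.diag([-0.2, 0.1, 0.0])
margin = condition_2_margin(B0, [B1, B2])

# Smallest eigenvalue of Sigma(x) for random covariates drawn from [0, 1]^q.
rng = np.random.default_rng(0)
min_eigs = [np.linalg.eigvalsh(sigma(rng.uniform(size=2), B0, [B1, B2]))[0]
            for _ in range(200)]
```

In this toy setting every entry of `min_eigs` should be at least `margin`, which is the content of the displayed inequality.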

With $n$ independent observations denoted as $\{({\mathbf{y}}_{i},{\mathbf{x}}_{i}),i\in[n]\}$, where $({\mathbf{y}}_{i},{\mathbf{x}}_{i})\in\mathbb{R}^{p}\times\mathbb{R}^{q}$, we aim to estimate ${\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$ via (1). This is a challenging task, as even in the simple Gaussian case, the log likelihood function is

$$-\frac{1}{2}\sum_{i=1}^{n}\log\left|{\mathbf{B}}_{0}+\sum_{h=1}^{q}x_{ih}{\mathbf{B}}_{h}\right|-\frac{1}{2}\sum_{i=1}^{n}\text{tr}\left\{\left({\mathbf{B}}_{0}+\sum_{h=1}^{q}x_{ih}{\mathbf{B}}_{h}\right)^{-1}{\mathbf{z}}_{i}{\mathbf{z}}_{i}^{\top}\right\}+\text{const},$$

where $\text{tr}(\cdot)$ denotes the trace of a matrix and ${\mathbf{z}}_{i}={\mathbf{y}}_{i}-\mathbb{E}({\mathbf{y}}_{i})$. Due to the sums involved in the matrix trace and inverse calculations, the resulting optimization problem is not convex or biconvex with respect to ${\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$, and cannot be directly solved using iterative algorithms such as the EM and coordinate descent algorithms. To overcome this challenge, we consider a moment-based approach that is highly efficient to implement and need not make distributional assumptions on the ${\mathbf{y}}_{i}$’s.

First, note that (1) implies $\mathbb{E}(z_{ij}z_{ik})=B_{0,jk}+\sum_{h=1}^{q}x_{ih}B_{h,jk}$, that is

$$z_{ij}z_{ik}=B_{0,jk}+x_{i1}B_{1,jk}+\ldots+x_{iq}B_{q,jk}+\epsilon_{ijk}, \qquad (3)$$

where $\mathbb{E}(\epsilon_{ijk})=0$ and $B_{h,jk}$ denotes the $(j,k)$ th entry of ${\mathbf{B}}_{h}$ . Letting ${\mathbf{V}}$ be an $n\times p(p+1)/2$ matrix whose elements are $\{z_{ij}z_{ik}\}_{i=1,j=1,k=j}^{n,p,p}$ and ${\mathbf{X}}=\{X_{il}\}_{i=1,l=1}^{n,q+1}$ be the $n\times(q+1)$ design matrix (including the intercept term), the observation in (3) facilitates the estimation of ${\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$ via the following multivariate linear regression,

$${\mathbf{V}}={\mathbf{X}}{\mathbf{D}}+{\mathbf{E}}, \qquad (4)$$

where ${\mathbf{D}}$ is the $(q+1)\times p(p+1)/2$ coefficient matrix whose elements are $\{B_{h,jk}\}_{h=0,j=1,k=j}^{q,p,p}$ and ${\mathbf{E}}$ is the $n\times p(p+1)/2$ matrix whose elements are $\{\epsilon_{ijk}\}_{i=1,j=1,k=j}^{n,p,p}$ .
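The construction of ${\mathbf{V}}$ and ${\mathbf{X}}$ in (4) can be sketched as follows. This is a minimal NumPy illustration, assuming the centered responses ${\mathbf{z}}_{i}$ are already available; the function name `build_moment_regression` and the simulated inputs are hypothetical. Columns of `V` follow the row-major upper-triangular ordering used by $\text{vech}(\cdot)$.

```python
import numpy as np

def build_moment_regression(Z, X_cov):
    """Build V (n x p(p+1)/2) and X (n x (q+1)) for the regression V = X D + E.

    Z     : n x p matrix of centered responses z_i.
    X_cov : n x q matrix of covariates with entries in [0, 1].
    """
    n, p = Z.shape
    rows, cols = np.triu_indices(p)            # pairs (j, k) with j <= k
    V = Z[:, rows] * Z[:, cols]                # V[i, m] = z_ij * z_ik
    X = np.column_stack([np.ones(n), X_cov])   # intercept column, then covariates
    return V, X, (rows, cols)

# Small simulated example: n = 5 samples, p = 3 responses, q = 2 binary covariates.
rng = np.random.default_rng(1)
Z = rng.normal(size=(5, 3))
X_cov = rng.integers(0, 2, size=(5, 2))
V, X, (rows, cols) = build_moment_regression(Z, X_cov)
```

Each of the $p(p+1)/2$ columns of `V` is then one regression task in (3), sharing the same design matrix `X`.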

When both $p$ and $q$ are large, to ensure the estimability and facilitate the interpretability of ${\mathbf{D}}$, we assume ${\mathbf{D}}$ is sparse. In particular, we assume ${\mathbf{D}}$ is group sparse, corresponding to sparse effective covariates, that is, only a subset of the covariates may impact edges (termed effective covariates). We further assume ${\mathbf{D}}$ is element-wise sparse. That is, effective covariates may influence only a subset of the edges. These simultaneous sparsity assumptions are well supported by genetic studies (Gardner et al., 2003; Vierstra et al., 2020), and improve model interpretability compared with using group sparsity or element-wise sparsity alone. To encourage simultaneous sparsity, we consider the following penalty

$$\mathcal{P}_{\lambda,\lambda_{g}}({\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q})=\lambda\bigg(\sum_{h=1}^{q}\sum_{j\leq k}|B_{h,jk}|+\sum_{j<k}|B_{0,jk}|\bigg)+\lambda_{g}\sum_{h=1}^{q}\|\text{vech}({\mathbf{B}}_{h})\|_{2}, \qquad (5)$$

where ${\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$ are as defined in the model (1) and $\lambda,\lambda_{g}$ are tuning parameters.

The term $\sum_{h=1}^{q}\sum_{j\leq k}|B_{h,jk}|+\sum_{j<k}|B_{0,jk}|$ is a lasso penalty that encourages the effects of effective covariates to be sparse. We exclude ${\mathbf{B}}_{0}$ from the group sparse penalty (but not the element-wise sparse penalty), as it determines the population level network. Also, the diagonal elements of ${\mathbf{B}}_{0}$ are excluded from the element-wise sparse penalty to ensure the response variables have non-zero variances at the population level.

The term $\sum_{h=1}^{q}\|\text{vech}({\mathbf{B}}_{h})\|_{2}$ is a group lasso penalty (Yuan and Lin, 2006) that encourages the effective covariates to be sparse, achieved by regularizing ${\mathbf{B}}_{h}$ across the $p(p+1)/2$ regression tasks from (3) simultaneously. Correspondingly, this penalty term facilitates a multi-task learning approach (Argyriou et al., 2008). The penalty term in (5) is similar to the sparse group lasso considered in Simon et al. (2013) and Li et al. (2015), though it is not exactly the same, as some parameters are included in the element-wise sparsity penalty but not the group sparsity penalty. This adds complexity to the estimation procedure and theoretical analysis.
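To make the structure of the penalty in (5) concrete, the following minimal NumPy sketch evaluates it for given coefficient matrices. The function name `sparse_group_penalty` and the toy inputs are hypothetical; the sketch only mirrors the formula and is not tied to any particular software implementation.

```python
import numpy as np

def sparse_group_penalty(B_list, lam, lam_g):
    """Evaluate penalty (5) for B_list = [B_0, B_1, ..., B_q] (p x p symmetric)."""
    p = B_list[0].shape[0]
    upper = np.triu_indices(p)          # index pairs with j <= k
    strict = np.triu_indices(p, k=1)    # index pairs with j < k
    # Lasso part: off-diagonals of B_0, plus all upper-triangular entries of B_h.
    l1 = np.abs(B_list[0][strict]).sum()
    l1 += sum(np.abs(B[upper]).sum() for B in B_list[1:])
    # Group part: Euclidean norm of vech(B_h) for h >= 1 (B_0 is excluded).
    group = sum(np.linalg.norm(B[upper]) for B in B_list[1:])
    return lam * l1 + lam_g * group

# Toy example with p = 2 and q = 1.
B0 = np.array([[1.0, 0.5], [0.5, 1.0]])
B1 = np.array([[0.0, 3.0], [3.0, 4.0]])
value = sparse_group_penalty([B0, B1], lam=0.1, lam_g=0.2)
```

Here the lasso part equals $0.5+(0+3+4)=7.5$ and the group part equals $\sqrt{0^{2}+3^{2}+4^{2}}=5$, so the penalty is $0.1\times 7.5+0.2\times 5=1.75$.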

3 Estimation

Given ${\mathbf{V}}$ and ${\mathbf{X}}$, to estimate ${\mathbf{D}}$ we consider the following optimization problem,

$$\begin{aligned}&\frac{1}{2n}\|{\mathbf{V}}-{\mathbf{X}}{\mathbf{D}}\|_{F}^{2}+\mathcal{P}_{\lambda,\lambda_{g}}({\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}),\\&\text{s.t.}\,\,\lambda_{\text{min}}\bigg({\mathbf{B}}_{0}+\sum_{h=1}^{q}{\mathbf{B}}_{h}^{-}\bigg)>0,\end{aligned} \qquad (6)$$

where $\mathcal{P}_{\lambda,\lambda_{g}}(\cdot)$ is specified as in (5). When there are no covariates, (6) reduces to the standard sparse covariance estimation problem (Rothman et al., 2009), written as

$$\sum_{j=1}^{p}\sum_{k=j}^{p}\bigg\{\frac{1}{2n}\sum_{i=1}^{n}(z_{ij}z_{ik}-B_{0,jk})^{2}+\lambda|B_{0,jk}|\bigg\},$$

which is minimized at $\hat{B}_{0,jk}=S_{\lambda}(\sum_{i=1}^{n}z_{ij}z_{ik}/n)$ for $j,k\in[p]$, where $S_{\lambda}(a)=\text{sign}(a)\times\max(|a|-\lambda,0)$ is the soft-thresholding operator at $\lambda$ (Bickel et al., 2008a; Rothman et al., 2009). The optimization problem in (6) is nontrivial, as the constraint set is nonconvex. To tackle this challenge, we propose an efficient two-step estimation procedure that first solves the unconstrained optimization problem and then projects the resulting estimator onto the constrained set. This sequential estimation procedure greatly simplifies the computation and gives asymptotically consistent estimators. Similar sequential procedures have been commonly employed in statistical learning and optimization problems; see, e.g., Li et al. (2010); Zhang et al. (2020).
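For the no-covariate special case above, the soft-thresholding estimator can be sketched in a few lines of NumPy. The function names `soft_threshold` and `sparse_covariance_no_covariates` are hypothetical, and the sketch thresholds every entry of the second-moment matrix, as in the displayed objective.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft-thresholding S_lam(a) = sign(a) * max(|a| - lam, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_covariance_no_covariates(Z, lam):
    """Soft-threshold the second-moment matrix of centered data Z (n x p)."""
    S = Z.T @ Z / Z.shape[0]
    return soft_threshold(S, lam)

# Entries smaller than lam in magnitude are set to zero; others shrink by lam.
thresholded = soft_threshold(np.array([-2.0, 0.5, 3.0]), 1.0)

# Tiny data example: two samples, two responses.
B0_hat = sparse_covariance_no_covariates(np.array([[1.0, 0.0], [0.0, 1.0]]), 0.1)
```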

Algorithm 1 Sparse covariance regression with multivariate sparse group lasso

Input: Tuning parameters $\lambda$, $\lambda_{g}$, and $\widetilde{{\mathbf{D}}}$ as the initial estimator of ${\mathbf{D}}$ in (4).

repeat

Step 1: For $l\in[q+1]$, compute $\widetilde{{\mathbf{R}}}_{l}=[\widetilde{{\mathbf{R}}}_{l}^{(\text{diag})},\widetilde{{\mathbf{R}}}_{l}^{(\text{off})}]$ as:

$$\widetilde{{\mathbf{R}}}_{l}^{(\text{diag})}={\mathbf{V}}^{(\text{diag})}-{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}[-l]}\widetilde{{\mathbf{D}}}_{[-l]\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{diag})},\quad\widetilde{{\mathbf{R}}}_{l}^{(\text{off})}={\mathbf{V}}^{(\text{off})}-{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}[-l]}\widetilde{{\mathbf{D}}}_{[-l]\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{off})}.$$

Step 2: For $l=1$, update $\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}=(\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{diag})},\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{off})})$ as:

$$\begin{aligned}\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{diag})}&=\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}\bigg)^{-1}\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}\widetilde{{\mathbf{R}}}_{l}^{(\text{diag})}\bigg),\\\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{off})}&=\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}\bigg)^{-1}S_{\lambda}\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}\widetilde{{\mathbf{R}}}_{l}^{(\text{off})}\bigg).\end{aligned}$$

Step 3: For $l\neq 1$, check the condition below:

$$\bigg\|S_{\lambda}\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}\widetilde{{\mathbf{R}}}_{l}\bigg)\bigg\|_{2}<\lambda_{g}.$$

If the condition above is satisfied, set $\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}={\mathbf{0}}$. If not, update $\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}$ as:

$$\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}=\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}+\frac{\lambda_{g}}{\|\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\|_{2}}\bigg)^{-1}S_{\lambda}\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}\widetilde{{\mathbf{R}}}_{l}\bigg).$$

until the algorithm converges.

Step 4: Compute $\widehat{\mathbf{B}}_{0},\widehat{{\mathbf{B}}}_{1},\ldots,\widehat{{\mathbf{B}}}_{q}$ as in (7).

To facilitate estimation, we reorder the columns of ${\mathbf{D}}$ such that ${\mathbf{D}}=[{\mathbf{D}}^{(\text{diag})},{\mathbf{D}}^{(\text{off})}]$, where ${\mathbf{D}}^{(\text{diag})}$ is a $(q+1)\times p$ matrix collecting diagonals of ${\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$, and ${\mathbf{D}}^{(\text{off})}$ is a $(q+1)\times p(p-1)/2$ matrix collecting off-diagonals of ${\mathbf{B}}_{0},{\mathbf{B}}_{1},\ldots,{\mathbf{B}}_{q}$. Similarly, we reorder the columns of ${\mathbf{V}}$ in (4) such that ${\mathbf{V}}=[{\mathbf{V}}^{(\text{diag})},{\mathbf{V}}^{(\text{off})}]$. For $l\in[q+1]$, let ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}$, ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{diag})}$ and ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{off})}$ be the $l$th row of ${\mathbf{D}}$, ${\mathbf{D}}^{(\text{diag})}$ and ${\mathbf{D}}^{(\text{off})}$, respectively, and let $[-l]=[q+1]\backslash\{l\}$. Write ${\mathbf{R}}_{l}={\mathbf{V}}-{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}[-l]}{\mathbf{D}}_{[-l]\mathrel{\scalebox{0.4}{$\bullet$}}}$ as the $n\times p(p+1)/2$ partial residual matrix. We first summarize our optimization procedure in Algorithm 1, and then describe each step in detail.

For optimization, we adopt the blockwise coordinate descent algorithm as described in Algorithm 1. For $l=1$ , the solution to ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{diag})}$ is obtained by the least squares estimator, as the diagonal elements of ${\mathbf{B}}_{0}$ are not penalized, and the solution to ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}^{(\text{off})}$ is obtained by the lasso estimator, as the off-diagonal elements of ${\mathbf{B}}_{0}$ are not penalized by the group lasso penalty. For $l=2,\ldots,q+1$ , the solution to ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}$ is obtained by the sparse group lasso estimator. In particular, it can be shown that the Karush-Kuhn-Tucker condition for the sparse group lasso (Simon et al., 2013) is satisfied with ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}={\mathbf{0}}$ if

$$\bigg\|S_{\lambda}\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}{\mathbf{R}}_{l}\bigg)\bigg\|_{2}<\lambda_{g},$$

where ${\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}$ is the $l$th column of ${\mathbf{X}}$ in (4) and $S_{\lambda}({\mathbf{a}})$ is the element-wise soft-thresholding operator at $\lambda$, that is, $\{S_{\lambda}({\mathbf{a}})\}_{j}=\text{sign}(a_{j})\times\max(|a_{j}|-\lambda,0)$. When ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\neq{\mathbf{0}}$, the solution for ${\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}$ is determined by

$$\widetilde{{\mathbf{D}}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}=\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}+\frac{\lambda_{g}}{\|{\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\|_{2}}\bigg)^{-1}S_{\lambda}\bigg(\frac{1}{n}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}{\mathbf{R}}_{l}\bigg).$$
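Steps 1-3 of Algorithm 1 can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `fit_unconstrained`, the zero initial value, the stopping rule and the simulated data are all hypothetical, and a boolean mask `is_diag` stands in for the column reordering into diagonal and off-diagonal blocks. Because ${\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}$ is a single column, $c_{l}={\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}^{\top}{\mathbf{X}}_{\mathrel{\scalebox{0.4}{$\bullet$}}l}/n$ is a scalar, and taking norms on both sides of the update above gives $\|{\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\|_{2}=(\|S_{\lambda}(\cdot)\|_{2}-\lambda_{g})/c_{l}$. The sketch therefore uses the equivalent closed form $c_{l}^{-1}(1-\lambda_{g}/\|S_{\lambda}(\cdot)\|_{2})S_{\lambda}(\cdot)$, which is a rearrangement of the displayed update rather than a formula stated in the text.

```python
import numpy as np

def soft_threshold(a, lam):
    """Element-wise soft-thresholding S_lam(a)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def fit_unconstrained(V, X, is_diag, lam, lam_g, n_iter=500, tol=1e-8):
    """Blockwise coordinate descent for the unconstrained problem (Steps 1-3).

    V       : n x m response matrix, m = p(p+1)/2.
    X       : n x (q+1) design matrix, intercept in column 0 (l = 1 in the text).
    is_diag : boolean mask of length m marking columns for diagonal entries.
    """
    n, m = V.shape
    D = np.zeros((X.shape[1], m))                # hypothetical zero start
    c = (X ** 2).sum(axis=0) / n                 # scalars X_l^T X_l / n
    for _ in range(n_iter):
        D_old = D.copy()
        for l in range(X.shape[1]):
            # Partial residual: remove the fit of every row except row l.
            R_l = V - X @ D + np.outer(X[:, l], D[l])
            a = X[:, l] @ R_l / n
            if l == 0:
                # Intercept row: least squares on diagonal entries,
                # lasso (soft-thresholding) on off-diagonal entries.
                D[l] = np.where(is_diag, a, soft_threshold(a, lam)) / c[l]
            else:
                s = soft_threshold(a, lam)
                norm_s = np.linalg.norm(s)
                if norm_s <= lam_g:
                    D[l] = 0.0                   # whole covariate dropped
                else:
                    D[l] = (1.0 - lam_g / norm_s) * s / c[l]
        if np.max(np.abs(D - D_old)) < tol:      # hypothetical stopping rule
            break
    return D

# Simulated example: n = 200, p = 3 responses, q = 2 covariates in [0, 1].
rng = np.random.default_rng(0)
n, p, q = 200, 3, 2
Z = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), rng.uniform(size=(n, q))])
rows, cols = np.triu_indices(p)
V = Z[:, rows] * Z[:, cols]
D_hat = fit_unconstrained(V, X, rows == cols, lam=0.1, lam_g=0.2)
```

With `lam = lam_g = 0` the iteration reduces to coordinate-wise least squares, which gives a simple sanity check against an ordinary least squares fit.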

Steps 1-3 in Algorithm 1 solve the unconstrained problem in (6). In Step 4, given the estimators $\widetilde{{\mathbf{B}}}_{0},\widetilde{{\mathbf{B}}}_{1},\ldots,\widetilde{{\mathbf{B}}}_{q}$ from Steps 1-3, we set

$$\widehat{\mathbf{B}}_{0}=(1+\delta)^{-1}\widetilde{{\mathbf{B}}}_{0}+\{\delta/(1+\delta)\}{\mathbf{I}}_{p},\quad\widehat{\mathbf{B}}_{h}=(1+\delta)^{-1}\widetilde{{\mathbf{B}}}_{h},\,\,h\in[q],$$

where $\delta=\text{max}[0,-\lambda_{\text{min}}(\widetilde{{\mathbf{B}}}_{0}+\sum_{h=1}^{q}\widetilde{{\mathbf{B}}}_{h}^{-})]$. These estimators $\widehat{\mathbf{B}}_{0},\widehat{\mathbf{B}}_{1},\ldots,\widehat{\mathbf{B}}_{q}$ ensure $\mbox{\boldmath$\Sigma$}({\mathbf{x}})$ is positive semi-definite, and give a Ledoit-Wolf type shrinkage estimator (Ledoit and Wolf, 2004),

$$\widehat{\mbox{\boldmath$\Sigma$}}({\mathbf{x}})=\frac{1}{1+\delta}\bigg(\widetilde{{\mathbf{B}}}_{0}+\sum_{h=1}^{q}x_{h}\widetilde{{\mathbf{B}}}_{h}\bigg)+\frac{\delta}{1+\delta}{\mathbf{I}}_{p}. \qquad (7)$$

As $n$ increases, it follows from Theorem 2 that $\widetilde{\mathbf{B}}_{0},\widetilde{\mathbf{B}}_{1},\ldots,\widetilde{\mathbf{B}}_{q}$ estimated from Steps 1-3 are consistent and satisfy the constraint in (6) with high probability. That is, as $n$ increases, $\delta$ in (7) converges to zero. See more discussions after Theorem 2.
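The projection in Step 4 can be sketched as follows. This is a minimal NumPy illustration; the function names `negative_part` and `project_coefficients` and the toy matrices are hypothetical. The sketch relies on the identity $\widehat{\mathbf{B}}_{0}+\sum_{h}\widehat{\mathbf{B}}_{h}^{-}=(1+\delta)^{-1}(\widetilde{{\mathbf{B}}}_{0}+\sum_{h}\widetilde{{\mathbf{B}}}_{h}^{-}+\delta{\mathbf{I}}_{p})$, whose smallest eigenvalue is nonnegative by the choice of $\delta$.

```python
import numpy as np

def negative_part(B):
    """Return B^- = Q diag(min(0, eigenvalues)) Q^T for a symmetric matrix B."""
    eigvals, Q = np.linalg.eigh(B)
    return (Q * np.minimum(eigvals, 0.0)) @ Q.T

def project_coefficients(B0_tilde, Bh_tilde):
    """Shrink the unconstrained estimates toward the identity as in Step 4."""
    M = B0_tilde + sum(negative_part(B) for B in Bh_tilde)
    delta = max(0.0, -np.linalg.eigvalsh(M)[0])
    p = B0_tilde.shape[0]
    B0_hat = (B0_tilde + delta * np.eye(p)) / (1.0 + delta)
    Bh_hat = [B / (1.0 + delta) for B in Bh_tilde]
    return B0_hat, Bh_hat, delta

# Toy example in which the unconstrained estimate violates the constraint.
B0_tilde = np.array([[1.0, 2.0], [2.0, 1.0]])
Bh_tilde = [np.diag([-0.5, 0.2])]
B0_hat, Bh_hat, delta = project_coefficients(B0_tilde, Bh_tilde)
```

When the unconstrained estimates already satisfy the constraint, `delta` is zero and the estimates are returned unchanged.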

Two parameters $\lambda$ and $\lambda_{g}$ in (6) require tuning. In our procedure, they are jointly selected via $L$ -fold cross validation. We let $L=5$ in our simulation studies and real data analysis.

4 Theoretical Properties

In this section, we investigate the convergence rate of the estimator from the multivariate sparse group lasso regression in (6). Theoretical properties of the sparse group lasso estimator in multivariate linear regressions have been investigated in Li et al. (2015). However, they assumed a Gaussian distribution, which can be restrictive when modeling $y_{ij}y_{ik}$ , the product of expressions from genes $j$ and $k$ in sample $i$ . Moreover, the penalty term (5) is more involved than the sparse group lasso penalty in Li et al. (2015), as ${\mathbf{B}}_{0}$ is excluded from the group sparsity penalty and the diagonal elements of ${\mathbf{B}}_{0}$ are excluded from the element-wise sparsity.

We first discuss notation. After rearranging the columns of ${\mathbf{D}}$ such that ${\mathbf{D}}=[{\mathbf{D}}^{\text{(diag)}},{\mathbf{D}}^{(\text{off})}]$ as in Section 3, define an index set $\mathcal{J}_{0}$ such that

$$\mathcal{J}_{0}=\{(l,m):l\in[q+1],m\in[p(p+1)/2]\}\setminus\{(1,1),\ldots,(1,p)\}, \qquad (8)$$

containing indices in ${\mathbf{D}}$ that are included in the $\ell_{1}$ penalty in (5). For a matrix ${\mathbf{D}}\in\mathbb{R}^{(q+1)\times p(p+1)/2}$, let $\mathcal{J}_{1}({\mathbf{D}})=\{(l,m):D_{lm}\neq 0\}$ be the index set of non-zero elements in ${\mathbf{D}}$, $\mathcal{J}_{2}({\mathbf{D}})=\{l:{\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\neq{\mathbf{0}},l\neq 1\}$ be the index set of non-zero rows excluding the first row and $\mathcal{J}^{c}_{2}({\mathbf{D}})=[q+1]\setminus\{1\}\setminus\mathcal{J}_{2}({\mathbf{D}})$. For an index set $\mathcal{J}\subseteq\{(l,m):l\in[q+1],m\in[p(p+1)/2]\}$, define ${\mathbf{D}}_{\mathcal{J}}$ such that the $(l,m)$th element of ${\mathbf{D}}_{\mathcal{J}}$ equals the $(l,m)$th element of ${\mathbf{D}}$ if $(l,m)\in\mathcal{J}$ and is 0 otherwise. For a given index set $\mathcal{J}^{\prime}\subseteq\{2,\ldots,q+1\}$, define ${\mathbf{D}}_{\mathcal{J}^{\prime}}$ such that the $l$th row of ${\mathbf{D}}_{\mathcal{J}^{\prime}}$ equals the $l$th row of ${\mathbf{D}}$ if $l\in\mathcal{J}^{\prime}$ and is ${\mathbf{0}}$ otherwise.

The unconstrained objective function in (6) can be rewritten as

$$\frac{1}{2n}\|{\mathbf{V}}-{\mathbf{X}}{\mathbf{D}}\|_{F}^{2}+\lambda\|{\mathbf{D}}_{\mathcal{J}_{0}}\|_{1}+\lambda_{g}\sum_{l=2}^{q+1}\|{\mathbf{D}}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\|_{2}. \qquad (9)$$

In our theoretical analysis, we will focus on the minimizer of (9), which is not subject to the PD constraint in (6). Theorem 2 below implies that there exists $n_{0}$ such that, for $n\geq n_{0}$, the constrained estimator from (6) reduces to the unconstrained estimator from (9) (see the remark after Theorem 2).

4.1 Convergence rate

Let ${\mathbf{B}}_{0}^{\ast},{\mathbf{B}}_{1}^{\ast},\ldots,{\mathbf{B}}_{q}^{\ast}$ be the true coefficient matrices in (1), and let ${\mathbf{D}}^{\ast}$ be the coefficient matrix in (4) calculated from ${\mathbf{B}}_{0}^{\ast},{\mathbf{B}}_{1}^{\ast},\ldots,{\mathbf{B}}_{q}^{\ast}$ . Let $s=|\mathcal{J}_{1}({\mathbf{D}}^{\ast})|$ and $r=|\mathcal{J}_{2}({\mathbf{D}}^{\ast})|$ denote the number of nonzero entries and groups in ${\mathbf{D}}^{\ast}$ , respectively. Let $|\cdot|$ denote the cardinality of a set. We first state regularity conditions.

Assumption 1

Assume $\lambda_{\text{min}}({\mathbf{B}}_{0}^{\ast}+\sum_{h=1}^{q}{{\mathbf{B}}_{h}^{\ast}}^{-})>0$ and $x_{ih}\in[0,1]$, for $i\in[n],h\in[q]$.

Assumption 2

For any non-zero matrix $\mbox{\boldmath$\Delta$}\in\mathbb{R}^{(q+1)\times p(p+1)/2}$ that satisfies

$$\|\mbox{\boldmath$\Delta$}_{\mathcal{J}^{c}_{1}({\mathbf{D}}^{\ast})}\|_{1}+2\sqrt{s/r}\sum_{l\in\mathcal{J}^{c}_{2}({\mathbf{D}}^{\ast})}\|\mbox{\boldmath$\Delta$}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\|_{2}\leq 3\|\mbox{\boldmath$\Delta$}_{\mathcal{J}_{1}({\mathbf{D}}^{\ast})}\|_{1}+2\sqrt{s/r}\sum_{l\in\mathcal{J}_{2}({\mathbf{D}}^{\ast})}\|\mbox{\boldmath$\Delta$}_{l\mathrel{\scalebox{0.4}{$\bullet$}}}\|_{2},$$

there exists a constant $\kappa>0$ such that

$$\frac{\|{\mathbf{X}}\mbox{\boldmath$\Delta$}\|_{F}^{2}}{n\|\mbox{\boldmath$\Delta$}\|_{F}^{2}}>\kappa.$$

Assumption 2 is a restricted eigenvalue condition: it requires the quadratic form induced by ${\mathbf{X}}^{\top}{\mathbf{X}}/n$ to be bounded below by $\kappa$ over the set of $\mbox{\boldmath$\Delta$}$ specified in Assumption 2. Such conditions are commonly employed in the literature (Hastie et al., 2015).

We first derive a deterministic upper bound for the prediction error of $\widehat{\mathbf{D}}$ obtained from minimizing (9).

Theorem 1

Suppose Assumptions 1-2 hold, $\lambda\geq(2/n)\|{\mathbf{X}}^{\top}{\mathbf{E}}\|_{\infty}$ and $\lambda_{g}=C\lambda\sqrt{s/r}$ for some constant $C>0$ . Then, the solution $\widehat{{\mathbf{D}}}$ that minimizes (9) satisfies

$$\frac{1}{n}\|{\mathbf{X}}(\widehat{{\mathbf{D}}}-{\mathbf{D}}^{\ast})\|_{F}^{2}\leq\frac{(3+2C)^{2}\lambda^{2}s}{\kappa},$$

where $s$ is the number of nonzero entries in ${\mathbf{D}}^{\ast}$ and $\kappa$ is as defined in Assumption 2.

Theorem 1 shows that the mean square prediction error is bounded by a factor of $\lambda^{2}$ . The bound also depends on the restricted eigenvalue constant $\kappa$ and the sparsity of the true model.

Theorem 2

Suppose Assumptions 1-2 hold. Assume $\epsilon_{ijk}$ ’s are zero-mean sub-exponential random variables, and $\epsilon_{ijk}$ and $\epsilon_{i^{\prime}j^{\prime}k^{\prime}}$ are independent for $i\neq i^{\prime}$ . Let $\|\epsilon_{ijk}\|_{\psi_{1}}=\sup_{d\geq 1}d^{-1}(\mathbb{E}|\epsilon_{ijk}|^% {d})^{1/d}$ be the sub-exponential norm of $\epsilon_{ijk}$ and $K=\max_{ijk}\|\epsilon_{ijk}\|_{\psi_{1}}$ . For constants $c>0,C>0$ and $A>c^{-0.5}$ , let

\displaystyle\lambda=2KA\sqrt{\frac{\log\{p(p+1)(q+1)\}}{n}}\quad\text{and}% \quad\lambda_{g}=C\lambda\sqrt{\frac{s}{r}}.

Then, with probability at least $1-\{p(p+1)(q+1)\}^{(1-cA^{2})}$ , it holds that

\displaystyle\|\widehat{{\mathbf{D}}}-{\mathbf{D}}^{\ast}\|_{F}\leq\frac{2(3+2% C)KA}{\kappa}\sqrt{\frac{s\log\{p(p+1)(q+1)\}}{n}}.

Theorem 2 shows that the Frobenius norm of the estimation error is at most of order $\sqrt{s\max(\log p,\log q)/n}$ . Here, we do not assume that $\epsilon_{ijk}$ and $\epsilon_{ij^{\prime}k^{\prime}}$ are independent, as $z_{ij}z_{ik}$ and $z_{ij^{\prime}}z_{ik^{\prime}}$ from gene pairs $(j,k)$ and $(j^{\prime},k^{\prime})$ can be correlated. Hence, under our setting, the error terms across element-wise regression tasks can be correlated. Comparable convergence rates, up to a logarithmic factor, have been derived for sparse group lasso estimators in univariate regressions (Cai et al., 2022; Zhang and Li, 2023).
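As a rough numerical illustration of Theorem 2, the following sketch evaluates the tuning parameters and the Frobenius-norm bound for user-supplied constants. The function name `theorem2_rates` and all numeric inputs are hypothetical; in practice $K$ , $\kappa$ , $s$ and $r$ are unknown, so this only shows how the bound scales.

```python
import math

def theorem2_rates(n, p, q, s, r, K, A, C, kappa):
    """Evaluate lambda, lambda_g and the Frobenius bound stated in Theorem 2."""
    log_term = math.log(p * (p + 1) * (q + 1))
    lam = 2.0 * K * A * math.sqrt(log_term / n)
    lam_g = C * lam * math.sqrt(s / r)
    bound = 2.0 * (3.0 + 2.0 * C) * K * A / kappa * math.sqrt(s * log_term / n)
    return lam, lam_g, bound

# Hypothetical constants, chosen only to show how the bound scales with n.
small_n = theorem2_rates(n=200, p=50, q=30, s=100, r=2, K=1.0, A=2.0, C=1.0, kappa=0.5)
large_n = theorem2_rates(n=800, p=50, q=30, s=100, r=2, K=1.0, A=2.0, C=1.0, kappa=0.5)
```

With all other inputs held fixed, quadrupling $n$ halves both $\lambda$ and the bound, which reflects the $\sqrt{1/n}$ dependence in the theorem.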

Theorem 2 also suggests that estimators $\widetilde{{\mathbf{B}}}_{0},\widetilde{{\mathbf{B}}}_{1},\ldots,\widetilde{{% \mathbf{B}}}_{q}$ from minimizing (9) satisfy the PD constraint in (6) as $n$ increases. Specifically, it holds by Weyl’s inequality that

\displaystyle\lambda_{min}({\mathbf{B}}_{0}^{\ast}+\sum_{h=1}^{q}{{\mathbf{B}}% _{h}^{\ast}}^{-})\leq\lambda_{min}(\widetilde{{\mathbf{B}}}_{0}+\sum_{h=1}^{q}% \widetilde{{\mathbf{B}}}_{h}^{-})-\lambda_{min}(\widetilde{{\mathbf{B}}}_{0}+% \sum_{h=1}^{q}\widetilde{{\mathbf{B}}}_{h}^{-}-{\mathbf{B}}_{0}^{\ast}-\sum_{h% =1}^{q}{{\mathbf{B}}_{h}^{\ast}}^{-}).

Since $\lambda_{min}(\widetilde{{\mathbf{B}}}_{0}+\sum_{h=1}^{q}\widetilde{{\mathbf{B% }}}_{h}^{-}-{\mathbf{B}}_{0}^{\ast}-\sum_{h=1}^{q}{{\mathbf{B}}_{h}^{\ast}}^{-})$ converges to zero as $n$ increases and $\lambda_{\text{min}}({\mathbf{B}}_{0}^{\ast}+\sum_{h=1}^{q}{{\mathbf{B}}_{h}^{% \ast}}^{-})>0$ by Assumption 1, we have, for a sufficiently large sample size $n_{0}$ ,

\displaystyle\lambda_{\text{min}}(\widetilde{{\mathbf{B}}}_{0}+\sum_{h=1}^{q}% \widetilde{{\mathbf{B}}}_{h}^{-})>0.

That is, the unconstrained optimizer of (9) satisfies the PD constraint in (6) for any $n\geq n_{0}$ .
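The eigenvalue argument above can be checked numerically. The sketch below is a generic illustration of Weyl's inequality, $\lambda_{\min}({\mathbf{A}}+{\mathbf{E}})\geq\lambda_{\min}({\mathbf{A}})+\lambda_{\min}({\mathbf{E}})$ , on hypothetical random matrices; `truth` and `estimate` are stand-ins and are not the coefficient matrices of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6

# Hypothetical positive definite "truth" and a small symmetric perturbation.
G = rng.standard_normal((p, p))
truth = G @ G.T / p + 0.5 * np.eye(p)
P = rng.standard_normal((p, p))
perturbation = 0.01 * (P + P.T)
estimate = truth + perturbation

lam_truth = np.linalg.eigvalsh(truth).min()
lam_est = np.linalg.eigvalsh(estimate).min()
lam_diff = np.linalg.eigvalsh(estimate - truth).min()

# Weyl: lambda_min(estimate) >= lambda_min(truth) + lambda_min(estimate - truth).
weyl_holds = lam_est >= lam_truth + lam_diff - 1e-12
```

When the perturbation is small relative to the smallest eigenvalue of the truth, the estimate stays positive definite, which is the mechanism behind the argument for $n\geq n_{0}$ .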

4.2 Statistical Inference via Debiasing

In this section, we consider inference for ${\mathbf{D}}^{\ast}$ under the proposed multivariate sparse group lasso regression. Inspired by recent advances on debiasing in high-dimensional linear regressions (Javanmard and Montanari, 2014; Zhang and Zhang, 2014; Cai et al., 2022), we design a debiased lasso estimator based on (9) that debiases each column of the proposed estimator $\widehat{{\mathbf{D}}}$ separately and makes inference on the true parameter matrix ${\mathbf{D}}^{\ast}$ . Since $\widehat{{\mathbf{D}}}$ is of dimension $(q+1)\times p(p+1)/2$ , carrying out the debiasing column by column, with each column of dimension $q+1$ , significantly reduces the computational cost. The price of this computational gain is a potential loss of efficiency compared to debiasing all columns in $\widehat{{\mathbf{D}}}$ simultaneously. We also note that debiasing all columns in $\widehat{{\mathbf{D}}}$ simultaneously may not be feasible under our framework, as one needs to estimate $\text{Cov}(z_{ij}z_{ik},z_{ij^{\prime}}z_{ik^{\prime}})$ , the analytical form of which is difficult to derive without distributional assumptions on ${\mathbf{z}}_{i}$ . Besides non-Gaussian errors, our procedure also faces the challenge of heteroskedasticity, as $\text{Var}(\epsilon_{ijk})$ may depend on ${\mathbf{x}}_{i}$ . Next, we detail our approach.

Denote $\widehat{{\bm{\Theta}}}={\mathbf{X}}^{\top}{\mathbf{X}}/n$ . Consider the following debiased estimator $\widehat{{\mathbf{D}}}^{u}$ :

\displaystyle\widehat{{\mathbf{D}}}^{u}=\widehat{{\mathbf{D}}}+\frac{1}{n}{\mathbf{M}}{\mathbf{X}}^{\top}({\mathbf{V}}-{\mathbf{X}}\widehat{{\mathbf{D}}}),

(10)

where ${\mathbf{M}}=[{\mathbf{m}}_{1},\ldots,{\mathbf{m}}_{q+1}]^{\top}\in\mathbb{R}^% {(q+1)\times(q+1)}$ , and ${\mathbf{m}}_{l}$ , $l\in[q+1]$ is a solution to

		$\displaystyle{\mathbf{m}}_{l}=\arg\min_{{\mathbf{m}}}{\mathbf{m}}^{\top}\widehat{{\bm{\Theta}}}{\mathbf{m}}$
		$\displaystyle\text{subject to}\quad\|\widehat{{\bm{\Theta}}}{\mathbf{m}}-{\mathbf{e}}_{l}\|_{\infty}\leq\mu,$
		$\displaystyle\quad\quad\quad\quad\quad\;\|{\mathbf{X}}{\mathbf{m}}\|_{\infty}\leq n^{\beta},\quad\text{for any fixed}\quad 1/4<\beta<1/2,$		(11)

where $\mu$ is to be specified later and ${\mathbf{e}}_{l}$ is the $l$ th vector in the canonical basis of $\mathbb{R}^{q+1}$ . By construction, ${\mathbf{M}}$ is a function of the design matrix ${\mathbf{X}}$ only, and can be seen as an approximation to the inverse of ${\mathbf{X}}^{\top}{\mathbf{X}}/n$ . Assuming a random design where the ${\mathbf{x}}_{i}$ ’s are i.i.d. sub-Gaussian and $\text{Var}({\mathbf{x}}_{i})$ has bounded eigenvalues and diagonal elements, Javanmard and Montanari (2014) showed that the optimizations in (11) are feasible with high probability.
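To make the correction step in (10) concrete, here is a minimal sketch under simplifying assumptions. The function name `debias`, the toy dimensions, the intercept column in the design and the shrunken stand-in for $\widehat{{\mathbf{D}}}$ are all hypothetical. Instead of solving the constrained program for ${\mathbf{M}}$ , the example uses a low-dimensional shortcut in which ${\mathbf{M}}$ is the exact inverse of ${\mathbf{X}}^{\top}{\mathbf{X}}/n$ , so the sup-norm constraint holds with $\mu=0$ ; the second constraint on $\|{\mathbf{X}}{\mathbf{m}}\|_{\infty}$ is not checked here.

```python
import numpy as np

def debias(D_hat, X, V, M):
    """One-step correction in (10): D_hat + M X^T (V - X D_hat) / n."""
    n = X.shape[0]
    return D_hat + M @ X.T @ (V - X @ D_hat) / n

rng = np.random.default_rng(1)
n, q, m = 200, 3, 4                      # m plays the role of p(p+1)/2
# Assumption: the design has an intercept column followed by q covariates.
X = np.column_stack([np.ones(n), rng.uniform(size=(n, q))])
D_true = rng.standard_normal((q + 1, m))
V = X @ D_true + 0.1 * rng.standard_normal((n, m))

# Hypothetical stand-in for a penalized estimate: the truth shrunk toward zero.
D_hat = 0.8 * D_true

# Low-dimensional shortcut (assumption): M is the exact inverse of X^T X / n.
Theta_hat = X.T @ X / n
M = np.linalg.inv(Theta_hat)
D_u = debias(D_hat, X, V, M)
D_ols = np.linalg.lstsq(X, V, rcond=None)[0]
```

In this special case the correction removes the shrinkage entirely and `D_u` coincides with the least-squares solution, which is a useful sanity check. In the high-dimensional regime the inverse does not exist and ${\mathbf{M}}$ has to come from the constrained program instead.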

Define an index mapping $T:[p(p+1)/2]\rightarrow[p]\times[p]$ such that, for $m\in[p(p+1)/2]$ and a $p\times p$ symmetric matrix ${\mathbf{A}}$ , $T(m)=(j,k)$ if $[\text{vech}({\mathbf{A}})]_{m}=A_{jk}$ . Next, we state our result on debiasing $\widehat{\mathbf{D}}$ for inference.
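For concreteness, the index mapping $T$ can be tabulated as follows. This sketch assumes that $\text{vech}(\cdot)$ stacks the lower triangle column by column, a common convention that is not spelled out in this section; the helper name `vech_index_map` is hypothetical.

```python
def vech_index_map(p):
    """Return T as a dict {m: (j, k)} with 1-based indices.

    Assumption: vech stacks the lower triangle (diagonal included)
    column by column.
    """
    T = {}
    m = 1
    for k in range(1, p + 1):          # column index
        for j in range(k, p + 1):      # rows on and below the diagonal
            T[m] = (j, k)
            m += 1
    return T

T = vech_index_map(3)
```

For $p=3$ the map has $p(p+1)/2=6$ entries, one per distinct element of a symmetric matrix.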

Theorem 3

Suppose the conditions in Theorem 2 are satisfied and the optimizations in (11) are feasible. Assume $\mathbb{E}(\epsilon_{ijk}^{2})=\sigma_{ijk}^{2}$ and $\mathbb{E}(|\epsilon_{ijk}|^{2+a})<C_{2}\sigma_{ijk}^{2+a}$ for some $a>0$ and $C_{2}>0$ .

(1)

With probability at least $1-\{p(p+1)(q+1)\}^{(1-cA^{2})}$ , $\widehat{{\mathbf{D}}}^{u}$ can be decomposed as

\displaystyle\sqrt{n}(\widehat{{\mathbf{D}}}^{u}-{\mathbf{D}}^{\ast})={\bm{% \Lambda}}+{\mathbf{W}},

where ${\bm{\Lambda}}=n^{1/2}({\mathbf{M}}\widehat{{\bm{\Theta}}}-{\mathbf{I}}_{q+1})% ({\mathbf{D}}^{\ast}-\widehat{{\mathbf{D}}})$ such that

\displaystyle\|{\bm{\Lambda}}\|_{\infty}\leq\frac{8(2+C)^{2}KA}{\kappa}\mu s% \sqrt{\log\{p(p+1)(q+1)\}},

and ${\mathbf{W}}=n^{-1/2}{\mathbf{M}}{\mathbf{X}}^{\top}{\mathbf{E}}$ such that $W_{lm}$ with $T(m)=(j,k)$ is asymptotically normal with mean zero and variance $n^{-1}\sum_{i=1}^{n}({\mathbf{M}}{\mathbf{X}}^{\top})_{li}^{2}\sigma_{ijk}^{2}$ .

(2)

When $\mu=\sqrt{\log\{p(p+1)(q+1)\}/n}$ and $s\log\{p(p+1)(q+1)\}/\sqrt{n}=o(1)$ , an asymptotic two-sided $100(1-\alpha)\%$ confidence interval for $D_{lm}^{\ast}$ , where $T(m)=(j,k)$ , is

\displaystyle\widehat{D}_{lm}^{u}\pm\Phi^{-1}(1-\alpha/2)n^{-1}\sqrt{\sum_{i=1% }^{n}({\mathbf{M}}{\mathbf{X}}^{\top})_{li}^{2}\sigma_{ijk}^{2}}.

In Theorem 3, the errors are not assumed to be Gaussian and they can be heteroskedastic. Calculating the above confidence interval requires estimating the variance of $W_{lm}$ , where $T(m)=(j,k)$ . This can be consistently estimated by the empirical variance (Bühlmann and Van de Geer, 2015)

\displaystyle\frac{1}{n}\sum_{i=1}^{n}\bigg{\{}({\mathbf{M}}{\mathbf{X}}^{\top})_{li}\hat{\epsilon}_{ijk}-\frac{1}{n}\sum_{r=1}^{n}({\mathbf{M}}{\mathbf{X}}^{\top})_{lr}\hat{\epsilon}_{rjk}\bigg{\}}^{2},

(12)

where $\hat{\epsilon}_{ijk}=({\mathbf{V}}-{\mathbf{X}}\widehat{{\mathbf{D}}}^{u})_{im}$ with $T(m)=(j,k)$ .
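The interval in Theorem 3 with the plug-in variance (12) can be sketched as below. The function name `debiased_ci` and the toy data are hypothetical, and indices are 0-based in the code. The half-width uses the fact that (12) estimates the variance of $W_{lm}$ , so the standard error of $\widehat{D}_{lm}^{u}$ is the square root of that estimate divided by $n$ .

```python
import numpy as np
from statistics import NormalDist

def debiased_ci(D_u, X, V, M, l, m, alpha=0.05):
    """Confidence interval for entry (l, m) of D*, with 0-based l and m."""
    n = X.shape[0]
    resid = V[:, m] - X @ D_u[:, m]          # residuals for column m
    weights = (M @ X.T)[l, :]                # (M X^T)_{l i}, i = 1, ..., n
    terms = weights * resid
    var_w = np.mean((terms - terms.mean()) ** 2)   # empirical variance in (12)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(var_w / n)
    return D_u[l, m] - half_width, D_u[l, m] + half_width

# Toy low-dimensional example (assumption): M is the exact inverse of X^T X / n,
# in which case the debiased estimate equals the least-squares fit.
rng = np.random.default_rng(2)
n, q = 400, 2
X = np.column_stack([np.ones(n), rng.uniform(size=(n, q))])
D_true = np.array([[1.0], [0.5], [0.0]])
V = X @ D_true + rng.standard_normal((n, 1))
M = np.linalg.inv(X.T @ X / n)
D_u = np.linalg.lstsq(X, V, rcond=None)[0]
lower, upper = debiased_ci(D_u, X, V, M, l=1, m=0)
```

Because the variance is estimated from products of weights and residuals, the interval does not rely on a common error variance across subjects, which matches the heteroskedastic setting described above.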

5 Simulation Studies

In this section, we investigate the finite sample performance of our proposed method, referred to as SparseCovReg, and compare it with four alternative methods, including:
$\bullet$ DenseSample: standard sample covariance estimator ${\mathbf{S}}=\sum_{i=1}^{n}{\mathbf{y}}_{i}{\mathbf{y}}_{i}^{\top}/n$ ,
$\bullet$ SparseSample: soft-thresholding sample covariance estimator $S_{\lambda}({\mathbf{S}})$ , where $S_{\lambda}(\cdot)$ is the element-wise soft-thresholding operator at $\lambda$ (Rothman et al., 2009),
$\bullet$ CovReg: quadratic covariance regression estimator in Hoff and Niu (2012),
$\bullet$ DenseCovReg: multivariate regression estimator without sparsity, i.e., $({\mathbf{X}}^{\top}{\mathbf{X}})^{-1}{\mathbf{X}}^{\top}{\mathbf{V}}$ .
The tuning parameters in SparseCovReg and SparseSample are selected using 5-fold cross validation.
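Three of the baselines listed above have closed forms and can be sketched in a few lines. The function names are hypothetical, and the sketch applies the threshold to every entry, including the diagonal; whether the diagonal is thresholded in the actual comparison is not stated here.

```python
import numpy as np

def sample_covariance(Y):
    """DenseSample: S = sum_i y_i y_i^T / n for mean-zero rows y_i."""
    return Y.T @ Y / Y.shape[0]

def soft_threshold(S, lam):
    """SparseSample: element-wise soft-thresholding of S at level lam."""
    return np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)

def dense_cov_reg(X, V):
    """DenseCovReg: least-squares coefficients (X^T X)^{-1} X^T V."""
    return np.linalg.solve(X.T @ X, X.T @ V)

# Small hand-made matrix to show the effect of thresholding at 0.1.
S_demo = np.array([[1.0, 0.3, -0.05],
                   [0.3, 1.0, 0.6],
                   [-0.05, 0.6, 1.0]])
S_sparse = soft_threshold(S_demo, 0.1)
```

Entries smaller than the threshold in absolute value are set to zero, and the remaining entries are shrunk toward zero by the threshold.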

We simulate $n$ samples $\{({\mathbf{y}}_{i},{\mathbf{x}}_{i}),i\in[n]\}$ , where the response ${\mathbf{y}}_{i}$ is of dimension $p$ (e.g., genes) and covariate ${\mathbf{x}}_{i}$ is of dimension $q$ (e.g., genetic variants). For ${\mathbf{x}}_{i}$ ’s, we consider two simulation settings. In Setting 1, we consider continuous covariates drawn independently from $\text{Uniform}(0,1)$ and in Setting 2, we consider discrete covariates drawn independently from $\text{Bernoulli}(0.5)$ . Given ${\mathbf{x}}_{i}$ , we simulate ${\mathbf{y}}_{i}$ from $N_{p}({\mathbf{0}},\mbox{\boldmath$\Sigma$}({\mathbf{x}}_{i}))$ , where

\Sigma_{jk}({\mathbf{x}})=\begin{cases}0.5+0.5x_{1},&\quad\text{if }j=k,\\ 0.5x_{1},&\quad\text{if }|j-k|=1,\\ 0,&\quad\text{if }|j-k|>1,\end{cases}

(13)

and $\Sigma_{jk}({\mathbf{x}})$ is the $(j,k)$ th element in $\mbox{\boldmath$\Sigma$}({\mathbf{x}})$ . The above MA(1) covariance structure has been commonly considered by others (Rothman et al., 2009; Qiu and Liyanage, 2019). We consider $n=200,500$ , $p=50$ and $q=30,100$ . For each simulation configuration, we generate 100 independent data sets.
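The data-generating process described above can be sketched as follows. The function names `ma1_covariance` and `simulate` and the random seed are hypothetical choices; responses are drawn through a Cholesky factor of $\mbox{\boldmath$\Sigma$}({\mathbf{x}}_{i})$ .

```python
import numpy as np

def ma1_covariance(x1, p):
    """Covariance in (13): depends on the covariates only through x1."""
    Sigma = (0.5 + 0.5 * x1) * np.eye(p)
    off = 0.5 * x1 * np.ones(p - 1)
    return Sigma + np.diag(off, k=1) + np.diag(off, k=-1)

def simulate(n, p, q, setting, rng):
    """Draw covariates (Setting 1 or 2) and responses y_i ~ N_p(0, Sigma(x_i))."""
    if setting == 1:
        X = rng.uniform(0.0, 1.0, size=(n, q))
    else:
        X = rng.binomial(1, 0.5, size=(n, q)).astype(float)
    Y = np.empty((n, p))
    for i in range(n):
        L = np.linalg.cholesky(ma1_covariance(X[i, 0], p))
        Y[i] = L @ rng.standard_normal(p)
    return Y, X

rng = np.random.default_rng(3)
Y, X = simulate(n=200, p=50, q=30, setting=1, rng=rng)
```

Only the first covariate enters (13), so the remaining $q-1$ covariates are pure noise that a sparse method should screen out.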

Let $\mbox{\boldmath$\Sigma$}_{i}^{\ast}$ denote the true covariance matrix for the $i$ th observation and $\widehat{\mbox{\boldmath$\Sigma$}}_{i}$ denote its estimate from a given method. For illustration, Figure 1 plots the estimated $\widehat{\Sigma}_{i,12}$ against $\Sigma_{i,12}^{\ast}$ for $i\in[n]$ from 5 data replicates. We did not include the scatter plot from SparseSample as it is very similar to that of DenseSample. As DenseSample does not account for the effect of covariates, $\Sigma_{i,12}^{\ast}$ is estimated to be constant across all subjects, as shown by the five horizontal lines from the 5 data replicates. The CovReg method of Hoff and Niu (2012) cannot estimate the covariance well, as the true covariance is not a quadratic function of the covariates. DenseCovReg gives reasonable agreement between the estimated and true covariances (slopes are all roughly 1), though the variability of the estimates is very high. In contrast, the proposed SparseCovReg estimates the covariance well (slopes are all roughly 1) with much reduced variability.

Figure 1: Comparison of the true non-zero covariance $\Sigma_{i,12}^{\ast}$ (x-axis) and estimated covariance $\widehat{\Sigma}_{i,12}$ (y-axis) for five simulated datasets under Setting 1 (continuous covariates) with the number of covariates $q=30$ and the sample size $n=500$ .

$n$	$q$	method	Setting 1	Setting 2
200	30	DenseSample	3.22 (0.08)	4.15 (0.07)
		SparseSample	2.01 (0.06)	3.23 (0.06)
		CovReg	7.75 (0.31)	8.60 (0.43)
		DenseCovReg	15.09 (0.34)	15.64 (0.46)
		SparseCovReg	1.84 (0.06)	2.17 (0.08)
	100	DenseSample	3.21 (0.08)	4.13 (0.08)
		SparseSample	2.01 (0.06)	3.22 (0.06)
		CovReg	14.64 (1.74)	15.97 (2.20)
		DenseCovReg	26.92 (0.60)	27.64 (0.79)
		SparseCovReg	1.84 (0.08)	2.18 (0.10)
500	30	DenseSample	2.41 (0.04)	3.53 (0.03)
		SparseSample	1.75 (0.04)	3.12 (0.03)
		CovReg	4.35 (0.14)	5.18 (0.12)
		DenseCovReg	9.55 (0.14)	9.93 (0.18)
		SparseCovReg	1.28 (0.05)	1.42 (0.07)
	100	DenseSample	2.39 (0.04)	3.52 (0.03)
		SparseSample	1.75 (0.04)	3.11 (0.03)
		CovReg	11.06 (0.34)	12.21 (0.43)
		DenseCovReg	17.26 (0.28)	17.86 (0.36)
		SparseCovReg	1.29 (0.06)	1.45 (0.07)

Table 1: Average estimation errors measured via $\|\widehat{\mbox{\boldmath$\Sigma$}}_{i}-\mbox{\boldmath$\Sigma$}_{i}^{\ast}\|_{F}$ over 100 simulations with standard errors shown in parentheses.

Next, we compare the estimation errors calculated as $n^{-1}\sum_{i=1}^{n}\|\widehat{\mbox{\boldmath$\Sigma$}}_{i}-\mbox{\boldmath$\Sigma$}_{i}^{\ast}\|_{F}$ . Table 1 reports the average estimation errors with standard errors in parentheses. The proposed SparseCovReg outperforms the alternative methods for all $n$ and $q$ . The estimation errors of SparseCovReg increase slightly with $q$ and decrease with $n$ , consistent with the results of Theorem 2. In Table 2, we also report the selection accuracy of SparseCovReg by measuring the true positive rate and the false positive rate as

	True positive rate	$\displaystyle=\frac{\#\{(h,j,k):\widehat{B}_{h,jk}\neq 0,\;B_{h,jk}^{\ast}\neq 0% \}}{\#\{(h,j,k):B_{h,jk}^{\ast}\neq 0\}}$
	False positive rate	$\displaystyle=\frac{\#\{(h,j,k):\widehat{B}_{h,jk}\neq 0,\;B_{h,jk}^{\ast}=0\}% }{\#\{(h,j,k):B_{h,jk}^{\ast}=0\}}.$

Note that the selection accuracy cannot be fairly evaluated from other methods, as DenseSample, CovReg and DenseCovReg are all dense estimators, and SparseSample does not estimate $B_{h,jk}$ for $h\in[q]$ .
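The two selection rates defined above can be computed directly from the supports of the estimated and true coefficients. In this sketch the coefficients are stored in a hypothetical array of shape (number of matrices, $p$ , $p$ ), and every index triple $(h,j,k)$ is counted; whether symmetric duplicates or the intercept matrix are counted in the reported tables is not specified here.

```python
import numpy as np

def selection_rates(B_hat, B_true):
    """True and false positive rates over all entries (h, j, k)."""
    est_nz = B_hat != 0
    true_nz = B_true != 0
    tpr = np.sum(est_nz & true_nz) / np.sum(true_nz)
    fpr = np.sum(est_nz & ~true_nz) / np.sum(~true_nz)
    return tpr, fpr

# Toy example: two 3 x 3 coefficient matrices with one true symmetric edge.
B_true = np.zeros((2, 3, 3))
B_true[0, 0, 1] = B_true[0, 1, 0] = 0.5
B_hat = np.zeros((2, 3, 3))
B_hat[0, 0, 1] = B_hat[0, 1, 0] = 0.4   # correctly selected edge
B_hat[1, 2, 2] = 0.1                    # one spurious selection
tpr, fpr = selection_rates(B_hat, B_true)
```

Here both true entries are recovered and one of the 16 true zeros is selected.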

$n$	$q$	Selection rate	Setting 1	Setting 2
200	30	TPR	0.8880	0.9960
		FPR	0.0145	0.0147
	100	TPR	0.8749	0.9960
		FPR	0.0042	0.0042
500	30	TPR	0.9919	0.9999
		FPR	0.0155	0.0155
	100	TPR	0.9926	1.0000
		FPR	0.0044	0.0042

Table 2: True positive rate (TPR) and false positive rate (FPR) of SparseCovReg.

Figure 2: 95% confidence intervals for elements in ${\mathbf{D}}^{\ast}$ from one data replicate under Setting 2 with $q=100$ and $n=500$ (x-axis: index; y-axis: coefficient). True parameter values are shown in $\circ$ and estimated parameter values, after debiasing, are shown in $\bullet$ .

			Setting 1			Setting 2
$n$	$q$	Method	${\mathbf{D}}^{\ast}$	$\mathcal{J}_{1}({\mathbf{D}}^{\ast})$	$\mathcal{J}^{c}_{1}({\mathbf{D}}^{\ast})$	${\mathbf{D}}^{\ast}$	$\mathcal{J}_{1}({\mathbf{D}}^{\ast})$	$\mathcal{J}^{c}_{1}({\mathbf{D}}^{\ast})$
200	30	$\hat{\sigma}_{ijk}$	0.938	0.928	0.938	0.927	0.910	0.927
		$\sigma_{ijk}^{\ast}$	0.950	0.948	0.950	0.950	0.949	0.950
	100	$\hat{\sigma}_{ijk}$	0.875	0.853	0.875	0.863	0.828	0.863
		$\sigma_{ijk}^{\ast}$	0.951	0.950	0.951	0.951	0.935	0.951
500	30	$\hat{\sigma}_{ijk}$	0.943	0.937	0.943	0.942	0.939	0.942
		$\sigma_{ijk}^{\ast}$	0.951	0.948	0.951	0.951	0.953	0.951
	100	$\hat{\sigma}_{ijk}$	0.925	0.912	0.925	0.921	0.909	0.921
		$\sigma_{ijk}^{\ast}$	0.951	0.947	0.951	0.951	0.951	0.951

Table 3: Average coverage probabilities of the 95% confidence intervals with the variance of $W_{lm}$ estimated by (12), referred to as $\hat{\sigma}_{ijk}$ , and with the true variance of $W_{lm}$ , referred to as $\sigma_{ijk}^{\ast}$ . Columns ${\mathbf{D}}^{\ast}$ , $\mathcal{J}_{1}({\mathbf{D}}^{\ast})$ and $\mathcal{J}^{c}_{1}({\mathbf{D}}^{\ast})$ show average coverage probabilities of all off-diagonal parameters, non-zero off-diagonal parameters and zero off-diagonal parameters in ${\mathbf{D}}^{\ast}$ , respectively.

Next, we evaluate the efficacy of the statistical inference procedure from Section 4.2. In Figure 2, we plot 95% confidence intervals for entries in ${\mathbf{D}}^{\ast}$ from one data replicate under Setting 2 with $q=100$ and $n=500$ . Specifically, the first 49 confidence intervals in Figure 2 are for $B_{1,jk}^{\ast}$ with $|j-k|=1$ , for which (13) gives $B_{1,jk}^{\ast}=0.5$ . The last 48 confidence intervals are for $B_{1,jk}^{\ast}$ with $|j-k|=2$ , for which (13) gives $B_{1,jk}^{\ast}=0$ . The 95% confidence intervals show good coverage. Finally, Table 3 reports the coverage probabilities for all off-diagonal parameters in ${\mathbf{D}}^{\ast}$ , and separately for those in $\mathcal{J}_{1}({\mathbf{D}}^{\ast})$ and $\mathcal{J}^{c}_{1}({\mathbf{D}}^{\ast})$ . Under each $(n,q)$ setting, we report the average coverage probabilities when the confidence intervals are calculated using the true variances of the $W_{lm}$ ’s and using the empirical variances estimated by (12). The intervals calculated with empirical variances achieve satisfactory coverage, which approaches 95% as the sample size increases.
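Empirical coverage of the kind reported in Table 3 is simply the fraction of intervals that contain the corresponding true parameter. A minimal sketch with hypothetical interval endpoints (the function name `coverage` and the numbers are illustrative only):

```python
def coverage(lower, upper, truth):
    """Fraction of intervals [lower_i, upper_i] that contain truth_i."""
    hits = [lo <= t <= up for lo, up, t in zip(lower, upper, truth)]
    return sum(hits) / len(hits)

# Three hypothetical intervals; the last one misses its target value 0.5.
cov = coverage([0.3, -0.1, 0.2], [0.7, 0.1, 0.4], [0.5, 0.0, 0.5])
```

Averaging this quantity over parameters and over data replicates gives coverage probabilities comparable to the table entries.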

6 Real Data Analysis

We apply our proposed method SparseCovReg to the REMBRANDT study (GSE108476) that collects data on 178 patients with glioblastoma multiforme (GBM), the most common malignant form of brain tumor in adults and one of the most lethal of all cancers (Akhavan et al., 2010). These 178 patients had undergone microarray and single-nucleotide polymorphism (SNP) chip profiling, with both gene expression and SNP data available for analysis. The raw data were pre-processed and normalized using standard pipelines; see Gusev et al. (2018) for more details. The main objectives of our analysis are to identify co-expression QTLs and recover both the population-level and individual-level covariance matrices of gene expressions.

For response variables, we consider the expression levels of 73 genes that belong to the human glioma pathway in the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa and Goto, 2000). As covariates, we consider local SNPs (i.e., SNPs that fall within 2kb upstream and 0.5kb downstream of the gene) residing near those 73 genes, resulting in a total of 118 SNPs. SNPs are coded with “0” indicating homozygous in the major allele and “1” otherwise. Our analysis also includes age (continuous) and sex as covariates, bringing the total to 120 covariates and 326,821 parameters in the model (4). Tuning parameters are selected by 5-fold cross validation.
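As a quick check of the parameter count, recall that the coefficient matrix has $q+1$ rows and one column for each of the $p(p+1)/2$ distinct entries of a symmetric $p\times p$ matrix. With $p=73$ and $q=120$ ,

\displaystyle(q+1)\times\frac{p(p+1)}{2}=121\times\frac{73\times 74}{2}=121\times 2701=326{,}821.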

Figure 3: Heatmaps of the population-level covariance estimates from SparseSample (left panel, titled “soft-thresholding estimator”) and SparseCovReg (right panel, titled “Population network from SparseCovReg”), with genes grouped by pathway (p53, calcium, PI3K/MTOR and Ras-Raf-MEK-ERK). Positive values are shown in red and negative values are shown in blue.

We first investigate the population-level co-expression matrix. In Figure 3, we compare the soft-thresholding covariance estimator (Rothman et al., 2009) with the population-level covariance ${\mathbf{B}}_{0}$ obtained from SparseCovReg. The soft-thresholding estimator shares some common patterns with SparseCovReg but is considerably noisier. Using the population covariance matrix from SparseCovReg, we can identify high correlations between PIK3CA and genes in the calcium signaling pathway including CALML5, CALM1, CAMK1D and CAMK2B. This is reasonable as mutations in PIK3CA have been reported in multiple tumor types and PIK3CA is part of the PI3K/AKT/MTOR signaling pathway, one of the core pathways in human GBM (Network et al., 2008). The calcium signaling pathway also plays diverse roles in the progression of brain cancers (Maklad et al., 2019). Figure 3 also shows negative correlations between PIK3CA and genes in the p53 signaling pathway, another core pathway in human GBM (Network et al., 2008) and a potential target for inhibition in GBM treatments (Schröder and McDonald, 2015; Yin et al., 2021).

Figure 4: Heatmaps of identified nonzero covariate effects for rs10509346 (left) and rs6701524 (right), with genes grouped by pathway (p53, calcium, PI3K/MTOR and Ras-Raf-MEK-ERK). Positive values are shown in red and negative values are shown in blue.

Next, we examine the covariate effects on the covariance matrix. Non-zero effects have been identified for six SNPs: rs6701524, rs10509346, rs10519201, rs1347069, rs503314, and rs306098. The non-zero effects of rs10509346 and rs6701524 are shown in Figure 4 and their network effects after the debiased inference procedure are shown in Figure 5. Interestingly, these covariate effects are not easily observable from the soft-thresholding estimator in Figure 3, suggesting that, by fitting model (1), we may find covariate-modulated co-expression patterns that might otherwise be overlooked.

From the left plot of Figure 5, it is seen that rs10509346, residing in CAMK2G, notably affects co-expressions among genes in the Ras-Raf-MEK-ERK signaling pathway including EGF, SHC4, RAF1 and MAPK3. Their co-expressions with CAMK2D and CALML5 in the calcium signaling pathway are also affected by rs10509346. This agrees with the findings that the Ras-Raf-MEK-ERK pathway is modulated by Ca$^{2+}$ and calmodulin (Agell et al., 2002; Zhang and Li, 2023). Furthermore, this SNP is found to affect the co-expressions of MTOR, part of the PI3K/AKT/MTOR pathway, with genes in the Ras-Raf-MEK-ERK pathway. This result is interesting because MTOR is a key mediator of PI3K/AKT/MTOR signaling, and is known to cooperate with alterations in other signaling pathways that are also commonly activated in GBM patients, such as the Ras-Raf-MEK-ERK pathway (Akhavan et al., 2010).

The right plot of Figure 5 shows that rs6701524, residing in MTOR, affects co-expressions of genes in the PI3K/MTOR pathway. In particular, co-expressions of PIK3CD (and PIK3CB) with other genes are affected by this SNP. This is an interesting finding as PI3K/MTOR is a key pathway in the development and progression of GBM, and the inhibition of PI3K/MTOR signaling was found effective in increasing survival with GBM tumors (Batsios et al., 2019). Co-expressions affected by other SNPs are also worth noting. For example, rs306098 has been found to affect co-expressions of SHC2 with CDK4/6 and MTOR, which is interesting because the combination of CDK4/6 and MTOR inhibition has been investigated as a potential therapeutic strategy in GBM (Olmez et al., 2017).

Figure 5: Network effects of rs10509346 (left) and rs6701524 (right) identified using the debiased inference procedure in Section 4.2 with significance level $\alpha=1\%$ and Bonferroni correction $1-\alpha/\{p(p-1)/2\}$ . Positive effects are shown in red and negative effects are shown in blue. Edge weights in the network graphs are proportional to the effect size.
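To give a sense of how stringent the Bonferroni-corrected level in Figure 5 is, the sketch below evaluates it for the $p=73$ genes analyzed here. It assumes that the corrected level $1-\alpha/\{p(p-1)/2\}$ is used directly as the level of the standard normal quantile; whether the test is treated as one-sided or two-sided is not stated, so the critical value is illustrative only.

```python
from statistics import NormalDist

p = 73                      # genes in the KEGG glioma pathway (Section 6)
alpha = 0.01                # nominal significance level
n_pairs = p * (p - 1) // 2  # number of off-diagonal gene pairs
level = 1 - alpha / n_pairs

# Assumption: `level` is plugged in as the quantile level of N(0, 1).
z_crit = NormalDist().inv_cdf(level)
```

With 2628 gene pairs the per-test level is far smaller than the nominal 1%, so only edges with large standardized effects are retained in the network plots.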

7 Discussion

In our current approach, we did not consider a hierarchy between the population-level effect and the covariate-specific effects. However, in some applications, it might be reasonable to assume that a covariate-specific effect is non-zero only when the corresponding population-level effect is non-zero. Our multivariate regression framework can be extended to accommodate such a hierarchical structure by modifying the penalty term (Tibshirani and Friedman, 2020; Kim et al., 2021).

Next, we comment on the model interpretability after scaling the covariates to be in $[0,1]$ . Given bounded covariates $x_{h}\in\mathbb{R}$ for all $h\in[q]$ , denote $LB(x_{h})$ and $UB(x_{h})$ as the lower and upper bounds of $x_{h}$ , respectively, and consider the covariance regression model

\displaystyle\mbox{\boldmath$\Sigma$}({\mathbf{x}})={\mathbf{B}}_{0}+\sum_{h=1% }^{q}\frac{x_{h}-LB(x_{h})}{UB(x_{h})-LB(x_{h})}{\mathbf{B}}_{h}.

The above equation can be rewritten as

\displaystyle\mbox{\boldmath$\Sigma$}({\mathbf{x}})=\bar{{\mathbf{B}}}_{0}+\sum_{h=1}^{q}x_{h}\bar{{\mathbf{B}}}_{h},

where $\bar{{\mathbf{B}}}_{0}={\mathbf{B}}_{0}-\sum_{h=1}^{q}[LB(x_{h})/\{UB(x_{h})-LB(x_{h})\}]{\mathbf{B}}_{h}$ and $\bar{{\mathbf{B}}}_{h}={\mathbf{B}}_{h}/\{UB(x_{h})-LB(x_{h})\}$ . Note that $\bar{{\mathbf{B}}}_{h}$ and ${\mathbf{B}}_{h}$ only differ by a positive scalar, and they share the same sparsity pattern. Hence, parameter estimates can be interpreted with covariates before the transformation.
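The equivalence of the two parameterizations can be verified numerically. In the sketch below the bounds, dimensions and coefficient matrices are hypothetical random values; `B_bar` and `B0_bar` follow the definitions of the rescaled coefficients given above.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 4, 3
lb = np.array([18.0, 0.0, -2.0])     # hypothetical lower bounds LB(x_h)
ub = np.array([90.0, 1.0, 3.0])      # hypothetical upper bounds UB(x_h)

def random_symmetric(p):
    """Random symmetric matrix used as a stand-in coefficient matrix."""
    A = rng.standard_normal((p, p))
    return (A + A.T) / 2

B0 = random_symmetric(p)
B = [random_symmetric(p) for _ in range(q)]

# Coefficients expressed on the original covariate scale.
B_bar = [B[h] / (ub[h] - lb[h]) for h in range(q)]
B0_bar = B0 - sum(lb[h] * B_bar[h] for h in range(q))

# Evaluate both forms of Sigma(x) at a random covariate vector within bounds.
x = rng.uniform(lb, ub)
sigma_scaled = B0 + sum((x[h] - lb[h]) / (ub[h] - lb[h]) * B[h] for h in range(q))
sigma_raw = B0_bar + sum(x[h] * B_bar[h] for h in range(q))
```

Both expressions give the same covariance matrix, and each rescaled coefficient matrix has exactly the same zero pattern as its counterpart, so the selected edges do not depend on the scaling.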

References

Agell et al. (2002) Agell, N., Bachs, O., Rocamora, N., and Villalonga, P. (2002), “Modulation of the Ras/Raf/MEK/ERK pathway by Ca2+, and calmodulin,” Cellular signalling, 14, 649–654.
Akhavan et al. (2010) Akhavan, D., Cloughesy, T. F., and Mischel, P. S. (2010), “mTOR signaling in glioblastoma: lessons learned from bench to bedside,” Neuro-oncology, 12, 882–889.
Alakus et al. (2022) Alakus, C., Larocque, D., and Labbe, A. (2022), “Covariance regression with random forests,” arXiv preprint arXiv:2209.08173.
Anderson (1973) Anderson, T. W. (1973), “Asymptotically efficient estimation of covariance matrices with linear structure,” The Annals of Statistics, 1, 135–141.
Argyriou et al. (2008) Argyriou, A., Evgeniou, T., and Pontil, M. (2008), “Convex multi-task feature learning,” Machine learning, 73, 243–272.
Batsios et al. (2019) Batsios, G., Viswanath, P., Subramani, E., Najac, C., Gillespie, A. M., Santos, R. D., Molloy, A. R., Pieper, R. O., and Ronen, S. M. (2019), “PI3K/mTOR inhibition of IDH1 mutant glioma leads to reduced 2HG production that is associated with increased survival,” Scientific reports, 9, 10521.
Bickel et al. (2008a) Bickel, P. J., Levina, E., et al. (2008a), “Covariance regularization by thresholding,” The Annals of Statistics, 36, 2577–2604.
Bickel et al. (2008b) — (2008b), “Regularized estimation of large covariance matrices,” The Annals of Statistics, 36, 199–227.
Bien and Tibshirani (2011) Bien, J. and Tibshirani, R. J. (2011), “Sparse estimation of a covariance matrix,” Biometrika, 98, 807–820.
Bühlmann and Van de Geer (2015) Bühlmann, P. and Van de Geer, S. (2015), “High-dimensional inference in misspecified linear models,” Electronic Journal of Statistics, 9, 1449–1473.
Butte et al. (2000) Butte, A. J., Tamayo, P., Slonim, D., Golub, T. R., and Kohane, I. S. (2000), “Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks,” Proceedings of the National Academy of Sciences, 97, 12182–12186.
Cai et al. (2022) Cai, T. T., Zhang, A. R., and Zhou, Y. (2022), “Sparse group lasso: Optimal sample complexity, convergence rate, and statistical inference,” IEEE Transactions on Information Theory, 68, 5975–6002.
Chiu et al. (1996) Chiu, T. Y., Leonard, T., and Tsui, K.-W. (1996), “The matrix-logarithmic covariance model,” Journal of the American Statistical Association, 91, 198–210.
El Karoui et al. (2010) El Karoui, N. et al. (2010), “High-dimensionality effects in the Markowitz problem and other quadratic programs with linear constraints: Risk underestimation,” The Annals of Statistics, 38, 3487–3566.
Fox and Dunson (2015) Fox, E. B. and Dunson, D. B. (2015), “Bayesian nonparametric covariance regression,” The Journal of Machine Learning Research, 16, 2501–2542.
Franks (2021) Franks, A. M. (2021), “Reducing subspace models for large-scale covariance regression,” Biometrics.
Gardner et al. (2003) Gardner, T. S., Di Bernardo, D., Lorenz, D., and Collins, J. J. (2003), “Inferring genetic networks and identifying compound mode of action via expression profiling,” Science, 301, 102–105.
Gusev et al. (2018) Gusev, Y., Bhuvaneshwar, K., Song, L., Zenklusen, J.-C., Fine, H., and Madhavan, S. (2018), “The REMBRANDT study, a large collection of genomic data from brain cancer patients,” Scientific data, 5, 1–9.
Hastie et al. (2015) Hastie, T., Tibshirani, R., and Wainwright, M. (2015), “Statistical learning with sparsity,” Monographs on statistics and applied probability, 143, 143.
Hoff and Niu (2012) Hoff, P. D. and Niu, X. (2012), “A covariance regression model,” Statistica Sinica, 729–753.
Huang et al. (2006) Huang, J. Z., Liu, N., Pourahmadi, M., and Liu, L. (2006), “Covariance matrix selection and estimation via penalised normal likelihood,” Biometrika, 93, 85–98.
Javanmard and Montanari (2014) Javanmard, A. and Montanari, A. (2014), “Confidence intervals and hypothesis testing for high-dimensional regression,” The Journal of Machine Learning Research, 15, 2869–2909.
Kanehisa and Goto (2000) Kanehisa, M. and Goto, S. (2000), “KEGG: kyoto encyclopedia of genes and genomes,” Nucleic acids research, 28, 27–30.
Kim et al. (2021) Kim, R., Mueller, S., and Garcia, T. P. (2021), “svReg: Structural varying-coefficient regression to differentiate how regional brain atrophy affects motor impairment for Huntington disease severity groups,” Biometrical Journal, 63, 1254–1271.
Lam and Fan (2009) Lam, C. and Fan, J. (2009), “Sparsistency and rates of convergence in large covariance matrix estimation,” Annals of statistics, 37, 4254.
Langfelder and Horvath (2008) Langfelder, P. and Horvath, S. (2008), “WGCNA: an R package for weighted correlation network analysis,” BMC bioinformatics, 9, 1–13.
Ledoit and Wolf (2004) Ledoit, O. and Wolf, M. (2004), “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of multivariate analysis, 88, 365–411.
Li et al. (2015) Li, Y., Nan, B., and Zhu, J. (2015), “Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure,” Biometrics, 71, 354–363.
Li et al. (2010) Li, Y., Wang, N., and Carroll, R. J. (2010), “Generalized functional linear models with semiparametric single-index interactions,” Journal of the American Statistical Association, 105, 621–633.
Maklad et al. (2019) Maklad, A., Sharma, A., and Azimi, I. (2019), “Calcium signaling in brain cancers: roles and therapeutic targeting,” Cancers, 11, 145.
Network et al. (2008) Network et al. (2008), “Comprehensive genomic characterization defines human glioblastoma genes and core pathways,” Nature, 455, 1061–1068.
Olmez et al. (2017) Olmez, I., Brenneman, B., Xiao, A., Serbulea, V., Benamar, M., Zhang, Y., Manigat, L., Abbas, T., Lee, J., Nakano, I., et al. (2017), “Combined CDK4/6 and mTOR inhibition is synergistic against glioblastoma via multiple mechanisms,” Clinical Cancer Research, 23, 6958–6968.
Park (2023) Park, H. G. (2023), “Bayesian estimation of covariate assisted principal regression for brain functional connectivity,” arXiv preprint arXiv:2306.07181.
Pourahmadi (1999) Pourahmadi, M. (1999), “Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation,” Biometrika, 86, 677–690.
Qiu and Liyanage (2019) Qiu, Y. and Liyanage, J. S. (2019), “Threshold selection for covariance estimation,” Biometrics, 75, 895–905.
Rothman et al. (2009) Rothman, A. J., Levina, E., and Zhu, J. (2009), “Generalized thresholding of large covariance matrices,” Journal of the American Statistical Association, 104, 177–186.
Schröder and McDonald (2015) Schröder, L. B. and McDonald, K. L. (2015), “CDK4/6 inhibitor PD0332991 in glioblastoma treatment: does it have a future?” Frontiers in Oncology, 5, 259.
Simon et al. (2013) Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2013), “A sparse-group lasso,” Journal of Computational and Graphical Statistics, 22, 231–245.
Su et al. (2023) Su, C., Xu, Z., Shan, X., Cai, B., Zhao, H., and Zhang, J. (2023), “Cell-type-specific co-expression inference from single cell RNA-sequencing data,” Nature Communications, 14, 4846.
Tibshirani and Friedman (2020) Tibshirani, R. and Friedman, J. (2020), “A pliable lasso,” Journal of Computational and Graphical Statistics, 29, 215–225.
Van Der Wijst et al. (2018) Van Der Wijst, M. G., de Vries, D. H., Brugge, H., Westra, H.-J., and Franke, L. (2018), “An integrative approach for building personalized gene regulatory networks for precision medicine,” Genome Medicine, 10, 1–15.
Vierstra et al. (2020) Vierstra, J., Lazar, J., Sandstrom, R., Halow, J., Lee, K., Bates, D., Diegel, M., Dunn, D., Neri, F., Haugen, E., et al. (2020), “Global reference mapping of human transcription factor footprints,” Nature, 583, 729–736.
Wu and Pourahmadi (2003) Wu, W. B. and Pourahmadi, M. (2003), “Nonparametric estimation of large covariance matrices of longitudinal data,” Biometrika, 90, 831–844.
Xue et al. (2012) Xue, L., Ma, S., and Zou, H. (2012), “Positive-definite $\ell_{1}$-penalized estimation of large covariance matrices,” Journal of the American Statistical Association, 107, 1480–1491.
Yin et al. (2021) Yin, L., Yao, Z., Wang, Y., Huang, J., Mazuranic, M., and Yin, A. (2021), “Preclinical evaluation of novel CDK4/6 inhibitor GLR2007 in glioblastoma models,” Journal of Clinical Oncology, 39, e14023.
Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006), “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68, 49–67.
Zhang and Zhang (2014) Zhang, C.-H. and Zhang, S. S. (2014), “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76, 217–242.
Zhang and Li (2023) Zhang, J. and Li, Y. (2023), “High-dimensional Gaussian graphical regression models with covariates,” Journal of the American Statistical Association, 118, 2088–2100.
Zhang et al. (2020) Zhang, J., Sun, W. W., and Li, L. (2020), “Mixed-effect time-varying network model and application in brain connectivity analysis,” Journal of the American Statistical Association, 115, 2022–2036.
Zhang et al. (2023) Zhang, J., Sun, W. W., and Li, L. (2023), “Generalized connectivity matrix response regression with applications in brain connectivity studies,” Journal of Computational and Graphical Statistics, 32, 252–262.
Zhang and Zhao (2023) Zhang, J. and Zhao, H. (2023), “eQTL studies: from bulk tissues to single cells,” Journal of Genetics and Genomics, 50, 925–933.
Zhao et al. (2021) Zhao, Y., Wang, B., Mostofsky, S. H., Caffo, B. S., and Luo, X. (2021), “Covariate assisted principal regression for covariance matrix outcomes,” Biostatistics, 22, 629–645.
Zou et al. (2017) Zou, T., Lan, W., Wang, H., and Tsai, C.-L. (2017), “Covariance regression analysis,” Journal of the American Statistical Association, 112, 266–281.