Contrastive independent component analysis

Kexin Wang (Harvard University, 29 Oxford Street, Pierce Hall 212A, Cambridge, MA 02138, USA), Aida Maraj (Harvard University, 29 Oxford Street, Pierce Hall 212A, Cambridge, MA 02138, USA, and University of Michigan, East Hall 1855, Ann Arbor, MI 48109), and Anna Seigal (Harvard University, 29 Oxford Street, Pierce Hall 324, Cambridge, MA 02138, USA)

Abstract.

Visualizing data and finding patterns in data are ubiquitous problems in the sciences. Increasingly, applications seek signal and structure in a contrastive setting: a foreground dataset relative to a background dataset. For this purpose, we propose contrastive independent component analysis (cICA). This generalizes independent component analysis to independent latent variables across a foreground and background. We propose a hierarchical tensor decomposition algorithm for cICA. We study the identifiability of cICA and demonstrate its performance visualizing data and finding patterns in data, using synthetic and real-world datasets, comparing the approach to existing contrastive methods.

1. Introduction

Finding and understanding patterns in data is fundamental in various scientific fields. Often, data have been collected under two different settings, such as a group of patients receiving a treatment and a control group, or a group of patients with a certain disease and a group without the disease. The goal may be to understand the effect of the treatment, or to understand the genetic changes that describe the disease. While standard data analysis methods can be used, which restrict attention to one of the datasets or combine them together, an alternate view is offered by contrastive methods. Contrastive methods view the two settings as a foreground and a background. They seek to learn patterns in the foreground after accounting for (or, “subtracting off”) the background. The hope is that such patterns encode useful structure and offer a good basis for dimensionality reduction and visualization of the data, to identify fine-grained structure and clusters particular to the foreground.

The contrastive viewpoint is first addressed in [ZHPA13], in contrastive topic modeling and contrastive hidden Markov Models applied to genomic sequence analysis. Principal component analysis (PCA) is generalized to contrastive PCA (cPCA) in [AZBZ17, AZBZ18]. The contrastive patterns are principal components of the foreground covariance matrix minus a scalar multiple of the background covariance matrix. The paper [SGN19] studies a linear contrastive latent variable model. Probabilistic contrastive PCA (PCPCA) is introduced in [LJE20], where foreground patterns are inferred by maximizing a likelihood ratio of linear Gaussian mixtures.

In this paper, we propose contrastive independent component analysis (cICA). Independent component analysis (ICA) is a blind source separation method, which seeks to recover latent sources and unknown mixing from observations of mixtures of signals [CJ10]. ICA assumes that latent sources are independent. In extending ICA to the contrastive setting, the idea is that background data is generated by mixing of independent sources while foreground data is generated by the background mixing together with a foreground mixing of independent sources. The patterns of interest are the foreground mixing.

We show that cICA has strong identifiability properties. These enable the contribution of each background pattern to the foreground to be found uniquely. This avoids the need for a sweep of hyperparameters to find the best multiple of the background to subtract from the foreground and even avoids the assumption that the background contribution to the foreground is via a single scalar multiple, both of which are required in cPCA and PCPCA [AZBZ17, AZ19, LJE20]. We develop tensor decomposition algorithms for cICA and show that they recover accurate patterns for synthetic data. For this, we devise a new hierarchical tensor decomposition based on recursive eigendecompositions. We turn cICA into a dimensionality reduction tool and investigate its performance on real-world data, comparing the plots to those obtained with other contrastive methods to see its competitiveness.

The paper is organized as follows. We define cICA and introduce a tensor decomposition approach to learn it in Section 2. A key ingredient of learning cICA is a new hierarchical tensor decomposition, which we introduce and study in Section 3. We study the identifiability results and present algorithms for cICA in Section 4. Numerical results are in Section 6.

2. From ICA to contrastive ICA

Independent component analysis (ICA) studies observations that are a linear mixture of independent source variables. We write the ICA model as

(1)

\mathbf{y}=A\mathbf{z},

where $\mathbf{z}$ is a vector of $r$ independent latent random variables, the mixing matrix is $A\in\mathbb{R}^{p\times r}$ , and $\mathbf{y}$ is a vector of $p$ observed variables. The $i$ -th column of $A$ records a pattern in the data: the contribution of variable $z_{i}$ to each of the $p$ observed variables. The identifiability of ICA refers to the uniqueness of the mixing matrix $A$ and sometimes also of the variables $\mathbf{z}$ ; see [EK04, Com94, WS24].

Many algorithms for ICA proceed via tensor decomposition, see e.g. [CJ10, CS93, DLDMV01, DLCC07]. The cumulants of a distribution are symmetric tensors that encode it. The $d$ -th cumulant $\kappa_{d}(\mathbf{y})$ of $\mathbf{y}$ is a symmetric order $d$ tensor of format $p\times\cdots\times p$ with decomposition

(2)

\kappa_{d}(\mathbf{y})=\sum_{i=1}^{r}\lambda_{i}\mathbf{a}_{i}^{\otimes d},

where the scalar $\lambda_{i}$ is the $d$ -th cumulant of $z_{i}$ and the vector $\mathbf{a}_{i}\in\mathbb{R}^{p}$ is the $i$ -th column of $A$ . This decomposition (2) follows from the multi-linear properties of cumulants and the fact that cumulant tensors of independent variables are diagonal, see [McC18, Chapter 2]. The matrix $A$ can be recovered using tensor decomposition of the cumulant tensor (2). If the tensor decomposition is identifiable, then the columns $\mathbf{a}_{i}$ with $\lambda_{i}\neq 0$ can be recovered uniquely up to permutation and scaling of columns. Thus tensor decomposition of higher-order cumulant tensors gives an algorithm for ICA, provided no source variable is Gaussian (this is required for non-zero higher-order cumulants).
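The following is a minimal sketch of how an empirical fourth-order cumulant tensor could be formed from samples, and a check of the multilinearity property behind (2). The function name `fourth_cumulant`, the uniform sources, and the random mixing matrix are illustrative assumptions, not part of the paper. The estimator uses plain (biased) sample moments.

```python
import numpy as np

def fourth_cumulant(Y):
    """Empirical fourth-order cumulant tensor of samples Y (shape n x p)."""
    Yc = Y - Y.mean(axis=0)                      # center each variable
    n = Yc.shape[0]
    m4 = np.einsum("ni,nj,nk,nl->ijkl", Yc, Yc, Yc, Yc) / n  # fourth moments
    C = Yc.T @ Yc / n                            # biased covariance estimate
    return (m4
            - np.einsum("ij,kl->ijkl", C, C)
            - np.einsum("ik,jl->ijkl", C, C)
            - np.einsum("il,jk->ijkl", C, C))

rng = np.random.default_rng(0)
n, p, r = 2000, 3, 2
Z = rng.uniform(-1.0, 1.0, size=(n, r))          # hypothetical non-Gaussian sources
A = rng.standard_normal((p, r))                  # hypothetical mixing matrix
Y = Z @ A.T                                      # rows are samples of y = A z

kappa_y = fourth_cumulant(Y)
kappa_z = fourth_cumulant(Z)
# Multilinearity: apply A along each of the four modes of the source cumulant.
kappa_y_from_z = np.einsum("abcd,ia,jb,kc,ld->ijkl", kappa_z, A, A, A, A)
print(np.allclose(kappa_y, kappa_y_from_z))
```

With finitely many samples the source cumulant `kappa_z` is only approximately diagonal, so the decomposition (2) holds approximately for empirical data, while the multilinear relation between `kappa_y` and `kappa_z` holds exactly for these sample estimates.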

In this paper we extend ICA, and tensor decomposition for ICA, to the comparison of two distributions. We call this contrastive ICA (cICA), by analogy with cPCA [AZBZ18]. We have two observed distributions, a foreground and a background. Both are assumed to be linear mixtures of independent source variables. The cICA model expresses the background $\mathbf{y}$ and foreground $\mathbf{x}$ as

(3)

\mathbf{y}=A\mathbf{z}\qquad\text{and}\qquad\mathbf{x}=A\mathbf{z}^{\prime}+B\mathbf{s}.

The background distribution $\mathbf{y}$ is a linear mixture of a random vector $\mathbf{z}$ of $r$ independent random variables, as in (1). The foreground $\mathbf{x}$ is a mixture of $r+\ell$ independent variables $\mathbf{z}^{\prime}=(z_{1}^{\prime},\ldots,z_{r}^{\prime})$ and $\mathbf{s}=(s_{1},\ldots,s_{\ell})$ . The columns of $A$ are the patterns in the background: column $\mathbf{a}_{i}\in\mathbb{R}^{p}$ records how source variable $z_{i}$ appears among the $p$ background variables as well as how source variable $z_{i}^{\prime}$ appears among the $p$ foreground variables. The columns of $B$ are patterns that appear only in the foreground. They correspond to the variables $s_{i}$ , referred to as the salient variables in [AZ19].
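To make the model (3) concrete, here is a small sketch that draws synthetic background and foreground samples. The function name `sample_cica`, the Laplace and uniform source distributions, and the dimensions are assumptions chosen for illustration; the model itself only requires independent non-Gaussian sources.

```python
import numpy as np

def sample_cica(A, B, n_bg, n_fg, rng):
    """Draw background samples y = A z and foreground samples x = A z' + B s."""
    r, ell = A.shape[1], B.shape[1]
    z = rng.laplace(size=(n_bg, r))                   # background sources
    z_prime = rng.laplace(scale=2.0, size=(n_fg, r))  # foreground copies of the background sources
    s = rng.uniform(-1.0, 1.0, size=(n_fg, ell))      # salient sources
    Y = z @ A.T                                       # background data, one sample per row
    X = z_prime @ A.T + s @ B.T                       # foreground data, one sample per row
    return X, Y

rng = np.random.default_rng(1)
p, r, ell = 5, 2, 1
A = rng.standard_normal((p, r))
B = rng.standard_normal((p, ell))
X, Y = sample_cica(A, B, n_bg=500, n_fg=400, rng=rng)

# Background data spans the columns of A; foreground data also picks up B.
rank_Y = np.linalg.matrix_rank(Y, tol=1e-8)
rank_X = np.linalg.matrix_rank(X, tol=1e-8)
print(rank_Y, rank_X)
```

The foreground and background sample sizes differ on purpose: the two datasets are unpaired, so nothing in the model ties a row of `X` to a row of `Y`.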

We propose a tensor decomposition algorithm to recover mixing matrices $A$ and $B$ from (3). These matrices record the patterns that encode our background and foreground distributions. We apply the algorithm to empirical cumulant tensors of $\mathbf{x}$ and $\mathbf{y}$ obtained from sample data. We order the columns of matrix $B$ to obtain a dimensionality reduction tool. We work under the assumption that $\mathbf{z},\mathbf{z}^{\prime},\mathbf{s}$ are non-Gaussian, an assumption that also appears for usual ICA. This can likely be relaxed to the assumption that at most one source is Gaussian, cf. [Com94, WS24].

Under the model (3), the $d$ -th cumulants of the background and foreground data are, respectively,

(4)

\kappa_{d}(\mathbf{y})=\sum_{i=1}^{r}\lambda_{i}\mathbf{a}_{i}^{\otimes d},\qquad\quad\kappa_{d}(\mathbf{x})=\sum_{i=1}^{r}\lambda_{i}^{\prime}\mathbf{a}_{i}^{\otimes d}+\sum_{j=1}^{\ell}\nu_{j}\mathbf{b}_{j}^{\otimes d},

where $\lambda_{i}$ is the $d$ -th cumulant of $z_{i}$ , $\lambda_{i}^{\prime}$ is the $d$ -th cumulant of $z_{i}^{\prime}$ , and $\nu_{j}$ is the $d$ -th cumulant of $s_{j}$ . This follows from the multilinearity of cumulants and the fact that cumulant tensors of independent sources are diagonal, as for usual ICA. We hence have the following problem to recover $A$ and $B$ : find a joint decomposition of the cumulant tensors $\kappa_{d}(\mathbf{y})$ and $\kappa_{d}(\mathbf{x})$ of the form in (4). Our approach is:

(1) Compute a symmetric tensor decomposition of $\kappa_{d}(\mathbf{y})$ to learn $A$ .

(2) Find the coefficients $\lambda_{i}^{\prime}$ of each $\mathbf{a}_{i}^{\otimes d}$ in $\kappa_{d}(\mathbf{x})$ to obtain $\sum_{j=1}^{\ell}\nu_{j}\mathbf{b}_{j}^{\otimes d}$ .

(3) Compute a symmetric tensor decomposition of $\sum_{j=1}^{\ell}\nu_{j}\mathbf{b}_{j}^{\otimes d}$ to learn $B$ .

We work with the fourth order cumulants $d=4$ . We propose a hierarchical eigendecomposition based algorithm to decompose an order four symmetric tensor, which we describe in the next section. This is a key ingredient to our cICA algorithm.

2.1. Related Work

We relate cICA to other contrastive models. Setting $\mathbf{z}^{\prime}=\gamma\mathbf{z}$ and studying observed distributions $\mathbf{x}$ and $\mathbf{y}$ via their covariance matrices ( $d=2$ ) specializes cICA to cPCA from [AZBZ17, AZBZ18]. cICA also relates to PCPCA [LJE20] but we do not impose distributional assumptions, beyond independence, on the variables $\mathbf{z}$ and $(\mathbf{z}^{\prime},\mathbf{s})$ . Finally, the cICA model fits into the contrastive latent variable model framework of [SGN19], but we disregard noise terms and do not impose $\mathbf{z}=\mathbf{z}^{\prime}$ .
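A short worked computation may help make the specialization to cPCA concrete. It assumes, beyond what is stated above, that $\mathbf{z}$ and $\mathbf{s}$ are uncorrelated, and it writes $\Sigma_{\mathbf{z}}$ and $\Sigma_{\mathbf{s}}$ (our notation) for their covariance matrices, which are diagonal because the sources are independent. With $\mathbf{z}^{\prime}=\gamma\mathbf{z}$ and $d=2$ :

```latex
\begin{aligned}
\kappa_{2}(\mathbf{y}) &= A\,\Sigma_{\mathbf{z}}A^{\mathsf{T}},\\
\kappa_{2}(\mathbf{x}) &= \gamma^{2}A\,\Sigma_{\mathbf{z}}A^{\mathsf{T}}+B\,\Sigma_{\mathbf{s}}B^{\mathsf{T}},\\
\kappa_{2}(\mathbf{x})-\gamma^{2}\kappa_{2}(\mathbf{y}) &= B\,\Sigma_{\mathbf{s}}B^{\mathsf{T}}.
\end{aligned}
```

Under these assumptions, the foreground covariance minus the scalar multiple $\gamma^{2}$ of the background covariance depends only on the foreground patterns $B$ . This is the matrix whose principal components cPCA computes, with $\gamma^{2}$ playing the role of the contrast parameter.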

The setting of cICA relates to usual ICA, with block structure on the mixing matrix:

\text{if $\mathbf{z}^{\prime}\!$, $\mathbf{z}$, $\mathbf{s}$ are independent,}\,\begin{pmatrix}\mathbf{x}\\ \mathbf{y}\end{pmatrix}=\begin{pmatrix}0&A&B\\ A&0&0\end{pmatrix}\begin{pmatrix}\mathbf{z}\phantom{{}^{\prime}}\\ \mathbf{z}^{\prime}\\ \mathbf{s}\phantom{{}^{\prime}}\end{pmatrix};\quad\text{if $\mathbf{z}^{\prime}=\gamma\mathbf{z}$,}\,\begin{pmatrix}\mathbf{x}\\ \mathbf{y}\end{pmatrix}=\begin{pmatrix}\gamma A&B\\ A&0\end{pmatrix}\begin{pmatrix}\mathbf{z}\\ \mathbf{s}\end{pmatrix}.

However, learning parameters via usual ICA in either of these settings requires access to the joint distribution of $(\mathbf{x},\mathbf{y})$ . In the first setting, $\mathbf{x},\mathbf{y}$ are independent, so we can build the joint distribution of $(\mathbf{x},\mathbf{y})$ from unpaired observations of $\mathbf{x}$ and $\mathbf{y}$ . Identifiability can be characterized using [Com94], or using [EK04, WS24] if the model is overcomplete (i.e. the number of sources exceeds the number of observations, which occurs for $2r+\ell>2p$ ). However, in practice, the independence assumption on all of $\mathbf{z}^{\prime},\mathbf{z},\mathbf{s}$ is too strong. In the second setting, we do not have access to the joint distribution of $(\mathbf{x},\mathbf{y})$ unless we have paired data samples, which is unrealistic in the settings we study.

In [SSDU24], the authors study multi-modal linear ICA. They recover the mixing matrices from each mode via usual linear ICA and then use a hypothesis test to decide which latent variables should be shared across modes. Our method differs from this as we seek shared patterns across datasets rather than shared latent variables.

Contrastive methods related to nonlinear ICA have been explored in the literature. Nonlinear ICA is studied using contrastive learning in [HM16, HST19, LF22]. Here contrastive is used in a different sense: it describes a method to train a network to distinguish two datasets. A nonlinear contrastive method called a contrastive variational autoencoder (cVAE) is introduced in [AZ19, SGN19]. The paper [WBWL22] presents a method for cVAE using maximum mean discrepancy to prevent leakage of information between the two sets of latent variables. Identifiability of cVAE is studied using results from nonlinear ICA in [LHH⁺24]. These works produce a nonlinear latent encoding of data, whereas our focus is on pattern vectors to describe observed variables.

3. Hierarchical tensor decomposition

We introduce a hierarchical tensor decomposition (HTD) that decomposes an order four tensor via recursive eigendecompositions. The idea is to find a low-rank approximation of a tensor, whose rank one summands offer an interpretable basis on which to project data. Later, we use the decomposition for cICA. In this section, we define the decomposition and study its properties.

3.1. The HTD algorithm

Consider a symmetric tensor $T$ of format $p\times p\times p\times p$ . We compute a rank $r$ approximation,

(5)

T\approx\sum_{i=1}^{r}\nu_{i}\mathbf{b}_{i}^{\otimes 4},

as follows. Let $\operatorname{Mat}(T)$ be the flattening of $T$ that rearranges its $p^{4}$ entries into a matrix of size $p^{2}\times p^{2}$ . The entries of $\operatorname{Mat}(T)$ are indexed $((i_{1},i_{2}),(j_{1},j_{2}))$ , where $i_{1},i_{2},j_{1},j_{2}\in[p]:=\{1,\ldots,p\}$ . We compute the approximation (5) by first computing the eigendecomposition of $\operatorname{Mat}(T)$ , whose eigenvectors lie in $\mathbb{R}^{p^{2}}$ , and then by reshaping these eigenvectors into $p\times p$ matrices and computing their top eigenvalue and corresponding eigenvector. By top eigenvalues we mean those of highest magnitude. To our knowledge, this decomposition has not been studied before, but it has connections to the hierarchical tensor representations of [Hac12, Chapter 11] and to the PARATREE model in [SRK09], see Subsection 3.3. Here is the algorithm.

Algorithm 1 Compute unit vectors $\mathbf{b}_{1},\ldots,\mathbf{b}_{r}$ such that $T\approx\sum_{i=1}^{r}\nu_{i}\mathbf{b}_{i}^{\otimes 4}$

1: Input: symmetric tensor $T$ of format $p\times p\times p\times p$ and rank $r$ .

2: Compute the eigendecomposition of the $p^{2}\times p^{2}$ flattening $\operatorname{Mat}(T)$ . Take the top $r$ eigenvalues $\mu_{1},\ldots,\mu_{r}$ , with corresponding eigenvectors $\mathbf{v}_{1},\ldots,\mathbf{v}_{r}\in\mathbb{R}^{p^{2}}$ of unit length.

3: For each $i\in[r]$ , reshape $\mathbf{v}_{i}\in\mathbb{R}^{p^{2}}$ into a matrix $M_{i}\in\mathbb{R}^{p\times p}$ .

4: For each $M_{i}$ , find the top eigenvalue $\beta_{i}$ and a corresponding unit length eigenvector $\mathbf{b}_{i}\in\mathbb{R}^{p}$ .

5: Output: rank $r$ decomposition $\sum_{i=1}^{r}(\mu_{i}\beta_{i}^{2})\mathbf{b}_{i}^{\otimes 4}$ .
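The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the algorithm rather than the authors' implementation: the names `htd` and `rank_one_sum` are ours, and the explicit symmetrization of each reshaped matrix is an added guard against round-off. As a sanity check, the sketch is applied to a tensor whose rank one components are orthonormal, a case in which the reconstruction is expected to be exact.

```python
import numpy as np

def htd(T, r):
    """Rank-r hierarchical decomposition of a symmetric p x p x p x p tensor T."""
    p = T.shape[0]
    mat = T.reshape(p * p, p * p)                 # flattening Mat(T)
    evals, evecs = np.linalg.eigh(mat)            # Mat(T) is symmetric
    order = np.argsort(-np.abs(evals))[:r]        # top r eigenvalues by magnitude
    coeffs, vectors = [], []
    for idx in order:
        M = evecs[:, idx].reshape(p, p)           # reshape v_i into M_i
        M = (M + M.T) / 2                         # guard against round-off asymmetry
        w, U = np.linalg.eigh(M)
        k = np.argmax(np.abs(w))                  # top eigenvalue beta_i of M_i
        coeffs.append(evals[idx] * w[k] ** 2)     # coefficient mu_i * beta_i^2
        vectors.append(U[:, k])                   # unit eigenvector b_i
    return np.array(coeffs), np.column_stack(vectors)

def rank_one_sum(coeffs, vectors):
    """Assemble sum_i coeffs[i] * b_i^{(x)4} from the columns b_i of `vectors`."""
    return np.einsum("i,ai,bi,ci,di->abcd", coeffs, vectors, vectors, vectors, vectors)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # orthonormal columns
nu = np.array([3.0, -2.0, 1.0])                   # distinct coefficients
T = rank_one_sum(nu, Q[:, :3])

coeffs, vectors = htd(T, 3)
T_hat = rank_one_sum(coeffs, vectors)
print(np.linalg.norm(T - T_hat))
```

The returned vectors are determined only up to sign, which does not affect the order four rank one terms.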

We record some observations pertaining to Algorithm 1. The matrix $\operatorname{Mat}(T)\in\mathbb{R}^{p^{2}\times p^{2}}$ is symmetric since $T$ is symmetric. The matrices $M_{1},\ldots,M_{r}\in\mathbb{R}^{p\times p}$ are also symmetric, because the vectors $\mathbf{v}_{1},\ldots,\mathbf{v}_{r}$ are in the column space of $\operatorname{Mat}(T)$ , whose $(i_{1},i_{2})$ -th row coincides with its $(i_{2},i_{1})$ -th row. Although the output vectors $\mathbf{b}_{i}$ are in general not orthogonal, as each is an eigenvector of a distinct matrix, they can be nearly orthogonal in practice, see Section 3.2. This is because they are the leading eigenvectors of matrices that have been reshaped from orthogonal vectors $\mathbf{v}_{i}$ .

Example 3.1 ( $2\times 2\times 2\times 2$ example).

Let $r=2$ . Fix

T=2\begin{bmatrix}1\\ 0\end{bmatrix}^{\otimes 4}+\begin{bmatrix}0.0998\\ 0.995\end{bmatrix}^{\otimes 4}.\qquad\text{Then}\qquad\mathrm{Mat}(T)=\begin{bmatrix}2.0001&0.0010&0.0010&0.0099\\ 0.0010&0.0099&0.0099&0.0983\\ 0.0010&0.0099&0.0099&0.0983\\ 0.0099&0.0983&0.0983&0.9801\end{bmatrix}

with eigenvalues $\mu_{1}=2.00019,\mu_{2}=0.99977$ and associated eigenvectors

\mathbf{v}_{1}^{\mathsf{T}}\approx\begin{bmatrix}0.99995&0.00098&0.00098&0.00985\end{bmatrix},\,\mathbf{v}_{2}^{\mathsf{T}}\approx\begin{bmatrix}-0.00995&0.0993&0.0993&0.99003\end{bmatrix}.

Their corresponding matrices $M_{1},M_{2}\in\mathbb{R}^{2\times 2}$ are symmetric with top eigenvalues $\beta_{1}=0.99995$ and $\beta_{2}=0.9998$ , respectively, with associated eigenvectors $\mathbf{b}_{1}^{\mathsf{T}}=\begin{bmatrix}0.99999&0.00099\end{bmatrix}$ and $\mathbf{b}_{2}^{\mathsf{T}}=\begin{bmatrix}0.09787&0.99519\end{bmatrix}$ . The HTD algorithm with input $T$ and $r=2$ thus outputs

(6)

\sum_{i=1}^{2}(\mu_{i}\beta_{i}^{2})\mathbf{b}_{i}^{\otimes 4}=1.99999\begin{bmatrix}0.99999\\ 0.00099\end{bmatrix}^{\otimes 4}+0.99937\begin{bmatrix}0.09787\\ 0.99519\end{bmatrix}^{\otimes 4}.

We note the similarity to the input tensor $T$ .
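The numbers in Example 3.1 can be reproduced approximately with a few lines of NumPy. The sketch below follows the steps of Algorithm 1 directly on the $2\times 2\times 2\times 2$ tensor; variable names are ours, and agreement with the rounded values above should be expected only to about three decimal places.

```python
import numpy as np

a1 = np.array([1.0, 0.0])
a2 = np.array([0.0998, 0.995])
T = (2.0 * np.einsum("a,b,c,d->abcd", a1, a1, a1, a1)
     + np.einsum("a,b,c,d->abcd", a2, a2, a2, a2))

mat = T.reshape(4, 4)                          # flattening Mat(T)
evals, evecs = np.linalg.eigh(mat)
order = np.argsort(-np.abs(evals))[:2]         # top two eigenvalues by magnitude
mu = evals[order]

betas, bs = [], []
for idx in order:
    M = evecs[:, idx].reshape(2, 2)            # reshaped eigenvector M_i
    w, U = np.linalg.eigh(M)
    k = np.argmax(np.abs(w))                   # top eigenvalue of M_i
    betas.append(w[k])
    bs.append(U[:, k])
betas = np.array(betas)
coeffs = mu * betas ** 2                       # coefficients mu_i * beta_i^2

print(np.round(mu, 5), np.round(coeffs, 5))
print(np.round(np.abs(bs[0]), 5), np.round(np.abs(bs[1]), 5))
```

Absolute values are printed for the vectors because an eigenvector is determined only up to sign.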

3.2. Properties of the decomposition

The HTD algorithm outputs a rank $r$ approximation of a tensor. In certain cases, the output closely approximates the input tensor, as in Example 3.1. We bound the distance between the HTD approximation and the input tensor. We give a bound that applies to all tensors in Proposition 3.2. We show that the input and output coincide for orthogonally decomposable tensors in Proposition 3.3. Our main result is Theorem 3.4, which bounds the distance between an input and output tensor for a tensor decomposition involving vectors that are close to orthogonal.

The norm $\|\cdot\|$ refers to the Frobenius norm for matrices and tensors and the $2$ -norm for vectors; i.e., the square root of the sum of the squares of the entries. The $2$ -norm of a matrix is denoted by $\|\cdot\|_{2}$ .

Proposition 3.2.

Let $T$ be a symmetric tensor of format $p\times p\times p\times p$ . Let $T^{\prime}=\sum_{i=1}^{r}(\mu_{i}\beta_{i}^{2})\mathbf{b}_{i}^{\otimes 4}$ be the rank $r$ HTD approximation of $T$ . Then

\|T^{\prime}-T\|\leq{\left(\sum_{i=r+1}^{q}\mu_{i}^{2}\right)}^{\frac{1}{2}}+\sum_{i=1}^{r}|\mu_{i}|(1+|\beta_{i}|){\left(\sum_{j=2}^{r_{i}}(\beta_{i}^{(j)})^{2}\right)}^{\frac{1}{2}},

where $q$ is the rank of $\operatorname{Mat}(T)$ , $r_{i}$ is the rank of $M_{i}$ , the numbers $\mu_{1},\ldots,\mu_{q}$ are the non-zero eigenvalues of $\operatorname{Mat}(T)$ in descending order of magnitude, and $\beta_{i}:=\beta_{i}^{(1)}$ is the highest magnitude eigenvalue of $M_{i}$ with $\beta_{i}^{(2)},\ldots,\beta_{i}^{(r_{i})}$ the other non-zero eigenvalues.

Proof.

We use the notation from Algorithm 1. We have

\|\operatorname{Mat}(T)-\sum_{i=1}^{r}\mu_{i}\mathbf{v}_{i}^{\otimes 2}\|^{2}=\sum_{i=r+1}^{q}\mu_{i}^{2}\qquad\text{and}\qquad\|M_{i}-\beta_{i}\mathbf{b}_{i}^{\otimes 2}\|^{2}=\sum_{j=2}^{r_{i}}(\beta_{i}^{(j)})^{2},

from the properties of the eigendecomposition of a symmetric matrix and the Frobenius norm. Let $T^{\prime\prime}$ be the $p\times p\times p\times p$ tensor obtained from reshaping the truncated eigendecomposition $\sum_{i=1}^{r}\mu_{i}\mathbf{v}_{i}^{\otimes 2}$ of $\operatorname{Mat}(T)$ . Then $\|T-T^{\prime\prime}\|^{2}=\sum_{i=r+1}^{q}\mu_{i}^{2}$ . Let $\mathbf{B}_{i}\in\mathbb{R}^{p^{2}}$ be the vectorization of $\mathbf{b}_{i}^{\otimes 2}\in\mathbb{R}^{p\times p}$ . Then

\begin{aligned}
\|T^{\prime\prime}-T^{\prime}\| &= \|\sum_{i=1}^{r}\mu_{i}(\mathbf{v}_{i}^{\otimes 2}-\beta_{i}^{2}\mathbf{B}_{i}^{\otimes 2})\| \\
&\leq \sum_{i=1}^{r}|\mu_{i}|\,\|\mathbf{v}_{i}^{\otimes 2}-\beta_{i}^{2}\mathbf{B}_{i}^{\otimes 2}\| \\
&\leq \sum_{i=1}^{r}|\mu_{i}|\,(\|\mathbf{v}_{i}^{\otimes 2}-\beta_{i}\mathbf{B}_{i}\otimes\mathbf{v}_{i}\|+\|\beta_{i}^{2}\mathbf{B}_{i}^{\otimes 2}-\beta_{i}\mathbf{B}_{i}\otimes\mathbf{v}_{i}\|) \\
&= \sum_{i=1}^{r}|\mu_{i}|\,(\|\mathbf{v}_{i}\|+|\beta_{i}|\,\|\mathbf{B}_{i}\|)\,\|\mathbf{v}_{i}-\beta_{i}\mathbf{B}_{i}\| \\
&= \sum_{i=1}^{r}|\mu_{i}|(1+|\beta_{i}|){\left(\sum_{j=2}^{r_{i}}(\beta_{i}^{(j)})^{2}\right)}^{\frac{1}{2}},
\end{aligned}

where the penultimate equality follows from $\|\mathbf{x}\otimes\mathbf{y}\|=\|\mathbf{x}\|\cdot\|\mathbf{y}\|$ and the last equality uses $\|\mathbf{v}_{i}\|=\|\mathbf{B}_{i}\|=1$ . We conclude by the triangle inequality $\|T-T^{\prime}\|\leq\|T-T^{\prime\prime}\|+\|T^{\prime\prime}-T^{\prime}\|$ . ∎

The quantity in Proposition 3.2 is small if $\operatorname{Mat}(T)$ is well-approximated by a matrix of rank $r$ , and each $M_{i}$ is well-approximated by a matrix of rank one. Orthogonally decomposable tensors are those with a decomposition into orthogonal rank one terms; that is, a decomposition $T=\sum_{i=1}^{r}\nu_{i}\mathbf{b}_{i}^{\otimes 4}$ , where $\mathbf{b}_{1},\ldots,\mathbf{b}_{r}$ are orthonormal [Rob16]. For orthogonally decomposable tensors, HTD recovers the exact decomposition.

Proposition 3.3.

Let $T=\sum_{i=1}^{r}\nu_{i}\mathbf{b}_{i}^{\otimes 4}$ , where the vectors $\mathbf{b}_{1},\ldots,\mathbf{b}_{r}$ are orthonormal and the coefficients $\nu_{1},\ldots,\nu_{r}$ are distinct. Then the rank $r$ HTD approximation is the tensor $T$ .

Proof.

The flattening $\operatorname{Mat}(T)$ has decomposition $\sum_{i=1}^{r}\nu_{i}\mathbf{B}_{i}^{\otimes 2}$ , where $\mathbf{B}_{i}\in\mathbb{R}^{p^{2}}$ is the vectorization of $\mathbf{b}_{i}^{\otimes 2}\in\mathbb{R}^{p\times p}$ . We have the orthogonality $\langle\mathbf{B}_{i},\mathbf{B}_{j}\rangle=\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle^{2}=0$ for all $i\neq j$ , since the vectors $\mathbf{b}_{i},\mathbf{b}_{j}$ are orthogonal. Hence this expression for $\operatorname{Mat}(T)$ is a sum of outer products of orthogonal vectors, so it is the eigendecomposition of $\operatorname{Mat}(T)$ . The matrix reshaped from the eigenvector $\mathbf{B}_{i}$ is $M_{i}=\mathbf{b}_{i}^{\otimes 2}$ . It has top eigenvalue $1$ with corresponding eigenvector $\mathbf{b}_{i}$ . Hence the output of HTD is $\sum_{i=1}^{r}\nu_{i}\mathbf{b}_{i}^{\otimes 4}$ . ∎

We extend Proposition 3.3 to decompositions where the vectors $\mathbf{b}_{i}$ are close to orthogonal.

Theorem 3.4.

Let $r=\mathcal{O}(1)$ . Fix $T=\sum_{i=1}^{r}\nu_{i}\mathbf{b}_{i}^{\otimes 4}$ with $\nu_{i}$ distinct and in decreasing order of magnitude and $\mathbf{b}_{i}$ unit vectors with $\mathbf{b}_{1}^{\otimes 2},\ldots,\mathbf{b}_{r}^{\otimes 2}$ linearly independent. Assume that $|\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle|\leq\epsilon$ for all $i\neq j$ , for small $\epsilon>0$ . Suppose the output of rank $r$ HTD applied to $T$ is $T^{\prime}=\sum_{i=1}^{r}\nu_{i}^{\prime}\mathbf{b}_{i}^{\prime\otimes 4}$ with $\nu_{i}^{\prime}$ in decreasing order of magnitude and $\mathbf{b}_{i}^{\prime}$ unit vectors. Then

{|\nu_{i}-\nu^{\prime}_{i}|}=|\nu_{i}|\left(\frac{K}{\nu}\right)^{\frac{1}{2}}\mathcal{O}(\epsilon)\quad\text{and}\quad\min\{\|\mathbf{b}_{i}-\mathbf{b}_{i}^{\prime}\|,\|\mathbf{b}_{i}+\mathbf{b}_{i}^{\prime}\|\}=\left({\frac{K}{\nu}}\right)^{\frac{1}{4}}\mathcal{O}(\epsilon^{\frac{1}{2}}),

where $\nu=\min\{|\nu_{i}-\nu_{j}|,|\nu_{i}|:1\leq i<j\leq r\}$ and $K=\sum_{i=1}^{r}|\nu_{i}|(2^{i+1}-4)$ .

Note that the quantity $\min\{\|\mathbf{b}_{i}-\mathbf{b}_{i}^{\prime}\|,\|\mathbf{b}_{i}+\mathbf{b}_{i}^{\prime}\|\}$ arises in Theorem 3.4 because of the sign indeterminacy in the vectors in the decompositions, due to the equality $(-\mathbf{b}_{i})^{\otimes d}=\mathbf{b}_{i}^{\otimes d}$ for $d$ even.

We prove Theorem 3.4 via two lemmas. The condition that the matrices $\mathbf{b}_{1}^{\otimes 2},\ldots,\mathbf{b}_{r}^{\otimes 2}$ are linearly independent ensures that $\operatorname{Mat}(T)$ has rank $r$ . This condition holds for generic vectors $\mathbf{b}_{i}$ , provided $r\leq{p+1\choose 2}$ .
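The genericity claim can be probed numerically. The sketch below, with a hypothetical helper `flattening_rank`, stacks the vectorizations of $\mathbf{b}_{i}^{\otimes 2}$ for random unit vectors and computes the rank of the resulting matrix, which equals the rank of $\operatorname{Mat}(T)$ when all coefficients $\nu_{i}$ are non-zero. It uses $p=3$ , for which ${p+1\choose 2}=6$ .

```python
import numpy as np

def flattening_rank(p, r, rng):
    """Rank of the r x p^2 matrix whose rows are vec(b_i b_i^T) for random unit b_i."""
    b = rng.standard_normal((r, p))
    b /= np.linalg.norm(b, axis=1, keepdims=True)        # normalize each b_i
    B = np.einsum("ia,ib->iab", b, b).reshape(r, p * p)  # rows are vec(b_i b_i^T)
    return np.linalg.matrix_rank(B)

rng = np.random.default_rng(0)
p = 3
max_r = p * (p + 1) // 2                                 # dimension of symmetric p x p matrices
rank_at_bound = flattening_rank(p, max_r, rng)           # r = 6 random vectors
rank_above = flattening_rank(p, max_r + 1, rng)          # r = 7 random vectors
print(rank_at_bound, rank_above)
```

Beyond ${p+1\choose 2}$ vectors the rank cannot grow further, since every $\mathbf{b}_{i}^{\otimes 2}$ lies in the space of symmetric $p\times p$ matrices.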

Lemma 3.5.

Assume the hypotheses of Theorem 3.4. Let $\mathbf{B}_{i}$ be the vectorization of $\mathbf{b}_{i}^{\otimes 2}$ for $i\in[r]$ and define $M=\sum_{i=1}^{r}\nu_{i}\mathbf{B}_{i}^{\otimes 2}$ , the flattening of tensor $T$ . Then there exists a matrix $M^{\prime}$ with eigendecomposition

(7)

M^{\prime}=\sum_{i=1}^{r}\nu_{i}(\mathbf{B}_{i}^{\prime})^{\otimes 2}

such that $\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|\leq(2^{i}-2)\epsilon^{2}$ for $i\in[r]$ and $\|M-M^{\prime}\|\leq(\sum_{i=1}^{r}|\nu_{i}|(2^{i+1}-4))\epsilon^{2}.$

Proof.

Our first goal is to construct the vectors $\mathbf{B}_{i}^{\prime}\in\mathbb{R}^{p^{2}}$ . We generate orthogonal vectors via Gram-Schmidt:

\mathbf{B}_{j}^{\prime\prime}:=\mathbf{B}_{j}-\sum_{i=1}^{j-1}\langle\mathbf{B}_{i}^{\prime\prime},\mathbf{B}_{j}\rangle\mathbf{B}_{i}^{\prime\prime}.

The vectors $\mathbf{B}_{i}$ satisfy $\|\mathbf{B}_{i}\|=1$ for all $i$ and $|\langle\mathbf{B}_{i},\mathbf{B}_{j}\rangle|=|\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle|^{2}\leq\epsilon^{2}$ for all $i\neq j$ . We will prove by induction on $i$ that

\|\mathbf{B}_{i}^{\prime\prime}-\mathbf{B}_{i}\|\leq(2^{i-1}-1)\epsilon^{2}\ \text{ and }\ |\langle\mathbf{B}_{i}^{\prime\prime},\mathbf{B}_{j}\rangle|\leq 2^{i-1}\epsilon^{2}\text{ for all }j>i.

The proofs rely on the inequalities

\displaystyle\|\mathbf{B}_{j}^{\prime\prime}-\mathbf{B}_{j}\|\leq\sum_{i=1}^{j-1}|\langle\mathbf{B}_{i}^{\prime\prime},\mathbf{B}_{j}\rangle|

and

\displaystyle|\langle\mathbf{B}_{i}^{\prime\prime},\mathbf{B}_{j}\rangle|=|\langle\mathbf{B}_{i},\mathbf{B}_{j}\rangle+\langle\mathbf{B}^{\prime\prime}_{i}-\mathbf{B}_{i},\mathbf{B}_{j}\rangle|\leq|\langle\mathbf{B}_{i},\mathbf{B}_{j}\rangle|+\|\mathbf{B}_{i}^{\prime\prime}-\mathbf{B}_{i}\|,

where the last inequality uses Cauchy-Schwarz. When $i=1$ , $\|\mathbf{B}_{i}^{\prime\prime}-\mathbf{B}_{i}\|=0\leq(2^{0}-1)\epsilon^{2}$ and $|\langle\mathbf{B}_{1}^{\prime\prime},\mathbf{B}_{j}\rangle|\leq\epsilon^{2}+0=2^{1-1}\epsilon^{2}$ , as desired. Suppose the statement is true for all $k<i$ . Then,

\displaystyle\|\mathbf{B}_{i}^{\prime\prime}-\mathbf{B}_{i}\|\leq\sum_{k=1}^{i-1}|\langle\mathbf{B}_{k}^{\prime\prime},\mathbf{B}_{i}\rangle|\leq\sum_{k=1}^{i-1}2^{k-1}\epsilon^{2}=(2^{i-1}-1)\epsilon^{2}

and

\displaystyle|\langle\mathbf{B}_{i}^{\prime\prime},\mathbf{B}_{j}\rangle|\leq\epsilon^{2}+(2^{i-1}-1)\epsilon^{2}=2^{i-1}\epsilon^{2}.

This concludes the induction. We now define $\mathbf{B}_{i}^{\prime}=\frac{1}{\|\mathbf{B}_{i}^{\prime\prime}\|}\mathbf{B}_{i}^{\prime\prime}$ . Then $\mathbf{B}^{\prime}_{1},\ldots,\mathbf{B}^{\prime}_{r}$ are orthonormal, hence (7) is the eigendecomposition of $M^{\prime}$ . Moreover,

\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|\leq\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}^{\prime\prime}\|+\|\mathbf{B}_{i}^{\prime\prime}-\mathbf{B}_{i}\|=|\|\mathbf{B}_{i}^{\prime\prime}\|-1|+\|\mathbf{B}_{i}^{\prime\prime}-\mathbf{B}_{i}\|\leq 2\|\mathbf{B}_{i}^{\prime\prime}-\mathbf{B}_{i}\|\leq(2^{i}-2)\epsilon^{2}.

It remains to bound $\|M-M^{\prime}\|$ . We have $\|M-M^{\prime}\|\leq\sum_{i=1}^{r}|\nu_{i}|\|(\mathbf{B}_{i}^{\prime})^{\otimes 2}-\mathbf{B}_{i}^{\otimes 2}\|$ . The summands satisfy

\begin{aligned}
\|(\mathbf{B}_{i}^{\prime})^{\otimes 2}-\mathbf{B}_{i}^{\otimes 2}\| &\leq \|(\mathbf{B}_{i}^{\prime})^{\otimes 2}-\mathbf{B}_{i}\otimes\mathbf{B}_{i}^{\prime}\|+\|\mathbf{B}_{i}\otimes\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}^{\otimes 2}\| \\
&= \|\mathbf{B}_{i}^{\prime}\|\cdot\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|+\|\mathbf{B}_{i}\|\cdot\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\| \\
&= 2\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|\leq(2^{i+1}-4)\epsilon^{2},
\end{aligned}

where the first line follows from the triangle inequality, the second from $\|\mathbf{a}\otimes\mathbf{b}\|=\|\mathbf{a}\|\cdot\|\mathbf{b}\|$ , and the third from $\|\mathbf{B}_{i}^{\prime}\|=\|\mathbf{B}_{i}\|=1$ together with the bound on $\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|$ above. Collecting the summands proves the claim. ∎

Lemma 3.6.

Assume the setup of Theorem 3.4 and let $M^{\prime}$ be the matrix from Lemma 3.5. Then the eigenvalues and eigenvectors of $M$ and $M^{\prime}$ are close: let $\sum_{i=1}^{r}\mu_{i}\mathbf{v}_{i}^{\otimes 2}$ be the eigendecomposition of $M$ , then

|\mu_{i}-\nu_{i}|\leq K\epsilon^{2}\quad\text{and}\quad\min\{\|\mathbf{v}_{i}-\mathbf{B}_{i}^{\prime}\|,\|\mathbf{v}_{i}+\mathbf{B}_{i}^{\prime}\|\}\leq 2\sqrt{\frac{K}{\nu}}\epsilon,

where $\nu_{1},\ldots,\nu_{r}$ and $\mu_{1},\ldots,\mu_{r}$ are ordered by decreasing magnitude and $K=\sum_{i=1}^{r}|\nu_{i}|(2^{i+1}-4)$ .

Proof.

We have the bound

|\mu_{i}-\nu_{i}|\leq\|M-M^{\prime}\|\leq K\epsilon^{2},

where the first inequality follows from Weyl’s inequality [Wey12] and the second from Lemma 3.5. For the similarity of eigenvectors, we lower bound $|\langle\mathbf{B}_{i}^{\prime},\mathbf{v}_{i}\rangle|$ using the difference $M^{\prime}\mathbf{v}_{i}-\nu_{i}\mathbf{v}_{i}$ . We have

\begin{aligned}
\|M^{\prime}\mathbf{v}_{i}-\nu_{i}\mathbf{v}_{i}\| &= \|(M^{\prime}-M)\mathbf{v}_{i}+(\mu_{i}-\nu_{i})\mathbf{v}_{i}\| \\
&\leq \|M^{\prime}-M\|_{2}+|\mu_{i}-\nu_{i}| \\
&\leq \|M^{\prime}-M\|+|\mu_{i}-\nu_{i}|\leq 2K\epsilon^{2},
\end{aligned}

where the first inequality follows from the triangle inequality and the definition of the 2-norm and the second inequality follows from the fact that the 2-norm for matrices is bounded above by the Frobenius norm. If $r<p^{2}$ , we complete $\mathbf{B}_{1}^{\prime},\ldots,\mathbf{B}_{r}^{\prime}$ to an orthonormal basis $\mathbf{B}_{1}^{\prime},\ldots,\mathbf{B}_{p^{2}}^{\prime}$ of $\mathbb{R}^{p^{2}}$ . Then $\mathbf{v}_{i}=\sum_{j=1}^{p^{2}}\langle\mathbf{B}_{j}^{\prime},\mathbf{v}_{i}\rangle\mathbf{B}_{j}^{\prime}$ . Hence

M^{\prime}\mathbf{v}_{i}-\nu_{i}\mathbf{v}_{i}=\sum_{j=1}^{r}(\nu_{j}-\nu_{i})\langle\mathbf{B}_{j}^{\prime},\mathbf{v}_{i}\rangle\mathbf{B}_{j}^{\prime}-\nu_{i}\sum_{j=r+1}^{p^{2}}\langle\mathbf{B}_{j}^{\prime},\mathbf{v}_{i}\rangle\mathbf{B}_{j}^{\prime}.

Then $\|M^{\prime}\mathbf{v}_{i}-\nu_{i}\mathbf{v}_{i}\|^{2}\geq\sum_{j=1,j\neq i}^{p^{2}}\nu^{2}\langle\mathbf{B}_{j}^{\prime},\mathbf{v}_{i}\rangle^{2}=\nu^{2}(1-\langle\mathbf{B}_{i}^{\prime},\mathbf{v}_{i}\rangle^{2})$ , where $\nu=\min\{|\nu_{i}-\nu_{j}|,|\nu_{i}|:1\leq i<j\leq r\}$ . Then

|\langle\mathbf{B}_{i}^{\prime},\mathbf{v}_{i}\rangle|\geq{\left(1-(\frac{2K}{\nu}\epsilon^{2})^{2}\right)}^{\frac{1}{2}}\geq 1-\frac{2K}{\nu}\epsilon^{2}.

Since $\|\mathbf{v}_{i}\pm\mathbf{B}_{i}^{\prime}\|^{2}=\|\mathbf{v}_{i}\|^{2}+\|\mathbf{B}_{i}^{\prime}\|^{2}\pm 2\langle\mathbf{B}_{i}^{\prime},\mathbf{v}_{i}\rangle$ , we have

\min\{\|\mathbf{v}_{i}-\mathbf{B}_{i}^{\prime}\|,\|\mathbf{v}_{i}+\mathbf{B}_{i}^{\prime}\|\}=\sqrt{\|\mathbf{v}_{i}\|^{2}+\|\mathbf{B}_{i}^{\prime}\|^{2}-2|\langle\mathbf{B}_{i}^{\prime},\mathbf{v}_{i}\rangle|}\leq 2\sqrt{\frac{K}{\nu}}\epsilon.\qed
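As a numerical sanity check of Lemmas 3.5 and 3.6, the sketch below evaluates both sides of the stated bounds on one small example. The vectors, the coefficients $(3,2,1)$ , and $\epsilon=0.1$ are arbitrary choices of ours. The orthonormalization projects onto already normalized vectors (classical Gram-Schmidt), which is our reading of the construction in the proof of Lemma 3.5.

```python
import numpy as np

eps = 0.1
c = np.sqrt(1.0 - eps ** 2)
b = np.array([[1.0, 0.0, 0.0],      # unit vectors with pairwise |<b_i, b_j>| <= eps
              [eps, c, 0.0],
              [eps, 0.0, c]])
nu = np.array([3.0, 2.0, 1.0])      # distinct, decreasing in magnitude
r = len(nu)

B = np.einsum("ia,ib->iab", b, b).reshape(r, -1)    # rows are vec(b_i b_i^T)

B_prime = np.zeros_like(B)                          # Gram-Schmidt, then normalize
for j in range(r):
    v = B[j].copy()
    for i in range(j):
        v -= (B_prime[i] @ B[j]) * B_prime[i]
    B_prime[j] = v / np.linalg.norm(v)

M = (B.T * nu) @ B                                  # sum_i nu_i B_i B_i^T
M_prime = (B_prime.T * nu) @ B_prime                # sum_i nu_i B'_i B'_i^T

idx = np.arange(1, r + 1)
K = np.sum(np.abs(nu) * (2.0 ** (idx + 1) - 4.0))
pair_gaps = np.abs(nu[:, None] - nu[None, :])[np.triu_indices(r, 1)]
gap = min(pair_gaps.min(), np.abs(nu).min())        # the quantity called nu in the text

evals, evecs = np.linalg.eigh(M)
order = np.argsort(-np.abs(evals))[:r]
mu, V = evals[order], evecs[:, order].T
vec_err = np.array([min(np.linalg.norm(V[i] - B_prime[i]),
                        np.linalg.norm(V[i] + B_prime[i])) for i in range(r)])

lemma35_vec = np.linalg.norm(B_prime - B, axis=1) <= (2.0 ** idx - 2.0) * eps ** 2 + 1e-12
lemma35_mat = np.linalg.norm(M - M_prime) <= K * eps ** 2
lemma36_val = np.abs(mu - nu) <= K * eps ** 2
lemma36_vec = vec_err <= 2.0 * np.sqrt(K / gap) * eps
print(lemma35_vec, lemma35_mat, lemma36_val, lemma36_vec)
```

A single example cannot confirm the lemmas, but it illustrates how the quantities $K$ and $\nu$ enter the bounds.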

Proof of Theorem 3.4.

For $i\in[r]$ , the matrix $M_{i}\in\mathbb{R}^{p\times p}$ reshapes the unit vector $\mathbf{v}_{i}\in\mathbb{R}^{p^{2}}$ into a $p\times p$ matrix. The output of HTD is $T^{\prime}$ as in the statement, where $\mathbf{b}_{1}^{\prime},\ldots,\mathbf{b}_{r}^{\prime}$ are the eigenvectors of $M_{1},\ldots,M_{r}$ , respectively. The output of HTD has $r$ non-zero summands, since the matrix $\operatorname{Mat}(T)$ has rank $r$ . This follows from the fact that vectors $\mathbf{b}_{1}^{\otimes 2},\ldots,\mathbf{b}_{r}^{\otimes 2}$ are linearly independent. Then,

\|M_{i}-\mathbf{b}_{i}^{\otimes 2}\|=\|\mathbf{v}_{i}-\mathbf{B}_{i}\|\leq\|\mathbf{v}_{i}-\mathbf{B}_{i}^{\prime}\|+\|\mathbf{B}_{i}-\mathbf{B}_{i}^{\prime}\|\leq 2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2}.

We use the distance between $M_{i}$ and $\mathbf{b}_{i}^{\otimes 2}$ to bound the distance between their eigenvectors and eigenvalues:

|\beta_{i}-1|\leq\|M_{i}-\mathbf{b}_{i}^{\otimes 2}\|\leq 2\sqrt{\frac{K}{\nu}% }\epsilon+(2^{i}-2)\epsilon^{2},

by Weyl’s inequality. Note from Algorithm 1 that $\nu_{i}^{\prime}=\mu_{i}\beta_{i}^{2}$ . Hence

\begin{aligned}
\|\nu_{i}^{\prime}-\nu_{i}\| &\leq\|\mu_{i}\beta_{i}^{2}-\mu_{i}\|+\|\mu_{i}-\nu_{i}\|\\
&\leq\|\mu_{i}(\beta_{i}+1)\|\cdot\|\beta_{i}-1\|+\|\mu_{i}-\nu_{i}\|\\
&\leq(\|\nu_{i}\|+\|\nu_{i}-\mu_{i}\|)(2+\|\beta_{i}-1\|)\|\beta_{i}-1\|+\|\mu_{i}-\nu_{i}\|\\
&\leq(\|\nu_{i}\|+K\epsilon^{2})(2+2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2})(2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2})+K\epsilon^{2}\\
&=\|\nu_{i}\|\sqrt{\frac{K}{\nu}}\mathcal{O}(\epsilon),
\end{aligned}

where the first and third inequalities follow from the triangle inequality, the second from the factorization $\beta_{i}^{2}-1=(\beta_{i}+1)(\beta_{i}-1)$, and the fourth from substituting in the bounds on $|\beta_{i}-1|$ and $|\nu_{i}-\mu_{i}|$ obtained above.

To bound $\|\mathbf{b}_{i}-\mathbf{b}_{i}^{\prime}\|$ , we consider $(\mathbf{b}_{i}^{\otimes 2})\mathbf{b}_{i}^{\prime}-\mathbf{b}_{i}^{\prime}$ . On the one hand, we have

\begin{aligned}
\|(\mathbf{b}_{i}^{\otimes 2})\mathbf{b}_{i}^{\prime}-\mathbf{b}_{i}^{\prime}\| &=\|(\mathbf{b}_{i}^{\otimes 2}-M_{i})\mathbf{b}_{i}^{\prime}+(M_{i}-I)\mathbf{b}_{i}^{\prime}\|\leq\|M_{i}-\mathbf{b}_{i}^{\otimes 2}\|_{2}+\|\beta_{i}-1\|\\
&\leq\|M_{i}-\mathbf{b}_{i}^{\otimes 2}\|+\|\beta_{i}-1\|\leq 2(2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2}).
\end{aligned}

On the other hand, $(\mathbf{b}_{i}^{\otimes 2})\mathbf{b}_{i}^{\prime}-\mathbf{b}_{i}^{\prime}=\mathbf{b}_{i}\langle\mathbf{b}_{i},\mathbf{b}_{i}^{\prime}\rangle-\mathbf{b}_{i}^{\prime}$ and $\|(\mathbf{b}_{i}^{\otimes 2})\mathbf{b}_{i}^{\prime}-\mathbf{b}_{i}^{\prime}\|^{2}=1-\langle\mathbf{b}_{i},\mathbf{b}_{i}^{\prime}\rangle^{2}$. So, $|\langle\mathbf{b}_{i},\mathbf{b}_{i}^{\prime}\rangle|\geq{\left(1-(2(2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2}))^{2}\right)}^{\frac{1}{2}}\geq 1-2(2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2})$. Hence,

\min\{\|\mathbf{b}_{i}-\mathbf{b}_{i}^{\prime}\|,\|\mathbf{b}_{i}+\mathbf{b}_{i}^{\prime}\|\}=\sqrt{2-2|\langle\mathbf{b}_{i},\mathbf{b}_{i}^{\prime}\rangle|}\leq 2{\left(2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2}\right)}^{\frac{1}{2}}.

The last expression simplifies to $({\frac{K}{\nu}})^{\frac{1}{4}}\mathcal{O}(\epsilon^{\frac{1}{2}})$ . ∎
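The eigenvector step of this proof can be checked numerically. The following is a minimal sketch, not part of the HTD code: it assumes a symmetric perturbation `E` of spectral norm `delta` (a stand-in for $M_{i}-\mathbf{b}_{i}^{\otimes 2}$), and the names `delta`, `rank_one` and `b_prime` are illustrative. It compares the leading eigenpair of the perturbed matrix with the bounds used above.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 6

# Unit vector b and the rank-one matrix b b^T (the role of b_i^{⊗2}).
b = rng.standard_normal(p)
b /= np.linalg.norm(b)
rank_one = np.outer(b, b)

# Symmetric perturbation rescaled to spectral norm delta (an assumption of this sketch).
delta = 0.05
E = rng.standard_normal((p, p))
E = (E + E.T) / 2
E *= delta / np.linalg.norm(E, 2)
M = rank_one + E

# Leading eigenpair (beta, b') of M; eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(M)
beta, b_prime = eigvals[-1], eigvecs[:, -1]

eigval_gap = abs(beta - 1.0)                              # Weyl: at most delta
residual = np.linalg.norm(rank_one @ b_prime - b_prime)   # at most 2 * delta
dist = min(np.linalg.norm(b - b_prime),
           np.linalg.norm(b + b_prime))                   # at most 2 * sqrt(delta)

print(eigval_gap, residual, dist)
```

The three printed quantities correspond to $|\beta_{i}-1|$, $\|(\mathbf{b}_{i}^{\otimes 2})\mathbf{b}_{i}^{\prime}-\mathbf{b}_{i}^{\prime}\|$ and $\min\{\|\mathbf{b}_{i}\pm\mathbf{b}_{i}^{\prime}\|\}$, with `delta` replacing the explicit bound on $\|M_{i}-\mathbf{b}_{i}^{\otimes 2}\|$.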

3.3. Comparison of HTD with other hierarchical tensor decompositions

We compare HTD in Algorithm 1 to other hierarchical tensor decompositions. The goal of hierarchical tensor decomposition [Hac12, Chapter 11] is to efficiently represent a tensor that lives in a high-dimensional space. Given a tensor of order $d$ , a hierarchical decomposition is based on a hierarchy of vector spaces given by a dimension partition tree on indices $\{1,\ldots,d\}$ , such as those in Figure 1.

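[Figure 1(a): dimension partition tree in which the root $\{1,2,\ldots,d\}$ splits into $\{1\}$ and $\{2,\ldots,d\}$, the node $\{2,\ldots,d\}$ splits off $\{2\}$, and so on, ending with the leaves $\{d-1\}$ and $\{d\}$.]

[Figure 1(b): dimension partition tree in which the root $\{1,2,3,4\}$ splits into $\{1,2\}$ and $\{3,4\}$, which split into the leaves $\{1\},\{2\}$ and $\{3\},\{4\}$, respectively.]

Figure 1. The dimension partition trees used in (a) the PARATREE algorithm of [SRK09] and (b) our HTD from Algorithm 1.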

Hierarchical tensor representations in [Hac12, Chapter 11] start at the leaves of the tree, which are labelled by single indices. One finds subspaces $U_{i}\subseteq\mathbb{R}^{n_{i}}$ such that the tensor is well-approximated by a tensor in the lower-dimensional space $U_{1}\otimes\cdots\otimes U_{d}\subset\mathbb{R}^{n_{1}}\otimes\cdots\otimes% \mathbb{R}^{n_{d}}$ . Proceeding from leaves to root, when two indices $\{i\}$ and $\{j\}$ combine to form the subset $\{i,j\}$ , the representation finds a subspace $U_{ij}\subset U_{i}\otimes U_{j}$ that well-approximates the tensor. This repeats until we have a low-dimensional subspace $U_{1\cdots d}\subseteq\mathbb{R}^{n_{1}}\otimes\cdots\otimes\mathbb{R}^{n_{d}}$ such that the tensor $T$ lies in this subspace to reasonable accuracy. Fixing ranks in the representation fixes the allowable dimension of the subspaces $U_{I}$ for the subsets $I\subseteq[d]$ in the tree. See [Hac12, Figure 11.1].

The PARATREE model starts at the root of the tree. For example, if the root is the splitting of $\{1,2,3\}$ into $\{1\}\cup\{2,3\}$ (i.e. Figure 1(a) in the case $d=3$), then one computes a decomposition of the flattened tensor in $\mathbb{R}^{n_{1}}\otimes\mathbb{R}^{n_{2}n_{3}}$ to give a sum $\sum_{i=1}^{r_{1}}\mathbf{u}_{i}\otimes\mathbf{x}_{i}$, with $\mathbf{u}_{i}\in\mathbb{R}^{n_{1}}$ and $\mathbf{x}_{i}\in\mathbb{R}^{n_{2}n_{3}}$. The second step is the splitting of indices $\{2,3\}=\{2\}\cup\{3\}$. This decomposes each vector $\mathbf{x}_{i}=\sum_{j=1}^{r_{2}}\mathbf{v}_{i,j}\otimes\mathbf{w}_{i,j}$, where $\mathbf{x}_{i}\in\mathbb{R}^{n_{2}n_{3}}$ is viewed as a matrix of size $n_{2}\times n_{3}$. This results in the decomposition

(8)

T=\sum_{i=1}^{r_{1}}\mathbf{u}_{i}\otimes\left(\sum_{j=1}^{r_{2}}\mathbf{v}_{i,j}\otimes\mathbf{w}_{i,j}\right).

This pattern can be continued for larger $d$ , see [SRK09, Equation 9].
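The root-to-leaves construction leading to (8) can be sketched for $d=3$ as follows. This is an illustration under stated assumptions rather than the implementation of [SRK09]: both splits are computed with truncated SVDs, and the function names `two_level_decomposition` and `reconstruct` are hypothetical.

```python
import numpy as np

def two_level_decomposition(T, r1, r2):
    """Split {1,2,3} into {1} | {2,3}, then {2,3} into {2} | {3}, via truncated SVDs."""
    n1, n2, n3 = T.shape
    # Root split: flatten to an n1 x (n2*n3) matrix and keep r1 terms u_i ⊗ x_i.
    U, s, Vt = np.linalg.svd(T.reshape(n1, n2 * n3), full_matrices=False)
    terms = []
    for i in range(r1):
        u_i = U[:, i]
        x_i = (s[i] * Vt[i]).reshape(n2, n3)      # x_i viewed as an n2 x n3 matrix
        # Second split: keep r2 terms v_{i,j} ⊗ w_{i,j} of x_i.
        P, t, Qt = np.linalg.svd(x_i, full_matrices=False)
        vw = [(t[j] * P[:, j], Qt[j]) for j in range(r2)]
        terms.append((u_i, vw))
    return terms

def reconstruct(terms, shape):
    """Sum the r1 * r2 rank-one terms u_i ⊗ v_{i,j} ⊗ w_{i,j} as in (8)."""
    T_hat = np.zeros(shape)
    for u_i, vw in terms:
        for v_ij, w_ij in vw:
            T_hat += np.einsum("a,b,c->abc", u_i, v_ij, w_ij)
    return T_hat

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 4, 5))
terms = two_level_decomposition(T, r1=3, r2=4)    # full ranks, so the sum is exact
T_hat = reconstruct(terms, T.shape)
print(np.linalg.norm(T - T_hat))
```

With $r_{1}$ and $r_{2}$ below the full ranks, the same code gives an approximation with $r_{1}r_{2}$ rank-one terms, which is the count that the symmetric HTD avoids.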

Our HTD takes as input a symmetric $p\times p\times p\times p$ tensor. We use the dimension partition tree in Figure 1(b). HTD can be viewed as a symmetric analogue of the PARATREE model, but differs in that it uses a different dimension partition tree, and leverages the symmetry of the tensor and decomposition to produce a rank $r$ decomposition, rather than the rank $r_{1}r_{2}$ (or, more generally, rank $r_{1}\cdots r_{d-1}$ ) decomposition obtained from (8). Compared to the hierarchical tensor representations of [Hac12, Chapter 11], it differs in that the tensor is symmetric and it uses the dimension partition tree from root to leaves rather than leaves to root.

4. Tensor decompositions for cICA

The cICA model assumes $\mathbf{y}=A\mathbf{z}$ and $\mathbf{x}=A\mathbf{z}^{\prime}+B\mathbf{s}$ , for $A\in\mathbb{R}^{p\times r}$ and $B\in\mathbb{R}^{p\times\ell}$ , see (3). This leads to the cICA tensor decompositions (4). We present two variants of cICA to compute the decompositions (4). One does not assume a relationship between $\mathbf{z}$ and $\mathbf{z}^{\prime}$ ; we call this general cICA, see Section 4.1. The other assumes the proportional relationship $\mathbf{z}^{\prime}=\gamma\mathbf{z}$ for some scalar $\gamma$ ; we call this proportional cICA, see Section 4.2. We explain how to use cICA for dimensionality reduction in Section 4.3. This projects data onto a subspace given by certain columns of the foreground mixing matrix $B$ .

4.1. General cICA

We present Algorithm 2 for general cICA. Steps 1 and 3 both decompose a symmetric order four tensor. We use the subspace power method [KP19] in Step 1 to prioritize the accuracy of the tensor decomposition. We use Algorithm 1 in Step 3 to prioritize interpretability and efficiency.

Algorithm 2 Recover $A$ and $B$ from the cumulants of the background and foreground

Input: $\kappa_{4}(\mathbf{x}),\kappa_{4}(\mathbf{y})$ and $r,\ell$ as in (4).

1: Recover $A$: Compute the symmetric tensor decomposition of $\kappa_{4}(\mathbf{y})$ via the subspace power method [KP19]. This recovers $A$ up to permutation and scaling of columns.

2: Subtract background from $\kappa_{4}(\mathbf{x})$: Learn the coefficients $\lambda_{i}^{\prime}$ of $\mathbf{a}_{1}^{\otimes 4},\ldots,\mathbf{a}_{r}^{\otimes 4}$ in $\kappa_{4}(\mathbf{x})$ using the deflation step of the subspace power method.

3: Recover $B$: Compute the symmetric tensor decomposition of $\sum_{i=1}^{\ell}\nu_{i}\mathbf{b}_{i}^{\otimes 4}=\kappa_{4}(\mathbf{x})-\sum_{i=1}^{r}\lambda_{i}^{\prime}\mathbf{a}_{i}^{\otimes 4}$, using Algorithm 1.

Output: Mixing matrices $A$ and $B$.

We study the identifiability of the algorithm, that is, the uniqueness of the vectors and scalars it outputs, assuming genericity. Our genericity assumption holds almost surely in the space of parameters.

We use the following lemma.

Lemma 4.1.

Let vectors $\mathbf{a}_{i}\in\mathbb{R}^{p}$ and scalars $\lambda_{i}\in\mathbb{R}$ be generic. Then the decomposition $T=\sum_{i=1}^{q}\lambda_{i}\mathbf{a}_{i}^{\otimes 4}$ of a symmetric $p\times p\times p\times p$ tensor $T$ is unique for

q\leq\begin{cases}\lceil\frac{1}{p}{p+3\choose 4}-1\rceil&\text{for }p\notin\{3,4,5\},\\ \lceil\frac{1}{p}{p+3\choose 4}\rceil&\text{for }p\in\{3,5\},\\ 9&\text{for }p=4,\text{ provided $q\neq 8$.}\end{cases}

Proof.

The rank of a generic $p\times p\times p\times p$ symmetric tensor is $\lceil\frac{1}{p}{p+3\choose 4}\rceil$ for $p\notin\{3,4,5\}$ and $\lceil\frac{1}{p}{p+3\choose 4}\rceil+1$ for $p\in\{3,4,5\}$, by the Alexander-Hirschowitz theorem [JA95]. Generic rank $q$ tensors in this space, with $q$ strictly below the generic rank, have a unique symmetric tensor decomposition for $(p,q)\neq(4,8)$ and two tensor decompositions for $(p,q)=(4,8)$, by [COV17, Theorem 1.1]. ∎
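For concreteness, the bound in Lemma 4.1 can be tabulated for small $p$. The sketch below only transcribes the three cases of the lemma; the function names `max_identifiable_rank` and `is_identifiable_rank` are illustrative and do not come from the cICA code.

```python
from math import comb

def max_identifiable_rank(p):
    """Largest q allowed by Lemma 4.1 for a symmetric p x p x p x p tensor."""
    ceil_val = -(-comb(p + 3, 4) // p)   # ceil(C(p+3, 4) / p) in integer arithmetic
    if p == 4:
        return 9                          # the lemma additionally excludes q = 8
    if p in (3, 5):
        return ceil_val
    return ceil_val - 1

def is_identifiable_rank(p, q):
    """True if (p, q) satisfies the hypotheses of Lemma 4.1."""
    return 1 <= q <= max_identifiable_rank(p) and not (p == 4 and q == 8)

for p in range(2, 9):
    print(p, max_identifiable_rank(p))
```

In Proposition 4.2 the relevant value is $q=r+\ell$, so `is_identifiable_rank(p, r + ell)` checks its hypothesis.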

Proposition 4.2 (Identifiability of the cICA tensor decomposition).

The joint decomposition

(9)

\kappa_{4}(\mathbf{y})=\sum_{i=1}^{r}\lambda_{i}\mathbf{a}_{i}^{\otimes 4},\qquad\quad\kappa_{4}(\mathbf{x})=\sum_{i=1}^{r}\lambda_{i}^{\prime}\mathbf{a}_{i}^{\otimes 4}+\sum_{j=1}^{\ell}\nu_{j}\mathbf{b}_{j}^{\otimes 4},

is unique for generic $\mathbf{a}_{i},\mathbf{b}_{j},\lambda_{i},\lambda_{i}^{\prime},\nu_{j}$ , where $i\in[r]$ and $j\in[\ell]$ , when $r+\ell<\lceil\frac{1}{p}{p+3\choose 4}\rceil$ for $p\neq 3,4,5$ , $r+\ell\leq\lceil\frac{1}{p}{p+3\choose 4}\rceil$ for $p=3,5$ , and when $r+\ell\leq 9,r+\ell\neq 8$ for $p=4$ .

Proof.

The cICA tensor decomposition in the statement is identifiable when the symmetric tensor decomposition of $\kappa_{4}(\mathbf{x})$ is unique, as follows. The tensor decomposition of $\kappa_{4}(\mathbf{x})$ gives vectors $\mathbf{a}_{i},\mathbf{b}_{j}$ up to permutation and scaling. Then we can solve a linear system to find the decomposition $\kappa_{4}(\mathbf{y})=\sum_{i=1}^{r}\lambda_{i}\mathbf{a}_{i}^{\otimes 4}$. It therefore remains to study the identifiability of the decomposition of $\kappa_{4}(\mathbf{x})$. It is a symmetric $p\times p\times p\times p$ tensor of rank $r+\ell$. Hence the uniqueness follows from Lemma 4.1, setting $q=r+\ell$. ∎

We say that Algorithm 2 is identifiable if, for generic $\mathbf{a}_{i},\mathbf{b}_{j},\lambda_{i},\lambda_{i}^{\prime},\nu_{j}$ where $i\in[r]$ , $j\in[\ell]$ , we can uniquely recover the vectors $\mathbf{a}_{1},\ldots,\mathbf{a}_{r}$ , the coefficients $\lambda_{1}^{\prime},\ldots,\lambda_{r}^{\prime}$ , and the vectors $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ .

Proposition 4.3.

Algorithm 2 is identifiable when $r+\ell\leq{p+1\choose 2}$ for $p\neq 4$ and $r+\ell\leq 9,r,\ell\neq 8$ for $p=4$ .

To prove Proposition 4.3 and, later, Theorem 4.5, we use the following linear algebra result. See [KP19, Lemma B.1] for a proof.

Lemma 4.4.

Let $M\in\mathbb{R}^{n\times n},U\in\mathbb{R}^{n\times k}$ and $V\in\mathbb{R}^{n\times k}$ be full rank matrices with $k\leq n$ . Let $C^{\ast}=(V^{\mathsf{T}}M^{-1}U)^{\dagger}$ , where $\dagger$ denotes the pseudo-inverse, and $d=\operatorname{rank}(C^{\ast})$ . Then

\operatorname{rank}(M-UCV^{\mathsf{T}})\geq n-d,

with equality if and only if $C=C^{\ast}$ .

Proof of Proposition 4.3.

Tensors $\sum_{i=1}^{r}\lambda_{i}\mathbf{a}_{i}^{\otimes 4}$ and $\sum_{j=1}^{\ell}\nu_{j}\mathbf{b}_{j}^{\otimes 4}$ are generic rank $r$ and rank $\ell$ tensors, respectively. So, the identifiability of Steps 1 and 3 of Algorithm 2 holds if $r,\ell<\lceil\frac{1}{p}{p+3\choose 4}\rceil$ for $p\notin\{3,4,5\}$ or $r,\ell\leq\lceil\frac{1}{p}{p+3\choose 4}\rceil$ for $p\in\{3,5\}$ or $r,\ell\leq 9,r,\ell\neq 8$ for $p=4$, setting $q=r$ and $q=\ell$ in Lemma 4.1.

It remains to consider Step 2, learning the coefficients $\lambda_{i}^{\prime}$ of $\mathbf{a}_{i}^{\otimes 4}$ in $\kappa_{4}(\mathbf{x})$. The flattening of $\kappa_{4}(\mathbf{x})$ has the form $M=\sum_{i=1}^{r}\lambda_{i}^{\prime}\mathbf{A}_{i}^{\otimes 2}+\sum_{j=1}^{\ell}\nu_{j}\mathbf{B}_{j}^{\otimes 2}\in\mathbb{R}^{p^{2}\times p^{2}}$, where $\mathbf{A}_{i},\mathbf{B}_{j}\in\mathbb{R}^{p^{2}}$ vectorize $\mathbf{a}_{i}^{\otimes 2}$ and $\mathbf{b}_{j}^{\otimes 2}$, respectively. The scalar $\lambda_{i}^{\prime}$ is unique if $\operatorname{rank}(M-\lambda_{i}^{\prime}\mathbf{A}_{i}\otimes\mathbf{A}_{i})=\operatorname{rank}(M)-1$, by Lemma 4.4. It is $((\mathbf{A}_{i}^{\mathsf{T}}V)D^{-1}(\mathbf{A}_{i}^{\mathsf{T}}V)^{\mathsf{T}})^{-1}$, where $VDV^{\mathsf{T}}$ is the thin eigendecomposition of $M$. In particular, the coefficient $\lambda_{i}^{\prime}$ is unique when

\mathbf{a}_{i}^{\otimes 2}\notin\operatorname{Span}(\{\mathbf{a}_{1}^{\otimes 2},\ldots,\mathbf{a}_{i-1}^{\otimes 2},\mathbf{a}_{i+1}^{\otimes 2},\ldots,\mathbf{a}_{r}^{\otimes 2},\mathbf{b}_{1}^{\otimes 2},\ldots,\mathbf{b}_{\ell}^{\otimes 2}\}).

For generic $\mathbf{a}_{i}$ and $\mathbf{b}_{j}$ , this holds provided $r+\ell$ is at most ${p+1\choose 2}$ , the dimension of the space of $p\times p$ symmetric matrices. Inequalities ${p+1\choose 2}\leq\lceil\frac{1}{p}{p+3\choose 4}\rceil$ for $p\notin\{3,4,5\}$ and ${p+1\choose 2}\leq\lceil\frac{1}{p}{p+3\choose 4}\rceil+1$ for $p\in\{3,4,5\}$ hold. Combining the above conditions, Algorithm 2 is identifiable when $r+\ell\leq{p+1\choose 2}$ for $p\neq 4$ and $r+\ell\leq 9,r,\ell\neq 8$ for $p=4$ . ∎
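The closed form for $\lambda_{i}^{\prime}$ used in this proof can be tried on exact synthetic cumulants. The following is a minimal sketch, assuming the flattening $M$ is built directly from randomly drawn $\mathbf{a}_{i},\mathbf{b}_{j}$ rather than estimated from data; the helper names `thin_eigh` and `coefficient_in_flattening` and the tolerance `rel_tol` are illustrative choices, not the deflation step of [KP19].

```python
import numpy as np

def thin_eigh(M, rel_tol=1e-8):
    """Thin eigendecomposition M = V D V^T, keeping numerically non-zero eigenvalues."""
    vals, vecs = np.linalg.eigh(M)
    keep = np.abs(vals) > rel_tol * np.abs(vals).max()
    return vecs[:, keep], vals[keep]

def coefficient_in_flattening(M, a):
    """Coefficient of a^{⊗4} in the tensor with flattening M: ((A^T V) D^{-1} (A^T V)^T)^{-1}."""
    A_vec = np.outer(a, a).ravel()        # vectorization of a ⊗ a in R^{p^2}
    V, d = thin_eigh(M)
    w = V.T @ A_vec
    return 1.0 / np.sum(w**2 / d)

rng = np.random.default_rng(1)
p, r, ell = 4, 2, 3                       # r + ell = 5 <= binom(p + 1, 2) = 10
a = rng.standard_normal((r, p))
b = rng.standard_normal((ell, p))
lam_prime = np.array([1.5, -0.7])         # coefficients of a_i^{⊗4} in kappa_4(x)
nu = np.array([2.0, 1.0, -1.2])           # coefficients of b_j^{⊗4} in kappa_4(x)

# Flattening M = sum_i lam'_i A_i A_i^T + sum_j nu_j B_j B_j^T.
vecs = [np.outer(v, v).ravel() for v in list(a) + list(b)]
coeffs = np.concatenate([lam_prime, nu])
M = sum(c * np.outer(v, v) for c, v in zip(coeffs, vecs))

recovered = np.array([coefficient_in_flattening(M, a_i) for a_i in a])
print(recovered)
```

The recovered values should agree with `lam_prime` up to numerical error, because the vectorized matrices $\mathbf{A}_{i},\mathbf{B}_{j}$ are linearly independent for generic inputs in this range of $r+\ell$.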

In some settings, we assume that the vectors $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ are orthogonal. In particular, $\ell\leq p$ . This assumption is natural for visualization purposes, since the projection onto foreground patterns is orthogonal. In this case, HTD gives an exact decomposition, by Proposition 3.3. The identifiability requirements are the same as in Propositions 4.2 and 4.3, as follows. The identifiability conditions in the two propositions are unchanged under a change of basis by an invertible $p\times p$ matrix. When $\ell\leq p$ , we can apply a change of basis to $\kappa_{4}(\mathbf{x})$ so that the vectors $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ become orthogonal. We apply the same change of basis to $\kappa_{4}(\mathbf{y})$ .

4.2. Proportional cICA

In this subsection, we assume a proportional relationship $\mathbf{z}^{\prime}=\gamma\mathbf{z}$ for some scalar $\gamma>0$ . This assumption also appears in cPCA [AZBZ17]. There, the choice of the hyperparameter $\gamma$ is not unique. However, in our setting—which involves the fourth-order cumulants $\kappa_{4}(\mathbf{y})$ and $\kappa_{4}(\mathbf{x})$ , under the assumption that $r+\ell\leq{p+1\choose 2}$ —the value of $\gamma$ is uniquely determined, with a closed-form expression, see Theorem 4.5. The details of the ensuing algorithm for computing matrix $B$ are as follows.

Algorithm 3 Recover $B$ from the background and foreground cumulants when $\mathbf{z}^{\prime}=\gamma\mathbf{z}$

Input: $\kappa_{4}(\mathbf{x}),\kappa_{4}(\mathbf{y})$ and $\ell$ as in (4).

1: Compute $\gamma$ using Theorem 4.5.

2: Recover $B$: Compute the rank $\ell$ symmetric decomposition of $\kappa_{4}(\mathbf{x})-\gamma^{4}\kappa_{4}(\mathbf{y})$, using Algorithm 1.

Output: Mixing matrix $B$.

Theorem 4.5.

Consider proportional cICA with $\mathbf{z}^{\prime}=\gamma\mathbf{z}$, for $\gamma>0$. For generic $\mathbf{a}_{1},\ldots,\mathbf{a}_{r}$ and $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ with $r+\ell\leq{p+1\choose 2}$ and $r\neq 8$, the hyperparameter $\gamma$ is the unique value $(\frac{1}{\lambda_{i}}(\mathbf{A}_{i}^{\mathsf{T}}VD^{-1}V^{\mathsf{T}}\mathbf{A}_{i})^{-1})^{\frac{1}{4}}$, where $i$ is any index between $1$ and $r$, $\lambda_{i}$ is the coefficient of $\mathbf{a}_{i}^{\otimes 4}$ in $\kappa_{4}(\mathbf{y})$, $\mathbf{A}_{i}\in\mathbb{R}^{p^{2}}$ vectorizes $\mathbf{a}_{i}^{\otimes 2}$, and $VDV^{\mathsf{T}}$ is the thin eigendecomposition of $\operatorname{Mat}(\kappa_{4}(\mathbf{x}))$.

Proof.

The flattenings of the cumulants $\kappa_{4}(\mathbf{y})$ and $\kappa_{4}(\mathbf{x})$ are, respectively,

M_{\mathbf{y}}:=\sum_{i=1}^{r}\lambda_{i}\mathbf{A}_{i}^{\otimes 2},\qquad M_{\mathbf{x}}:=\gamma^{4}\left(\sum_{i=1}^{r}\lambda_{i}\mathbf{A}_{i}^{\otimes 2}\right)+\sum_{j=1}^{\ell}\nu_{j}\mathbf{B}_{j}^{\otimes 2},

where $\mathbf{A}_{i},\mathbf{B}_{j}\in\mathbb{R}^{p^{2}}$ vectorize the matrices $\mathbf{a}_{i}^{\otimes 2}$ and $\mathbf{b}_{j}^{\otimes 2}$, respectively, and we use that $\lambda_{i}^{\prime}=\gamma^{4}\lambda_{i}$. We have $\operatorname{rank}M_{\mathbf{y}}=r$ and $\operatorname{rank}M_{\mathbf{x}}=r+\ell$, by the assumptions in the statement.

Let $A\in\mathbb{R}^{p^{2}\times r}$ be the matrix with columns $\mathbf{A}_{1},\ldots,\mathbf{A}_{r}$ and define $D^{\prime}=\gamma^{4}\text{Diag}(\lambda_{1},\ldots,\lambda_{r})$. We have $\operatorname{rank}(M_{\mathbf{x}}-AD^{\prime}A^{\mathsf{T}})=\operatorname{rank}(\sum_{j=1}^{\ell}\nu_{j}\mathbf{B}_{j}^{\otimes 2})=\ell$. Suppose that $VDV^{\mathsf{T}}$ is the thin eigendecomposition of $M_{\mathbf{x}}$. We have

V^{\mathsf{T}}(M_{\mathbf{x}}-AD^{\prime}A^{\mathsf{T}})V=D-(V^{\mathsf{T}}A)D^{\prime}(V^{\mathsf{T}}A)^{\mathsf{T}}.

We have that $\operatorname{rank}D=r+\ell$, the upper bound $\operatorname{rank}(V^{\mathsf{T}}A)D^{\prime}(V^{\mathsf{T}}A)^{\mathsf{T}}=\operatorname{rank}V^{\mathsf{T}}M_{\mathbf{y}}V\leq r$, and finally that $\operatorname{rank}(D-(V^{\mathsf{T}}A)D^{\prime}(V^{\mathsf{T}}A)^{\mathsf{T}})=\operatorname{rank}(V^{\mathsf{T}}(M_{\mathbf{x}}-AD^{\prime}A^{\mathsf{T}})V)\leq\ell$. Hence

D^{\prime}=(A^{\mathsf{T}}VD^{-1}V^{\mathsf{T}}A)^{-1},

by Lemma 4.4. Matrices $A,\text{Diag}(\lambda_{1},\ldots,\lambda_{r}),V,D$ can be recovered uniquely from the tensor decomposition of $\kappa_{4}(\mathbf{y})$ and the eigendecomposition of $M_{\mathbf{x}}$. So $D^{\prime}$ can also be recovered uniquely, and hence $\gamma$ is unique: it satisfies $\gamma^{4}\lambda_{i}=(\mathbf{A}_{i}^{\mathsf{T}}VD^{-1}V^{\mathsf{T}}\mathbf{A}_{i})^{-1}$ for any $i\in[r]$. ∎

One can test the proportionality assumption by seeing whether the values $\left(\frac{1}{\lambda_{i}}(\mathbf{A}_{i}^{\mathsf{T}}VD^{-1}V^{\mathsf{T}}\mathbf{A}_{i})^{-1}\right)^{\frac{1}{4}}$ from Theorem 4.5, where $\mathbf{A}_{i}$ vectorizes $\mathbf{a}_{i}^{\otimes 2}$, are approximately equal as $i$ varies. In practice, exact proportionality may not hold, and learning $\gamma$ via Theorem 4.5 could be challenging. An alternative is to use a sweep of $\gamma$ values and choose $\gamma$ according to visualization plots, a similar method to that used in cPCA [AZBZ17].
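The proportionality test described above can be sketched on exact synthetic cumulants. The code assumes that the background vectors $\mathbf{a}_{i}$ and coefficients $\lambda_{i}$ are already known (in practice they would come from decomposing $\kappa_{4}(\mathbf{y})$), and the names `vec2`, `gamma_estimate` and `spread` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
p, r, ell, gamma = 4, 3, 2, 1.3

a = rng.standard_normal((r, p))
b = rng.standard_normal((ell, p))
lam = np.array([1.0, -2.0, 0.5])      # coefficients of a_i^{⊗4} in kappa_4(y)
nu = np.array([1.5, -0.8])            # coefficients of b_j^{⊗4} in kappa_4(x)

def vec2(v):
    """Vectorization of v ⊗ v in R^{p^2}."""
    return np.outer(v, v).ravel()

# Flattenings M_y and M_x of the two cumulants under z' = gamma * z.
M_y = sum(l * np.outer(vec2(v), vec2(v)) for l, v in zip(lam, a))
M_x = gamma**4 * M_y + sum(n * np.outer(vec2(v), vec2(v)) for n, v in zip(nu, b))

# Thin eigendecomposition of M_x (relative tolerance is an assumption of this sketch).
vals, vecs = np.linalg.eigh(M_x)
keep = np.abs(vals) > 1e-8 * np.abs(vals).max()
V, d = vecs[:, keep], vals[keep]

def gamma_estimate(a_i, lam_i):
    """Value ((1 / lam_i) (A_i^T V D^{-1} V^T A_i)^{-1})^{1/4} for one index i."""
    w = V.T @ vec2(a_i)
    lam_prime = 1.0 / np.sum(w**2 / d)
    return (lam_prime / lam_i) ** 0.25

estimates = np.array([gamma_estimate(a_i, l) for a_i, l in zip(a, lam)])
spread = estimates.max() - estimates.min()
print(estimates, spread)
```

A small `spread` is consistent with the proportional model; with sample cumulants, the estimates will differ across $i$ and a tolerance has to be chosen by the user.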

4.3. cICA for dimensionality reduction

Usual ICA has been used as a tool to project data, see [Dom18, GW20, LM08]. We extend this to cICA. In practice, the input to cICA consists of samples from the foreground $\mathbf{x}$ and background $\mathbf{y}$ . These samples comprise the foreground data $X\in\mathbb{R}^{n\times p}$ and the background data $Y\in\mathbb{R}^{m\times p}$ , where $n$ and $m$ are the number of samples in the foreground and background datasets respectively. We then construct the sample cumulants $\kappa_{4}(\mathbf{x})$ and $\kappa_{4}(\mathbf{y})$ as follows.

A dataset of $n$ samples in $\mathbb{R}^{p}$ gives a data matrix $X\in\mathbb{R}^{n\times p}$. Its fourth cumulant is computed as follows. Let $\bar{X}\in\mathbb{R}^{p}$ denote the mean vector over all observations. The $p\times p$ sample covariance matrix $\Sigma$ for $X$ has entries $\sigma_{ij}=\frac{1}{n}\sum_{t=1}^{n}(X_{ti}-\bar{X}_{i})(X_{tj}-\bar{X}_{j})$. The fourth-order central sample moment is a $p\times p\times p\times p$ tensor with entries $M_{ijkl}=\frac{1}{n}\sum_{t=1}^{n}(X_{ti}-\bar{X}_{i})(X_{tj}-\bar{X}_{j})(X_{tk}-\bar{X}_{k})(X_{tl}-\bar{X}_{l}).$ Entry $(i,j,k,l)$ of the fourth-order sample cumulant is $M_{ijkl}-\sigma_{ij}\sigma_{kl}-\sigma_{ik}\sigma_{jl}-\sigma_{il}\sigma_{jk}.$ If the data $X$ are samples from a distribution $\mathbf{x}$, this sample cumulant approximates $\kappa_{4}(\mathbf{x})$. The computation for $\kappa_{4}(\mathbf{y})$ is similar.
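The formulas above translate directly into array operations. The following is a minimal NumPy sketch; the function name `fourth_cumulant` is illustrative, and the dense `einsum` is practical only for moderate $p$, since the output has $p^{4}$ entries.

```python
import numpy as np

def fourth_cumulant(X):
    """Fourth-order sample cumulant of an n x p data matrix, as a p x p x p x p array."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                       # subtract the mean vector
    sigma = Xc.T @ Xc / n                         # sample covariance with 1/n normalization
    # Fourth-order central sample moment M_{ijkl}.
    M4 = np.einsum("ti,tj,tk,tl->ijkl", Xc, Xc, Xc, Xc, optimize=True) / n
    # Subtract the three pairings of covariance entries.
    return (M4
            - np.einsum("ij,kl->ijkl", sigma, sigma)
            - np.einsum("ik,jl->ijkl", sigma, sigma)
            - np.einsum("il,jk->ijkl", sigma, sigma))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3)) ** 3            # non-Gaussian toy data
K = fourth_cumulant(X)
print(K.shape)
```

The same function applied to the background data matrix $Y$ gives the sample version of $\kappa_{4}(\mathbf{y})$.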

When $p$ is large, forming the fourth cumulants may be prohibitively expensive. To get around this, one can reduce the dimension before forming the cumulants, as follows.

We combine the foreground and background datasets together to form a single dataset, a matrix of size $(m+n)\times p$ . Let $U\in\mathbb{R}^{p\times k}$ have as its columns the top $k$ principal components of this combined data. The background and foreground transformed variables are then

(10)

U^{\mathsf{T}}A\mathbf{z}\qquad\text{and}\qquad U^{\mathsf{T}}A\mathbf{z}^{\prime}+U^{\mathsf{T}}B\mathbf{s},

respectively, where $U^{\mathsf{T}}A\in\mathbb{R}^{k\times r}$ and $U^{\mathsf{T}}B\in\mathbb{R}^{k\times\ell}$ . The recovered foreground patterns from cICA are the columns of $U^{\mathsf{T}}B$ . The columns of $UU^{\mathsf{T}}B\in\mathbb{R}^{p\times\ell}$ convert these projected foreground patterns back into the original space.

In practice, for our data visualization in Section 6.2, we choose the number $k$ of PCA components to be the smaller of 30 and the number of components needed to explain at least $90\%$ of the variance.
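The preprocessing step can be sketched as follows. This is an illustration under stated assumptions: the combined data are centered by their joint mean before computing principal components, and the names `choose_k` and `pca_reduce` are hypothetical rather than taken from the cICA repository.

```python
import numpy as np

def choose_k(explained_ratio, max_k=30, target=0.90):
    """Smaller of max_k and the number of components reaching the target variance."""
    cumulative = np.cumsum(explained_ratio)
    k_target = int(np.searchsorted(cumulative, target) + 1)
    return min(max_k, k_target, len(explained_ratio))

def pca_reduce(X, Y):
    """Project foreground X and background Y onto the top-k PCs of the combined data."""
    Z = np.vstack([X, Y])                          # (n + m) x p combined dataset
    mean = Z.mean(axis=0)
    _, s, Vt = np.linalg.svd(Z - mean, full_matrices=False)
    ratio = s**2 / np.sum(s**2)                    # explained variance ratios
    k = choose_k(ratio)
    U = Vt[:k].T                                   # p x k matrix of principal components
    return (X - mean) @ U, (Y - mean) @ U, U

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 40))
Y = rng.standard_normal((60, 40))
X_red, Y_red, U = pca_reduce(X, Y)
print(X_red.shape, Y_red.shape, U.shape)
```

Patterns recovered in the reduced space, stored as the columns of a $k\times\ell$ array `B_red`, map back to the original space as `U @ B_red`, matching the role of $UU^{\mathsf{T}}B$ above.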

We compute the mixing matrix $B\in\mathbb{R}^{p\times\ell}$ with columns $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ using Algorithm 2 or 3. When employing cICA for dimensionality reduction, we project the foreground data $X$ to $XB$. For a two-dimensional plot, we plot the projections $(X\mathbf{b}_{i},X\mathbf{b}_{j})$ for a pair $i,j$. To select the most relevant vectors out of our $\ell$ recovered vectors $\mathbf{b}_{i}\in\mathbb{R}^{p}$, we order them by the ratio

(11)

k(\mathbf{b}):=\frac{\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{x})\mathbf{b}}{\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{y})\mathbf{b}}.

We justify this ranking in Section 5.2. We interpret the axes of a cICA dimensionality reduction plot in Section 5.3.
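The ordering by the ratio (11) is a short computation once the two covariance matrices are available. The sketch below assumes that $\kappa_{2}(\mathbf{x})$ and $\kappa_{2}(\mathbf{y})$ are passed in as $p\times p$ arrays; the function name `rank_patterns` is illustrative.

```python
import numpy as np

def rank_patterns(B, cov_x, cov_y):
    """Order the columns of B by k(b) = (b^T cov_x b) / (b^T cov_y b), largest first."""
    num = np.sum(B * (cov_x @ B), axis=0)     # b_j^T cov_x b_j for each column j
    den = np.sum(B * (cov_y @ B), axis=0)     # b_j^T cov_y b_j for each column j
    ratios = num / den
    order = np.argsort(-ratios)
    return order, ratios

# Toy example: the first direction has much more foreground than background variance.
cov_x = np.diag([4.0, 1.0, 1.0])
cov_y = np.diag([1.0, 1.0, 2.0])
B = np.eye(3)
order, ratios = rank_patterns(B, cov_x, cov_y)
print(order, ratios)
```

With data matrices, one option is `np.cov(X, rowvar=False, bias=True)` and `np.cov(Y, rowvar=False, bias=True)` for the two covariance arguments; the two patterns plotted are then `B[:, order[:2]]`.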

5. Practicalities and interpretation of cICA

In this section, we discuss the practicalities of cICA: preprocessing the input to speed up the algorithm and how to choose the ranks $r$ and $\ell$ . We also discuss how to interpret coordinates when viewing cICA as a dimensionality reduction method.

5.1. Preprocessing with PCA

When $p$ is large, forming the fourth cumulants may be prohibitively expensive. To get around this, one can reduce the dimension before forming the cumulants, as follows. We combine the foreground and background datasets together to form a single dataset, a matrix of size $(m+n)\times p$. Let $U\in\mathbb{R}^{p\times k}$ have as its columns the top $k$ principal components of this combined data. The background and foreground transformed variables then have the form

(12)

U^{\mathsf{T}}A\mathbf{z}\qquad\text{and}\qquad U^{\mathsf{T}}A\mathbf{z}^{\prime}+U^{\mathsf{T}}B\mathbf{s},

respectively. In practice, for our data visualization in Section 6.2, we choose the number $k$ of PCA components to be the smaller of 30 and the number of components needed to explain at least $90\%$ of the variance.

5.2. Choosing the ranks

When computing the tensor decompositions in cICA, a key step is to determine the ranks $r$ and $\ell$. To choose the ranks, we can use the flattenings of the cumulants, the matrices $\operatorname{Mat}(\kappa_{4}(\mathbf{x})),\operatorname{Mat}(\kappa_{4}(\mathbf{y}))\in\mathbb{R}^{p^{2}\times p^{2}}$. If the expressions for the cumulant tensors $\kappa_{4}(\mathbf{x})$ and $\kappa_{4}(\mathbf{y})$ in (4) hold exactly, and if $r+\ell\leq{p+1\choose 2}$ and the vectors $\mathbf{a}_{i},\mathbf{b}_{j}$ are generic, then

r=\text{rank}(\text{Mat}(\kappa_{4}(\mathbf{y})))\quad\text{and}\quad r+\ell=\text{rank}(\text{Mat}(\kappa_{4}(\mathbf{x}))).

For non-exact cumulants, such as sample cumulants, we do not work with the exact ranks of the flattening matrices, but instead examine plots of the eigenvalues in descending magnitude (see e.g. Figure 7) to choose an appropriate cut-off. We choose $r$ at the point where the decrease of the eigenvalue plot of $\mathrm{Mat}(\kappa_{4}(\mathbf{y}))$ slows down, choose $q$ at the point where the decrease of the eigenvalue plot of $\mathrm{Mat}(\kappa_{4}(\mathbf{x}))$ slows down, and calculate $\ell=q-r$. General cICA has hyperparameters $r$ and $\ell$; proportional cICA has one hyperparameter $\ell$.
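The rank selection can be sketched as follows for exact cumulants. The relative threshold `rel_tol` is an assumption of this sketch and replaces the visual inspection of the eigenvalue plot; for sample cumulants it would have to be chosen much larger, guided by the plot of `sorted_spectrum`. The function names are illustrative.

```python
import numpy as np

def sorted_spectrum(M):
    """Eigenvalue magnitudes of a symmetric flattening, in descending order."""
    return np.sort(np.abs(np.linalg.eigvalsh(M)))[::-1]

def numerical_rank(M, rel_tol=1e-8):
    """Number of eigenvalues whose magnitude exceeds rel_tol times the largest one."""
    s = sorted_spectrum(M)
    return int(np.sum(s > rel_tol * s[0]))

def vec2(v):
    """Vectorization of v ⊗ v in R^{p^2}."""
    return np.outer(v, v).ravel()

# Exact synthetic flattenings with r = 2 background and ell = 3 foreground patterns.
rng = np.random.default_rng(4)
p, r, ell = 4, 2, 3
A_vecs = [vec2(v) for v in rng.standard_normal((r, p))]
B_vecs = [vec2(v) for v in rng.standard_normal((ell, p))]
M_y = sum(np.outer(v, v) for v in A_vecs)
M_x = 2.0 * M_y + sum(np.outer(v, v) for v in B_vecs)

r_hat = numerical_rank(M_y)
q_hat = numerical_rank(M_x)
ell_hat = q_hat - r_hat
print(r_hat, q_hat, ell_hat)
```

Plotting `sorted_spectrum(M_y)` and `sorted_spectrum(M_x)` on a logarithmic axis gives the eigenvalue plots referred to above.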

We discuss how the results may be affected by an incorrect choice of $r$ and $\ell$ and justify our proposed way to order the foreground patterns $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ by importance in (11).

Let the true ranks be $r$ and $\ell$ and assume that we have used $r^{\prime}$ and $\ell^{\prime}$ in the input to Algorithm 2.

• If $\ell^{\prime}>\ell$, then $\ell^{\prime}-\ell$ foreground patterns are noise.

• If $\ell^{\prime}<\ell$, then $\ell-\ell^{\prime}$ foreground patterns are not recovered.

• If $r^{\prime}<r$, then background patterns are mixed with foreground patterns, as follows. Assuming without loss of generality that we have recovered $\mathbf{a}_{1},\ldots,\mathbf{a}_{r^{\prime}}$, the third step of Algorithm 2 decomposes the tensor $\sum_{i=r^{\prime}+1}^{r}\lambda_{i}^{\prime}\mathbf{a}_{i}^{\otimes 4}+\sum_{j=1}^{\ell}\nu_{j}\mathbf{b}_{j}^{\otimes 4}$ via HTD, as in Algorithm 1. If the orthogonality hypotheses of Proposition 3.3 hold, then the foreground patterns are recovered together with some background patterns that are incorrectly interpreted as foreground patterns. If the approximate orthogonality hypotheses of Theorem 3.4 hold, then the foreground patterns are recovered approximately, together with background patterns that are classed as foreground patterns. Without an orthogonality condition, the recovered foreground patterns $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ will be polluted but still roughly collinear to the true foreground patterns for small $r-r^{\prime}$, or when the dimension of the dataset is large, since random vectors in high dimension are almost orthogonal.

• If $r^{\prime}>r$, then foreground patterns are mixed with background noise, as follows. Some background patterns from Algorithm 2 will be noise, say $\mathbf{a}_{r+1}^{\prime},\ldots,\mathbf{a}_{r^{\prime}}^{\prime}$. Step 2 of Algorithm 2 computes the coefficients of the tensors $(\mathbf{a}_{r+1}^{\prime})^{\otimes 4},\ldots,(\mathbf{a}_{r^{\prime}}^{\prime})^{\otimes 4}$ in $\kappa_{4}(\mathbf{x})$, though they are not true rank one components of $\kappa_{4}(\mathbf{x})$. In Step 3, the tensor to be decomposed has the form $\sum_{i=1}^{r^{\prime}-r}\mu_{i}(\mathbf{a}_{r+i}^{\prime})^{\otimes 4}+\sum_{i=1}^{\ell}\nu_{i}\mathbf{b}_{i}^{\otimes 4}$ for some $\mu_{1},\ldots,\mu_{r^{\prime}-r}\in\mathbb{R}$. As in the case $r^{\prime}<r$, the foreground patterns can still be exactly or approximately recovered, under the hypotheses of Proposition 3.3 and Theorem 3.4 respectively, albeit with some background noise recovered as foreground patterns.

The above discussion shows that when $r^{\prime}\neq r$ , the vectors $\mathbf{b}_{1},\ldots,\mathbf{b}_{\ell}$ obtained from Algorithm 2 could represent foreground patterns, background patterns, or noise. We order the vectors according to (11). The denominator of (11) is the variance of the linearly transformed background dataset $Y\mathbf{b}$ . The numerator is that of the transformed dataset $X\mathbf{b}$ . Their ratio enables us to select the most relevant foreground patterns, as follows.

• If $\mathbf{b}$ is a foreground pattern, we expect $\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{y})\mathbf{b}$ to be small relative to $\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{x})\mathbf{b}$, hence a large $k(\mathbf{b})$.

• If $\mathbf{b}$ is a background pattern, we expect $\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{y})\mathbf{b}\approx\alpha\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{x})\mathbf{b}$ for some constant $\alpha$ and hence $k(\mathbf{b})\approx\alpha$.

• If $\mathbf{b}$ is foreground noise, we expect a small $\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{x})\mathbf{b}$, hence a small $k(\mathbf{b})$.

• If $\mathbf{b}$ is background noise, we expect a small $\mathbf{b}^{\mathsf{T}}\kappa_{2}(\mathbf{y})\mathbf{b}$, hence a large $k(\mathbf{b})$. To prevent the background noise showing up in the recovered foreground pattern, we require $r^{\prime}\leq r$.

In practice, we consider those patterns for which $k(\mathbf{b})$ exceeds a certain threshold, or take the patterns with the two highest values of $k(\mathbf{b})$ .

5.3. Visualization

We discuss how to interpret coordinates when using cICA for dimensionality reduction. The following proposition relates the projections $\mathbf{b}_{i}^{T}\mathbf{x}$ for $i\in[\ell]$ to the latent variables $s_{i}$ .

Proposition 5.1.

Consider the cICA model in (3). Suppose $\|\mathbf{b}_{i}\|=1$ for $i\in[\ell]$. Assume that, for some small $\epsilon>0$, we have $|\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle|<\epsilon$ and $|\langle\mathbf{b}_{i},\mathbf{a}_{k}\rangle|<\epsilon$ for $i\neq j\in[\ell]$, $k\in[r]$. Then, for each $i\in[\ell]$,

|s_{i}-\mathbf{b}_{i}^{T}\mathbf{x}|=(rC_{\mathbf{z}^{\prime}}+(\ell-1)C_{\mathbf{s}})\mathcal{O}(\epsilon),

where $C_{\mathbf{z}^{\prime}}$ and $C_{\mathbf{s}}$ are upper bounds on the magnitudes of random variables in $\mathbf{z}^{\prime}$ and $\mathbf{s}$ . In particular, $\mathbf{b}_{i}^{T}\mathbf{x}$ approximates the component $s_{i}$ with an error linear in $\epsilon$ .

Proof.

Recall from (3) that $\mathbf{x}=A\mathbf{z}^{\prime}+B\mathbf{s}$ . Hence

\mathbf{b}_{i}^{T}\mathbf{x}=(\mathbf{b}_{i}^{T}A)\mathbf{z}^{\prime}+(\mathbf{b}_{i}^{T}B)\mathbf{s}=\sum_{k=1}^{r}\langle\mathbf{b}_{i},\mathbf{a}_{k}\rangle z^{\prime}_{k}+\sum_{j=1,j\neq i}^{\ell}\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle s_{j}+s_{i}.

The almost orthogonality conditions of the proposition then imply that

|s_{i}-\mathbf{b}_{i}^{T}\mathbf{x}|\leq\sum_{k=1}^{r}|\langle\mathbf{b}_{i},\mathbf{a}_{k}\rangle||z^{\prime}_{k}|+\sum_{j=1,j\neq i}^{\ell}|\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle||s_{j}|\leq(rC_{\mathbf{z}^{\prime}}+(\ell-1)C_{\mathbf{s}})\epsilon.\qed
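The bound of Proposition 5.1 can be checked on simulated data. The following sketch makes several assumptions for illustration: the columns of $A$ and $B$ are a perturbed orthonormal system, the latent variables are uniform on $[-1,1]$ so that $C_{\mathbf{z}^{\prime}}=C_{\mathbf{s}}=1$, and `eps` is taken to be the largest inner product that actually occurs.

```python
import numpy as np

rng = np.random.default_rng(3)
p, r, ell, n = 20, 3, 4, 1000

# Nearly orthonormal columns: an orthonormal system plus a small perturbation, renormalized.
Q, _ = np.linalg.qr(rng.standard_normal((p, r + ell)))
W = Q + 0.01 * rng.standard_normal((p, r + ell))
W /= np.linalg.norm(W, axis=0)
A, B = W[:, :r], W[:, r:]

# eps: largest |<b_i, a_k>| or |<b_i, b_j>| with j != i.
G = np.abs(B.T @ W)
G[np.arange(ell), r + np.arange(ell)] = 0.0    # ignore <b_i, b_i> = 1
eps = G.max()

# Bounded latent variables in [-1, 1], so both magnitude bounds equal 1.
Zp = rng.uniform(-1, 1, size=(n, r))
S = rng.uniform(-1, 1, size=(n, ell))
X = Zp @ A.T + S @ B.T                         # each row is a sample x = A z' + B s

errors = np.abs(S - X @ B)                     # |s_i - b_i^T x| for every sample and i
bound = (r * 1.0 + (ell - 1) * 1.0) * eps
print(errors.max(), bound)
```

The largest observed error stays below `bound`, in line with the final inequality of the proof.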

The almost orthogonality conditions in Proposition 5.1 are strong requirements. However, they can be relaxed: if $|\langle\mathbf{b}_{i},\mathbf{b}_{j}\rangle|<\epsilon$ for chosen $i,j\in[\ell]$ and sources $s_{i}$ and $s_{j}$ have larger variance than $(\mathbf{b}_{i}^{\mathsf{T}}A)\mathbf{z}^{\prime}$ and $(\mathbf{b}_{j}^{\mathsf{T}}A)\mathbf{z}^{\prime}$, then plotting $X\mathbf{b}_{i}$ against $X\mathbf{b}_{j}$ still approximates the plot of $s_{i}$ against $s_{j}$.

If $(\mathbf{b}_{i}^{\mathsf{T}}A)\mathbf{z}^{\prime}$ and $(\mathbf{b}_{j}^{\mathsf{T}}A)\mathbf{z}^{\prime}$ are uncorrelated, we expect the plot of $X\mathbf{b}_{i}$ against $X\mathbf{b}_{j}$ to show axis-aligned clusters; otherwise, clusters may not be axis-aligned. We specify the condition for $(\mathbf{b}_{i}^{\mathsf{T}}A)\mathbf{z}^{\prime}$ and $(\mathbf{b}_{j}^{\mathsf{T}}A)\mathbf{z}^{\prime}$ to be uncorrelated, assuming that all variables in the tuple $\mathbf{z}^{\prime}$ have the same variance.

Proposition 5.2.

Consider the cICA model in (3). Suppose that $\mathbf{z}^{\prime}$ is a tuple of independent random variables with the same variance. Then $(\mathbf{b}_{i}^{\mathsf{T}}A)\mathbf{z}^{\prime}$ and $(\mathbf{b}_{j}^{\mathsf{T}}A)\mathbf{z}^{\prime}$ are uncorrelated if and only if $\langle\mathbf{b}_{i}^{\mathsf{T}}A,\mathbf{b}_{j}^{\mathsf{T}}A\rangle=0$.

Proof.

Write $\mathbf{u}=\mathbf{b}_{i}^{\mathsf{T}}A$ and $\mathbf{v}=\mathbf{b}_{j}^{\mathsf{T}}A$. By the bilinearity of the covariance and the independence of the entries of $\mathbf{z}^{\prime}$,

\mathrm{Cov}(\mathbf{u}\mathbf{z}^{\prime},\mathbf{v}\mathbf{z}^{\prime})=\sum_{1\leq k,m\leq r}u_{k}v_{m}\mathrm{Cov}(z^{\prime}_{k},z^{\prime}_{m})=\sum_{1\leq k\leq r}u_{k}v_{k}\mathrm{Var}(z^{\prime}_{k})=\mathrm{Var}(z^{\prime}_{1})\sum_{1\leq k\leq r}u_{k}v_{k}.

The last expression is zero if and only if $\langle\mathbf{u},\mathbf{v}\rangle=0$ . ∎

6. Numerical experiments

We investigate the performance of cICA for finding patterns in data (Section 6.1) and for data visualization (Section 6.2). Our code is available on GitHub at https://github.com/QWE123665/cICA.

6.1. Finding patterns

The cICA patterns are the foreground vectors $\mathbf{b}_{i}$ . We show that cICA recovers these vectors accurately for synthetic data, comparing it with cPCA [AZBZ17] and PCPCA [LJE20]. We also apply cICA to gene expression data from [SCJ⁺23]. Taking monkey gene expression as the background and human gene expression as the foreground, we relate the cICA patterns to existing results to identify genes responsible for human evolution.

6.1.1. Synthetic data

We use synthetic data to assess the accuracy of the patterns recovered by cICA, both for general cICA (Algorithm 2) and proportional cICA (Algorithm 3). We compare against cPCA and PCPCA, illustrating that the cICA algorithms recover the foreground patterns more accurately when the data are generated under the model (3), which assumes independence of the latent variables; see Figure 2. The details of the simulations are in Appendix A.1.

Figure 2. The similarity of the recovered vs. true foreground patterns (i.e. the accuracy of recovering matrix $B$ ), measured via cosine similarity in (a) and (c) and relative Frobenius error in (b) and (d). The $x$ -axis is the number of variables $p$ , which ranges from 4 to 12. Plots (a) and (b) refer to cICA in Algorithm 2. The interquartile range over 100 runs is shaded in red, with the best run shown as the red line. Plots (c) and (d) refer to proportional cICA in Algorithm 3, which is deterministic. For cPCA and PCPCA, we test 100 hyperparameter values and plot the one with lowest error.

We see from Figure 2 that cICA outperforms cPCA and PCPCA in recovering the foreground patterns. Figure 2(a) shows that the interquartile range for cICA in Algorithm 2 is above the maximum cosine similarity results for cPCA and PCPCA. The best performing cICA has cosine similarity above 0.9 for all tested $p$ . Figure 2(b) shows analogous results with accuracy measured via relative Frobenius error. The variability in performance as $p$ changes is due to randomness in the matrix $A$ . Figures 2(c) and (d) show analogous results for proportional cICA from Algorithm 3, with hyperparameter $\gamma$ learned from Theorem 4.5. The method outperforms cPCA and PCPCA, with the added benefit that no selection of hyperparameters is necessary.

6.1.2. Human and monkey gene expression data

We apply cICA to a dataset of human and monkey gene expression from [SCJ⁺23], in which the authors analyse human, chimp, gorilla, macaque, and marmoset datasets to identify genes that are responsible for evolutionary change. Out of 14131 genes, they identify 3383 genes with extensive differences between human and non-human primates, of which they identify a subset of 139 with deeply conserved co-expression across all non-human animals, and strongly divergent co-expression relationships in humans.

We select the 15 most variant genes among the 139 selected genes and the 15 most variant genes among the other $3244=3383-139$ genes. We combine 10000 chimp and 10000 gorilla data points to form the background dataset $Y\in\mathbb{R}^{20000\times 30}$ and 10000 human gene expression data points for the foreground dataset $X\in\mathbb{R}^{10000\times 30}$ . Then we apply cICA as in Algorithm 2 and use (11) to order the $\mathbf{b}_{i}$ and extract the first two vectors $\mathbf{b}_{1},\mathbf{b}_{2}\in\mathbb{R}^{30}$ . We observe that the 15 genes with highest absolute values in $\mathbf{b}_{1}$ (resp. $\mathbf{b}_{2}$ ) have 10 (resp. 13) genes among the 15 selected genes that come from the subset of 139 in [SCJ⁺23]. This demonstrates consistency with the results from [SCJ⁺23]: the vectors $\mathbf{b}_{i}$ assign higher weights to the genes from the subset of 139. For details, see Appendix A.2.

6.2. Data visualization

We use cICA for dimensionality reduction and data visualization, as described in Section 4.3. We investigate the performance on three datasets: mouse protein expression, corrupted MNIST images, and gene expression, the same data studied in the papers [AZBZ17, LJE20]. We quantify the performance of the methods, using the silhouette score [Rou87] of the projected data; higher values indicate better clustering of points.
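For reference, the silhouette score of a labelled point cloud can be computed as in the following sketch. It is a plain NumPy illustration of the standard definition from [Rou87] with Euclidean distances, not the code used for the experiments; the function name `silhouette` and the toy points are assumptions made for the example.

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distances (needs at least 2 clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise distance matrix
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other points of the same cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the points of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy projected data: two tight groups far apart.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette(pts, [0, 0, 1, 1])  # labels match the groups
bad = silhouette(pts, [0, 1, 0, 1])   # labels cut across the groups
```

A score near 1 means the labelled groups are compact and well separated in the projection, while a score near 0 or below means the groups overlap.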

6.2.1. Mouse protein data

We study the mouse protein dataset from [HGC15]. The foreground data measure protein expression in the cortex of mice subjected to shock therapy, some of which have Down syndrome. The background dataset consists of protein expression measurements from mice without Down syndrome that did not receive shock therapy. We compare general cICA, proportional cICA, cPCA, and PCPCA. All four algorithms separate the two clusters in the foreground data, corresponding to mice with Down syndrome and those without, though the projections differ. The general cICA algorithm has the highest silhouette score (0.606), followed by proportional cICA (0.604), then cPCA (0.421), and then PCPCA (0.220); see Figure 3. See Appendix A.3 for details.

6.2.2. Corrupted MNIST data

Next we explore the corrupted MNIST dataset from [AZBZ18]. The foreground data are the digits 0 and 1 from the MNIST dataset, superimposed with strength 0.25 onto 5000 randomly selected grass images from ImageNet. The background data are the 5000 grass images. Each image has size $28\times 28$ . The projections are shown in Figure 4. All four algorithms separate the foreground data into the digits 0 and 1. The cPCA algorithm has the highest silhouette score (0.546), followed by proportional cICA (0.508), general cICA (0.451) and PCPCA (0.009). See Appendix A.4 for details.

6.2.3. Transplant gene expression data

We study the single-cell RNA sequencing data from [ZTB⁺17]. The foreground data are gene expression measurements of bone marrow mononuclear cells from patients with acute myeloid leukemia before and after they received a stem-cell transplant; the background dataset contains gene expression measurements of healthy individuals. The projection plots of the four algorithms are shown in Figure 5. cPCA has the highest silhouette score (0.451), followed by proportional cICA (0.402), then general cICA (0.344), then PCPCA (0.164). See Appendix A.5 for details.

7. Summary

We have presented contrastive independent component analysis (cICA), a tool to explore patterns and visualize data in one setting relative to another. We designed algorithms for cICA based on a new hierarchical tensor decomposition that we introduce. We studied two variants: general and proportional cICA. The upside to general cICA is its higher expressivity: it is able to model background patterns that each contribute to the foreground in different relative amounts $\lambda_{i}^{\prime}/\lambda_{i}$ . The advantage of proportional cICA is that it is deterministic, based solely on recursive eigendecompositions. We used our algorithms to find contrastive patterns that describe a foreground dataset relative to a background, testing the results on synthetic and real-world datasets. We saw its potential to extract foreground patterns of interest and its competitiveness with other contrastive methods.

We investigated the identifiability of cICA, via the uniqueness of its associated coupled tensor decomposition, seeing identifiability improvements relative to cPCA and PCPCA. This echoes the improved identifiability of ICA over PCA: a general linear mixing can be recovered uniquely via ICA, whereas PCA requires an orthogonal mixing.

We conclude with two directions for further study. Our cICA model describes observations as a linear mixing of independent latent variables. Dropping the linearity assumption, we may seek patterns that have a nonlinear signature across the observed variables. This would combine the nonlinear contrastive methods of [AZ19, SGN19, WBWL22, LHH⁺24] with approaches to find interpretable patterns, generalizing our vectors $\mathbf{b}_{i}$ . Finally, dropping the independence assumption on the latent variables is also a promising direction for further study, which would connect cICA to other latent variable models such as those arising in causal disentanglement [YLC⁺21, SSBU23].

Acknowledgements

We thank Salil Bhate for helpful discussions. AM and AS were partially supported by the NSF (DMS-2306672 and DMR-2011754).

References

[AZ19] Abubakar Abid and James Zou. Contrastive variational autoencoder enhances salient features. arXiv preprint arXiv:1902.04601, 2019.
[AZBZ17] Abubakar Abid, Martin J Zhang, Vivek K Bagaria, and James Zou. Contrastive principal component analysis. arXiv preprint arXiv:1709.06716, 2017.
[AZBZ18] Abubakar Abid, Martin J Zhang, Vivek K Bagaria, and James Zou. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nature communications, 9(1):2134, 2018.
[CJ10] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press, 2010.
[Com94] Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
[COV17] Luca Chiantini, Giorgio Ottaviani, and Nick Vannieuwenhoven. On generic identifiability of symmetric tensors of subgeneric rank. Transactions of the American Mathematical Society, 369(6):4021–4042, 2017.
[CS93] Jean-François Cardoso and Antoine Souloumiac. Blind beamforming for non-Gaussian signals. In IEE proceedings F (radar and signal processing), volume 140, pages 362–370. IET, 1993.
[DLCC07] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55:2965–2973, 2007.
[DLDMV01] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. Independent component analysis and (simultaneous) third-order tensor diagonalization. IEEE Transactions on Signal Processing, 49(10):2262–2271, 2001.
[Dom18] Krzysztof Domino. The use of fourth order cumulant tensors to detect outlier features modelled by a t-student copula. arXiv preprint arXiv:1804.00541, 2018.
[EK04] J. Eriksson and V. Koivunen. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601–604, 2004.
[GW20] Xiurui Geng and Lei Wang. NPSA: Nonorthogonal principal skewness analysis. IEEE Transactions on Image Processing, 29:6396–6408, 2020.
[Hac12] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus, volume 42. Springer, 2012.
[HGC15] Clara Higuera, Katheleen J Gardiner, and Krzysztof J Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PloS one, 10(6):e0129126, 2015.
[HM16] Aapo Hyvarinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in neural information processing systems, 29, 2016.
[HST19] Aapo Hyvarinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, 2019.
[JA95] J. Alexander and A. Hirschowitz. Polynomial interpolation in several variables. Journal of Algebraic Geometry, 4(4), 1995.
[KP19] Joe Kileel and Joao M Pereira. Subspace power method for symmetric tensor decomposition and generalized PCA. arXiv preprint arXiv:1912.04007, 2019.
[LF22] Qi Lyu and Xiao Fu. On finite-sample identifiability of contrastive learning-based nonlinear independent component analysis. In International Conference on Machine Learning, pages 14582–14600. PMLR, 2022.
[LHH⁺24] Romain Lopez, Jan-Christian Huetter, Ehsan Hajiramezanali, Jonathan Pritchard, and Aviv Regev. Toward the identifiability of comparative deep generative models. arXiv preprint arXiv:2401.15903, 2024.
[LJE20] Didong Li, Andrew Jones, and Barbara Engelhardt. Probabilistic contrastive principal component analysis. arXiv preprint arXiv:2012.07977, 2020.
[LM08] Lek-Heng Lim and Jason Morton. Cumulant component analysis: a simultaneous generalization of PCA and ICA. CASTA2008, 18, 2008.
[McC18] Peter McCullagh. Tensor methods in statistics: Monographs on statistics and applied probability. Chapman and Hall/CRC, 2018.
[Rob16] Elina Robeva. Orthogonal decomposition of symmetric tensors. SIAM Journal on Matrix Analysis and Applications, 37(1):86–102, 2016.
[Rou87] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.
[SCJ⁺23] Hamsini Suresh, Megan Crow, Nikolas Jorstad, Rebecca Hodge, Ed Lein, Alexander Dobin, Trygve Bakken, and Jesse Gillis. Comparative single-cell transcriptomic analysis of primate brains highlights human-specific regulatory evolution. Nature Ecology & Evolution, 7(11):1930–1943, 2023.
[SGN19] Kristen A Severson, Soumya Ghosh, and Kenney Ng. Unsupervised learning with contrastive latent variable models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4862–4869, 2019.
[SRK09] Jussi Salmi, Andreas Richter, and Visa Koivunen. Sequential unfolding SVD for tensors with applications in array signal processing. IEEE Transactions on Signal Processing, 57(12):4719–4733, 2009.
[SSBU23] Chandler Squires, Anna Seigal, Salil S Bhate, and Caroline Uhler. Linear causal disentanglement via interventions. In International Conference on Machine Learning, pages 32540–32560. PMLR, 2023.
[SSDU24] Nils Sturma, Chandler Squires, Mathias Drton, and Caroline Uhler. Unpaired multi-domain causal representation learning. Advances in Neural Information Processing Systems, 36, 2024.
[WBWL22] Ethan Weinberger, Nicasia Beebe-Wang, and Su-In Lee. Moment matching deep contrastive latent variable models. arXiv preprint arXiv:2202.10560, 2022.
[Wey12] Hermann Weyl. Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.
[WS24] Kexin Wang and Anna Seigal. Identifiability of overcomplete independent component analysis. arXiv preprint arXiv:2401.14709, 2024.
[YLC⁺21] Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. CausalVAE: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9593–9602, 2021.
[ZHPA13] James Y Zou, Daniel J Hsu, David C Parkes, and Ryan P Adams. Contrastive learning using spectral methods. Advances in Neural Information Processing Systems, 26, 2013.
[ZTB⁺17] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):14049, 2017.

Appendix A Details of numerical experiments

All experiments are run on an Apple M2 Pro with 16 GB memory. Each run of each algorithm takes at most 1 minute.

A.1. Finding patterns from synthetic data

We describe the details of the synthetic data setup in Section 6.1.1 that produced Figure 2. We consider $p\in[4,12]$ . Our samples come from the distributions (3), where matrices $A\in\mathbb{R}^{p\times p}$ and $B\in\mathbb{R}^{p\times(p-1)}$ are random with unit vector columns, and the columns of $B$ are assumed to be orthogonal. We assume orthogonality of the columns of $B$ to facilitate comparison with the methods cPCA and PCPCA, which require this assumption.

For testing Algorithm 2 in Figure 2(a) and (b), the variables $s_{i}$ follow exponential distributions $\exp(\theta_{i})$ where $\theta_{i}=2$ when $i$ is odd and $\theta_{i}=1.5$ when $i$ is even. The variables $z_{i}$ and $z^{\prime}_{i}$ follow exponential distributions $\exp(\nu_{i}),\exp(\nu_{i}^{\prime})$ where $\nu_{i}=2,\nu_{i}^{\prime}=1$ when $i$ is odd and $\nu_{i}=1,\nu_{i}^{\prime}=2$ when $i$ is even. We generate $10^{5}$ datapoints for both the foreground and background data and apply cICA to the sample cumulant tensors. The cICA algorithm is randomized, owing to the subspace power method. We run it 100 times, obtaining 100 recovered foreground mixing matrices $B\in\mathbb{R}^{p\times(p-1)}$ .
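The sampling step for this setup might look as follows. This is a sketch under stated assumptions rather than the authors' script: it assumes that model (3) has the form $\mathbf{y}=A\mathbf{z}$ for the background and $\mathbf{x}=A\mathbf{z}^{\prime}+B\mathbf{s}$ for the foreground, and that the parameter of $\exp(\cdot)$ is a rate (NumPy takes the scale, which is its reciprocal). The helper names and the choice $p=6$ are illustrative.

```python
import numpy as np

def random_unit_columns(rng, rows, cols):
    """Random matrix whose columns are rescaled to unit norm."""
    M = rng.standard_normal((rows, cols))
    return M / np.linalg.norm(M, axis=0)

def random_orthonormal_columns(rng, rows, cols):
    """Random matrix with orthonormal columns, from a reduced QR factorization."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

def sample_cica(rng, A, B, theta, nu, nu_prime, n):
    """Draw n foreground and n background samples (rows) under the assumed model."""
    r, ell = A.shape[1], B.shape[1]
    # Assumption: theta, nu, nu_prime are rates, so the NumPy scale is 1 / rate.
    z = rng.exponential(1.0 / nu, size=(n, r))
    z_prime = rng.exponential(1.0 / nu_prime, size=(n, r))
    s = rng.exponential(1.0 / theta, size=(n, ell))
    Y = z @ A.T                  # background: y = A z
    X = z_prime @ A.T + s @ B.T  # foreground: x = A z' + B s
    return X, Y

rng = np.random.default_rng(0)
p, n = 6, 100_000
idx = np.arange(1, p + 1)                          # 1-based indices, as in the text
nu = np.where(idx % 2 == 1, 2.0, 1.0)
nu_prime = np.where(idx % 2 == 1, 1.0, 2.0)
theta = np.where(idx[: p - 1] % 2 == 1, 2.0, 1.5)

A = random_unit_columns(rng, p, p)
B = random_orthonormal_columns(rng, p, p - 1)
X, Y = sample_cica(rng, A, B, theta, nu, nu_prime, n)
```

The sample cumulant tensors of `X` and `Y` would then be passed to the decomposition algorithm.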

For testing Algorithm 3 in Figure 2(c) and (d), we let $z_{i},z^{\prime}_{i}$ follow exponential distributions $\exp(\nu_{i}),\exp(\nu_{i}^{\prime})$ where $\nu_{i}=\nu_{i}^{\prime}=1$ . We learn the hyperparameter $\gamma^{\prime}$ via Theorem 4.5. The true $\gamma^{\prime}$ is 1 and the recovered values of $\gamma^{\prime}$ all lie in the range $[0.94,1.08]$ .

We describe the implementation of the two methods we compare to. For cPCA [AZBZ17], we test 100 log-evenly spaced hyperparameters $\alpha$ between 0 and 1000 with $p-1$ components. Each run returns a matrix of size $p\times(p-1)$ , whose columns are contrastive principal components with norm 1. For PCPCA, we test 100 evenly spaced hyperparameters $\gamma$ between 0 and 0.9 and fix $p-1$ components. Each run returns a matrix of size $p\times(p-1)$ . We normalize the columns to unit norm, to compare PCPCA with the other algorithms.

Since the columns of $B$ that are recovered are only unique up to permutation and sign, we describe how to align the outputs. Let $B^{\prime}\in\mathbb{R}^{p\times(p-1)}$ be a recovered matrix. Rather than searching over all ways to match the columns of $B$ to those of $B^{\prime}$ , we use a greedy algorithm to approximate the matching, as follows. We fix the first column of $B$ , denoted $\mathbf{b}_{1}$ . We choose one of the columns of $B^{\prime}$ whose cosine similarity with $\mathbf{b}_{1}$ has largest absolute value. We set this to be the first column of $B^{\prime}$ , changing its sign if the cosine similarity is negative. Then we select among the remaining columns, the one with the largest absolute cosine similarity with $\mathbf{b}_{2}$ and set this as the second column of $B^{\prime}$ (again, changing the sign if the cosine similarity is negative). We continue until we reach the last column. Then we compute the relative Frobenius error and mean cosine similarity which are, respectively,

\sqrt{\sum_{i=1}^{p}\sum_{j=1}^{p-1}(b_{ij}-b_{ij}^{\prime})^{2}/(p-1)}\qquad\text{and}\qquad\frac{1}{p-1}\sum_{i=1}^{p-1}\langle\mathbf{b}_{i},\mathbf{b}_{i}^{\prime}\rangle.
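The greedy matching and the two error measures can be sketched as follows. The function names are illustrative and the code is not taken from the authors' repository; it assumes that $B$ and $B^{\prime}$ have the same shape and that their columns have unit norm, so that inner products of columns are cosine similarities.

```python
import numpy as np

def greedy_align(B_true, B_rec):
    """Reorder and re-sign the columns of B_rec to match B_true, one column at a time."""
    B_true = np.asarray(B_true, dtype=float)
    B_rec = np.asarray(B_rec, dtype=float)
    Bt = B_true / np.linalg.norm(B_true, axis=0)
    Br = B_rec / np.linalg.norm(B_rec, axis=0)
    cos = Bt.T @ Br  # cos[i, j]: cosine between true column i and recovered column j
    remaining = list(range(B_rec.shape[1]))
    aligned = np.empty_like(B_true)
    for i in range(B_true.shape[1]):
        # Pick the unused recovered column with the largest absolute cosine similarity.
        j = max(remaining, key=lambda c: abs(cos[i, c]))
        sign = -1.0 if cos[i, j] < 0 else 1.0
        aligned[:, i] = sign * B_rec[:, j]
        remaining.remove(j)
    return aligned

def relative_frobenius_error(B_true, B_aligned):
    """sqrt(sum_ij (b_ij - b'_ij)^2 / (p - 1)), with p - 1 the number of columns."""
    return float(np.sqrt(np.sum((B_true - B_aligned) ** 2) / B_true.shape[1]))

def mean_cosine_similarity(B_true, B_aligned):
    """Average inner product of matched columns (cosine similarity for unit columns)."""
    return float(np.mean(np.sum(B_true * B_aligned, axis=0)))

# Sanity check: a permuted and sign-flipped copy of B is aligned back to B.
rng = np.random.default_rng(1)
B, _ = np.linalg.qr(rng.standard_normal((5, 4)))
scrambled = B[:, [2, 0, 3, 1]] * np.array([1.0, -1.0, -1.0, 1.0])
aligned = greedy_align(B, scrambled)
err = relative_frobenius_error(B, aligned)
sim = mean_cosine_similarity(B, aligned)
```

The greedy pass uses on the order of $(p-1)^{2}$ comparisons instead of searching over all $(p-1)!$ matchings, at the cost of possibly missing the best matching when several recovered columns are close to the same true column.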

A.2. Finding patterns from gene expression data

We describe the patterns obtained from the comparison of human and monkey gene expression in Section 6.1.2. The selected 15 highest variance genes among the 139 selected genes in [SCJ⁺23] are EIF3K, NDUFA13, SARNP, MYL10, TAF9, PRCD, BBS5, MRPS14, RING1, AGPAT5, FLOT1, BTBD7, MASTL, KANK1, BDP1. The 15 highest variance genes among the remaining $3244=3383-139$ genes are LUC7L3, RBKS, RBM7, AP4S1, CLCN1, CLASP1, ADTRP, CNNM3, NDUFAF7, CNIH4, RPUSD2, NELFCD, RPP14, ROMO1, RNF181.

We use the plots of the eigenvalues of the flattenings of $\kappa_{4}(\mathbf{y}),\kappa_{4}(\mathbf{x})$ to choose $r=22$ and $\ell=46-22=24$ . The top two foreground patterns are:

	$\displaystyle\mathbf{b}_{1}^{\mathsf{T}}=[$	$\displaystyle-0.04,-0.041,-0.09,-0.051,-0.12,0.075,0.01,-0.004,0.002,0.007,$
		$\displaystyle-0.07,-0.061,0.95,0.192,-0.009,-0.007,-0.002,-0.001,-0.076,-0.042,$
		$\displaystyle-0.008,-0.04,0.005,-0.058,0.012,-0.012,-0.05,-0.006,-0.046,-0.005]$
	$\displaystyle\mathbf{b}_{2}^{\mathsf{T}}=[$	$\displaystyle 0.615,-0.166,0.185,0.119,0.113,-0.099,-0.118,0.011,0.045,-0.025,$
		$\displaystyle 0.098,0.141,-0.482,-0.339,0.054,0.028,-0.005,0.03,0.247,-0.017,$
		$\displaystyle-0.031,0.043,0.012,0.043,0.015,0.04,0.025,0.002,0.236,-0.016],$

where the coordinates are labelled by the 30 genes in the order listed above. The 15 genes with the largest absolute values of the top foreground pattern include 10 genes among the 139 selected in [SCJ⁺23]. The 15 genes with the largest absolute values of the second foreground pattern include 13 genes from [SCJ⁺23]. Therefore, the foreground patterns obtained via cICA demonstrate consistency with the finding in [SCJ⁺23] that this subset of 139 genes captures human-specific information.
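The counts can be reproduced from the rounded coordinates printed above, as in the sketch below. The helper `count_top_in_first_block` is an illustrative name. One caveat about rounding: in $\mathbf{b}_{1}$ the 15th largest absolute value, 0.04, is shared by coordinates 1 and 22, so the count for $\mathbf{b}_{1}$ depends on how that tie is broken. The stable sort used here keeps the earlier coordinate, which gives the reported count of 10; the opposite tie-break would give 9.

```python
import numpy as np

# Rounded coordinates as printed above; the first 15 entries correspond to the
# genes drawn from the subset of 139, the last 15 to the remaining genes.
b1 = np.array([
    -0.04, -0.041, -0.09, -0.051, -0.12, 0.075, 0.01, -0.004, 0.002, 0.007,
    -0.07, -0.061, 0.95, 0.192, -0.009, -0.007, -0.002, -0.001, -0.076, -0.042,
    -0.008, -0.04, 0.005, -0.058, 0.012, -0.012, -0.05, -0.006, -0.046, -0.005,
])
b2 = np.array([
    0.615, -0.166, 0.185, 0.119, 0.113, -0.099, -0.118, 0.011, 0.045, -0.025,
    0.098, 0.141, -0.482, -0.339, 0.054, 0.028, -0.005, 0.03, 0.247, -0.017,
    -0.031, 0.043, 0.012, 0.043, 0.015, 0.04, 0.025, 0.002, 0.236, -0.016,
])

def count_top_in_first_block(b, top=15, block=15):
    """Number of the `top` largest-magnitude coordinates that lie among the first `block`."""
    # A stable sort keeps the earlier coordinate first when absolute values tie.
    order = np.argsort(-np.abs(b), kind="stable")[:top]
    return int(np.sum(order < block))

count_b1 = count_top_in_first_block(b1)
count_b2 = count_top_in_first_block(b2)
```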

A.3. Mouse protein data

There are 270 foreground samples, which measure protein expression in the cortex of mice subjected to shock therapy. Of these samples, 135 come from mice with Down syndrome and 135 from mice without. There are 135 background samples: protein expression measurements from mice without Down syndrome that did not receive shock therapy. Each sample measures the expression of 77 proteins; that is, $p=77$ .

For cICA, we preprocess using PCA as described in Section 5.1. We take $k=15$ components, which explain $90\%$ of the variance. We then choose $r$ and $\ell$ , as described in Section 5.2. That is, we compute the eigenvalues of $\mathrm{Mat}(\kappa_{4}(\mathbf{y}))$ and $\mathrm{Mat}(\kappa_{4}(\mathbf{x}))$ and rank them by magnitude; see Figure 6. Based on these plots, we choose $r=27$ and $\ell=53-27=26$ .
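The eigenvalue ranking can be illustrated with a short sketch. It uses the standard formula for the sample fourth-order cumulant of centered data and flattens the $p\times p\times p\times p$ tensor into a $p^{2}\times p^{2}$ matrix. The function names and the toy data (two independent uniform sources with fourth cumulant $-1.2$ , mixed by orthonormal columns) are assumptions made for the example, not the authors' implementation.

```python
import numpy as np

def fourth_cumulant(X):
    """Sample fourth-order cumulant tensor of the rows of X (shape p x p x p x p)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    # Fourth moment via products of coordinate pairs: P[s, (i, j)] = x_si * x_sj.
    P = (Xc[:, :, None] * Xc[:, None, :]).reshape(n, p * p)
    m4 = (P.T @ P / n).reshape(p, p, p, p)
    C = Xc.T @ Xc / n
    return (m4
            - np.einsum("ij,kl->ijkl", C, C)
            - np.einsum("ik,jl->ijkl", C, C)
            - np.einsum("il,jk->ijkl", C, C))

def flattening_eigenvalues(kappa4):
    """Eigenvalues of the p^2 x p^2 flattening, ordered by decreasing magnitude."""
    p = kappa4.shape[0]
    M = kappa4.reshape(p * p, p * p)
    eig = np.linalg.eigvalsh((M + M.T) / 2)  # symmetrize against rounding error
    return eig[np.argsort(-np.abs(eig))]

# Toy data: 2 independent sources in 3 observed variables.
rng = np.random.default_rng(0)
n, p, k = 100_000, 3, 2
A, _ = np.linalg.qr(rng.standard_normal((p, k)))          # orthonormal mixing columns
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(n, k))     # unit variance, kappa_4 = -1.2
eigs = flattening_eigenvalues(fourth_cumulant(S @ A.T))
```

In this toy case two eigenvalues stand out at roughly $-1.2$ and the remaining seven are close to zero, so the gap in the ranked magnitudes suggests the number of sources.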

For general cICA, we fix the random seed to be 0. For proportional cICA, we run the algorithm for 100 log-evenly spaced $\gamma$ between 0 and $10^{6}$ . The highest silhouette score is obtained at $\gamma=0$ .

We run cPCA for 100 values of $\alpha$ between 0 and 1000. These are the default values of $\alpha$ in the code of [AZBZ17]. Note that our parameters for proportional cICA are the squares of the cPCA parameters, since if $\mathbf{z}=\lambda\mathbf{z}^{\prime}$ , then $\kappa_{2}(\mathbf{z})=\lambda^{2}\kappa_{2}(\mathbf{z}^{\prime})$ and $\kappa_{4}(\mathbf{z})=\lambda^{4}\kappa_{4}(\mathbf{z}^{\prime})$ . We plot the choice with the highest silhouette score, which is achieved for $\alpha=26.2$ .

We run PCPCA for 100 evenly spaced $\gamma^{\prime}$ values between 0 and $0.9\cdot\frac{270}{135}$ , where 270 and 135 are the numbers of samples in the foreground and background datasets, respectively. Such choices of $\gamma^{\prime}$ are in accordance with the setup in [LJE20] and are sufficient to find the highest silhouette score. The best score is obtained when $\gamma^{\prime}=0.9\cdot\frac{270}{135}$ . In [LJE20], the authors take a further step to scale the probabilistic contrastive principal components before calculating the silhouette score. The silhouette score obtained after this additional step is 0.450.

A.4. Corrupted MNIST data

For the hyperparameters of cICA, we choose the number of components to be 30, which explains $80\%$ of the variance. We then choose $r,\ell$ for general cICA and $\ell$ for proportional cICA. We order the eigenvalues of $\mathrm{Mat}(\kappa_{4}(\mathbf{y}))$ and $\mathrm{Mat}(\kappa_{4}(\mathbf{x}))$ according to their absolute values and plot parts of the ordered eigenvalues in Figure 7. Based on these plots, we choose $r=53$ and $r+\ell=114$ .

We fix the random seed to be 0 for general cICA. The visualization for the general cICA algorithm is in Figure 4(a) and the silhouette score is 0.451.

For proportional cICA, we run the algorithm for 100 log-evenly spaced values of $\gamma$ between 0 and $10^{6}$ . The highest silhouette score is 0.508, obtained when $\gamma=0.21$ . The plot is Figure 4(b).

For cPCA, we plot the first two cPCA components. As above, we run cPCA for 100 $\alpha$ values between 0 and 1000. The highest silhouette score is 0.546, obtained when $\alpha=3.7$ . The plot is Figure 4(c).

We run PCPCA for 100 evenly spaced $\gamma^{\prime}$ between 0 and $0.9$ , in accordance with the setup in [LJE20]. The best silhouette score for the plot of the first two probabilistic contrastive principal components is 0.009, obtained when $\gamma^{\prime}=0.9$ . If we normalize the probabilistic contrastive principal components before calculating the silhouette score, the silhouette score obtained is 0.386. The plot is Figure 4(d).

A.5. Transplant gene expression data

There are 7525 pre-transplant patients and 4874 post-transplant patients in the foreground dataset. The background dataset consists of 4457 healthy patients. Each sample contains gene expression measurements of bone marrow mononuclear cells. We preprocess the data by log-transforming and subsetting to the 500 most variable genes, in accordance with previous analyses on these data [ZTB⁺17, AZBZ18, LJE20].

For the hyperparameters of cICA and HTD, we choose the number of components to be 15, which explains $54.5\%$ of the variance. We then choose $r,\ell$ for cICA and $\ell$ for proportional cICA. We order the eigenvalues of $\mathrm{Mat}(\kappa_{4}(\mathbf{y}))$ and $\mathrm{Mat}(\kappa_{4}(\mathbf{x}))$ according to their absolute values and plot parts of the ranked eigenvalues in Figure 8. We choose $r=53$ and $r+\ell=116$ .

We fix the random seed to be 0 for cICA. The visualization for the cICA algorithm is in Figure 5(a) and the silhouette score is 0.344.

For proportional cICA, we run the algorithm for 100 log-evenly spaced values of $\gamma$ between 0 and $10^{6}$ . The highest silhouette score is 0.402, obtained when $\gamma=0.50$ . The plot is Figure 5(b).

For cPCA, we plot the first two cPCA components. As above, we run cPCA using 100 values of $\alpha$ between 0 and 1000, the default values from [AZBZ17]. The highest silhouette score is 0.457, obtained when $\alpha=3.5$ . The plot is Figure 5(c).

We run PCPCA for 100 evenly spaced $\gamma^{\prime}$ between 0 and $0.9\cdot\frac{12399}{4457}$ , in accordance with [LJE20]. The numbers 12399 and 4457 are the sample sizes of the foreground and background datasets, respectively. In accordance with the experiment in [AZBZ17], we run PCPCA with 4 components. The best silhouette score over any $\gamma^{\prime}$ and any pair of probabilistic contrastive principal components is 0.164, obtained when $\gamma^{\prime}=0.41$ using the third and fourth components. If we normalize the probabilistic contrastive principal components and then calculate the silhouette score, the score obtained is 0.184. The plot is Figure 5(d).

	$\displaystyle\|T^{\prime\prime}-T^{\prime}\|=$	$\displaystyle\Big\|\sum_{i=1}^{r}\mu_{i}(\mathbf{v}_{i}^{\otimes 2}-\beta_{i}^{2}\mathbf{B}_{i}^{\otimes 2})\Big\|$
	$\displaystyle\leq$	$\displaystyle\sum_{i=1}^{r}|\mu_{i}|\,\|\mathbf{v}_{i}^{\otimes 2}-\beta_{i}^{2}\mathbf{B}_{i}^{\otimes 2}\|$
	$\displaystyle\leq$	$\displaystyle\sum_{i=1}^{r}|\mu_{i}|(\|\mathbf{v}_{i}^{\otimes 2}-\beta_{i}\mathbf{B}_{i}\otimes\mathbf{v}_{i}\|+\|\beta_{i}^{2}\mathbf{B}_{i}^{\otimes 2}-\beta_{i}\mathbf{B}_{i}\otimes\mathbf{v}_{i}\|)$
	$\displaystyle=$	$\displaystyle\sum_{i=1}^{r}|\mu_{i}|(\|\mathbf{v}_{i}\|+|\beta_{i}|\,\|\mathbf{B}_{i}\|)\|\mathbf{v}_{i}-\beta_{i}\mathbf{B}_{i}\|$
	$\displaystyle=$	$\displaystyle\sum_{i=1}^{r}|\mu_{i}|(1+|\beta_{i}|){\left(\sum_{j=2}^{r_{i}}(\beta_{i}^{(j)})^{2}\right)}^{\frac{1}{2}},$

	$\displaystyle\|(\mathbf{B}_{i}^{\prime})^{\otimes 2}-\mathbf{B}_{i}^{\otimes 2}\|\leq$	$\displaystyle\|(\mathbf{B}_{i}^{\prime})^{\otimes 2}-\mathbf{B}_{i}\otimes\mathbf{B}_{i}^{\prime}\|+\|\mathbf{B}_{i}\otimes\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}^{\otimes 2}\|$
	$\displaystyle=$	$\displaystyle\|\mathbf{B}_{i}^{\prime}\|\cdot\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|+\|\mathbf{B}_{i}\|\cdot\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|$
	$\displaystyle=$	$\displaystyle 2\|\mathbf{B}_{i}^{\prime}-\mathbf{B}_{i}\|\leq(2^{i+1}-4)\epsilon^{2},$

	$\displaystyle\|M^{\prime}\mathbf{v}_{i}-\nu_{i}\mathbf{v}_{i}\|$	$\displaystyle=\|(M^{\prime}-M)\mathbf{v}_{i}+(\mu_{i}-\nu_{i})\mathbf{v}_{i}\|$
		$\displaystyle\leq\|M^{\prime}-M\|_{2}+|\mu_{i}-\nu_{i}|$
		$\displaystyle\leq\|M^{\prime}-M\|+|\mu_{i}-\nu_{i}|\leq 2K\epsilon^{2},$

	$\displaystyle|\nu_{i}^{\prime}-\nu_{i}|$	$\displaystyle\leq|\mu_{i}\beta_{i}^{2}-\mu_{i}|+|\mu_{i}-\nu_{i}|$
		$\displaystyle\leq|\mu_{i}(\beta_{i}+1)|\cdot|\beta_{i}-1|+|\mu_{i}-\nu_{i}|$
		$\displaystyle\leq(|\nu_{i}|+|\nu_{i}-\mu_{i}|)(2+|\beta_{i}-1|)|\beta_{i}-1|+|\mu_{i}-\nu_{i}|$
		$\displaystyle\leq(|\nu_{i}|+K\epsilon^{2})(2+2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2})(2\sqrt{\frac{K}{\nu}}\epsilon+(2^{i}-2)\epsilon^{2})+K\epsilon^{2}$
		$\displaystyle=|\nu_{i}|\sqrt{\frac{K}{\nu}}\mathcal{O}(\epsilon),$
