Contrastive independent component analysis
Abstract.
Visualizing data and finding patterns in data are ubiquitous problems in the sciences. Increasingly, applications seek signal and structure in a contrastive setting: a foreground dataset relative to a background dataset. For this purpose, we propose contrastive independent component analysis (cICA). This generalizes independent component analysis to independent latent variables across a foreground and background. We propose a hierarchical tensor decomposition algorithm for cICA. We study the identifiability of cICA and demonstrate its performance visualizing data and finding patterns in data, using synthetic and real-world datasets, comparing the approach to existing contrastive methods.
1. Introduction
Finding and understanding patterns in data is fundamental in various scientific fields. Often, data have been collected under two different settings, such as a group of patients receiving a treatment and a control group, or a group of patients with a certain disease and a group without the disease. The goal may be to understand the effect of the treatment, or to understand the genetic changes that describe the disease. While standard data analysis methods can be used, which restrict attention to one of the datasets or combine them together, an alternate view is offered by contrastive methods. Contrastive methods view the two settings as a foreground and a background. They seek to learn patterns in the foreground after accounting for (or, “subtracting off”) the background. The hope is that such patterns encode useful structure and offer a good basis for dimensionality reduction and visualization of the data, to identify fine-grained structure and clusters particular to the foreground.
The contrastive viewpoint was first addressed in [ZHPA13], in contrastive topic modeling and contrastive hidden Markov models applied to genomic sequence analysis. Principal component analysis (PCA) is generalized to contrastive PCA (cPCA) in [AZBZ17, AZBZ18], where the contrastive patterns are the principal components of the foreground covariance matrix minus a scalar multiple of the background covariance matrix. The paper [SGN19] studies a linear contrastive latent variable model. Probabilistic contrastive PCA (PCPCA) is introduced in [LJE20], where foreground patterns are inferred by maximizing a likelihood ratio of linear Gaussian mixtures.
In this paper, we propose contrastive independent component analysis (cICA). Independent component analysis (ICA) is a blind source separation method, which seeks to recover latent sources and unknown mixing from observations of mixtures of signals [CJ10]. ICA assumes that latent sources are independent. In extending ICA to the contrastive setting, the idea is that background data is generated by mixing of independent sources while foreground data is generated by the background mixing together with a foreground mixing of independent sources. The patterns of interest are the foreground mixing.
We show that cICA has strong identifiability properties. These enable the contribution of each background pattern to the foreground to be found uniquely. This avoids the need for a sweep of hyperparameters to find the best multiple of the background to subtract from the foreground and even avoids the assumption that the background contribution to the foreground is via a single scalar multiple, both of which are required in cPCA and PCPCA [AZBZ17, AZ19, LJE20]. We develop tensor decomposition algorithms for cICA and show that they recover accurate patterns for synthetic data. For this, we devise a new hierarchical tensor decomposition based on recursive eigendecompositions. We turn cICA into a dimensionality reduction tool and investigate its performance on real-world data, comparing the plots to those obtained with other contrastive methods to see its competitiveness.
The paper is organized as follows. We define cICA and introduce a tensor decomposition approach to learn it in Section 2. A key ingredient of learning cICA is a new hierarchical tensor decomposition, which we introduce and study in Section 3. We study identifiability and present algorithms for cICA in Section 4. We discuss practicalities and the interpretation of cICA in Section 5. Numerical results are in Section 6.
2. From ICA to contrastive ICA
Independent component analysis (ICA) studies observations that are a linear mixture of independent source variables. We write the ICA model as
(1) $x = As$,
where $s$ is a vector of independent latent random variables, $A$ is the mixing matrix, and $x$ is a vector of observed variables. The $i$-th column of $A$ records a pattern in the data: the contribution of the source variable $s_i$ to each of the observed variables. The identifiability of ICA refers to the uniqueness of the mixing matrix $A$ and sometimes also of the source variables $s$; see [EK04, Com94, WS24].
Many algorithms for ICA proceed via tensor decomposition, see e.g. [CJ10, CS93, DLDMV01, DLCC07]. The cumulants of a distribution are symmetric tensors that encode it. The $d$-th cumulant of the observed vector $x$ is a symmetric tensor of order $d$, denoted $\kappa_d(x)$, whose dimensions all equal the number of observed variables. It has the decomposition
(2) $\kappa_d(x) = \sum_{i} \lambda_i \, a_i^{\otimes d}$,
where the scalar $\lambda_i$ is the $d$-th cumulant of the source variable $s_i$ and the vector $a_i$ is the $i$-th column of the mixing matrix $A$. This decomposition (2) follows from the multilinear properties of cumulants and the fact that cumulant tensors of independent variables are diagonal, see [McC18, Chapter 2]. The matrix $A$ can be recovered using tensor decomposition of the cumulant tensor (2). If the tensor decomposition is identifiable, then the columns $a_i$ with $\lambda_i \neq 0$ can be recovered uniquely up to permutation and scaling of columns. Thus tensor decomposition of higher-order cumulant tensors gives an algorithm for ICA, provided no source variable is Gaussian (this is required for non-zero higher-order cumulants).
In this paper we extend ICA, and tensor decomposition for ICA, to the comparison of two distributions. We call this contrastive ICA (cICA), by analogy with cPCA [AZBZ18]. We have two observed distributions, a foreground $y$ and a background $x$. Both are assumed to be linear mixtures of independent source variables. The cICA model expresses the background and foreground as
(3) $x = Az, \qquad y = Az' + Bs$.
The background distribution $x$ is a linear mixture of a random vector $z$ of independent random variables, as in (1). The foreground $y$ is a mixture of independent variables $z'$ and $s$. The columns of $A$ are the patterns in the background: column $a_i$ records how source variable $z_i$ appears among the background variables $x$ as well as how source variable $z'_i$ appears among the foreground variables $y$. The columns of $B$ are patterns that appear only in the foreground. They correspond to the variables $s$, referred to as the salient variables in [AZ19].
We propose a tensor decomposition algorithm to recover the background mixing matrix $A$ and the foreground mixing matrix $B$ from (3). These matrices record the patterns that encode our background and foreground distributions. We apply the algorithm to empirical cumulant tensors of the background $x$ and the foreground $y$ obtained from sample data. We order the columns of the matrix $B$ to obtain a dimensionality reduction tool. We work under the assumption that all source variables are non-Gaussian, an assumption that also appears for usual ICA. This can likely be relaxed to the assumption that at most one source is Gaussian, cf. [Com94, WS24].
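Under the model (3), the $d$-th cumulants of the background and foreground data are, respectively,
(4) $\kappa_d(x) = \sum_{i=1}^{r} \lambda_i \, a_i^{\otimes d}, \qquad \kappa_d(y) = \sum_{i=1}^{r} \lambda'_i \, a_i^{\otimes d} + \sum_{j=1}^{\ell} \mu_j \, b_j^{\otimes d}$,
where $\lambda_i$ is the $d$-th cumulant of $z_i$, $\lambda'_i$ is the $d$-th cumulant of $z'_i$, and $\mu_j$ is the $d$-th cumulant of $s_j$. Here $a_1, \dots, a_r$ are the columns of the background mixing matrix $A$ and $b_1, \dots, b_\ell$ are the columns of the foreground mixing matrix $B$. This follows from the multilinearity of cumulants and the fact that cumulant tensors of independent sources are diagonal, as for usual ICA. We hence have the following optimization problem to recover $A$ and $B$: find a joint decomposition of the cumulant tensors $\kappa_d(x)$ and $\kappa_d(y)$ of the form in (4). Our approach is: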
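(1) Compute a symmetric tensor decomposition of the background cumulant to learn the background mixing matrix.
(2) Find the coefficient of each background rank one term in the foreground cumulant, and subtract these terms to obtain the part of the foreground cumulant that involves only foreground patterns.
(3) Compute a symmetric tensor decomposition of the remaining tensor to learn the foreground mixing matrix.
We work with the fourth-order cumulants. We propose a hierarchical eigendecomposition based algorithm to decompose an order four symmetric tensor, which we describe in the next section. This is a key ingredient of our cICA algorithm.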
2.1. Related Work
We relate cICA to other contrastive models. Setting and studying observed distributions and via their covariance matrices () specializes cICA to cPCA from [AZBZ17, AZBZ18]. cICA also relates to PCPCA [LJE20] but we do not impose distributional assumptions, beyond independence, on the variables and . Finally, the cICA model fits into the contrastive latent variable model framework of [SGN19], but we disregard noise terms and do not impose .
The setting of cICA relates to usual ICA, with block structure on the mixing matrix:
However, learning parameters via usual ICA in either of these settings requires access to the joint distribution of , In the first setting, are independent, so we can build the joint distribution of from unpaired observations of and . Identifiability can be characterized using [Com94], or using [EK04, WS24] if the model is overcomplete (i.e. the number of sources exceeds the number of observations, which occurs for ). However, in practice, the independence assumption on all of is too strong. In the second setting, we do not have access to the joint distribution of unless we have paired data samples, which is unrealistic in the settings we study.
In [SSDU24], the authors study multi-modal linear ICA. They recover the mixing matrices from each mode via usual linear ICA and then use a hypothesis test to decide which latent variables should be shared across modes. Our method differs from this as we seek shared patterns across datasets rather than shared latent variables.
Contrastive methods related to nonlinear ICA have been explored in the literature. Nonlinear ICA is studied using contrastive learning in [HM16, HST19, LF22]. Here contrastive is used in a different sense: it describes a method to train a network to distinguish two datasets. A nonlinear contrastive method called a contrastive variational autoencoder (cVAE) is introduced in [AZ19, SGN19]. The paper [WBWL22] presents a method for cVAE using maximum mean discrepancy to prevent leakage of information between the two sets of latent variables. Identifiability of cVAE is studied using results from nonlinear ICA in [LHH+24]. These works produce a nonlinear latent encoding of data, whereas our focus is on pattern vectors to describe observed variables.
3. Hierarchical tensor decomposition
We introduce a hierarchical tensor decomposition (HTD) that decomposes an order four tensor via recursive eigendecompositions. The idea is to find a low-rank approximation of a tensor, whose rank one summands offer an interpretable basis on which to project data. Later, we use the decomposition for cICA. In this section, we define the decomposition and study its properties.
3.1. The HTD algorithm
Consider a symmetric tensor $T$ of format $n \times n \times n \times n$. We compute a rank $r$ approximation,
(5) $T \approx \sum_{i=1}^{r} \lambda_i \, v_i^{\otimes 4}$,
as follows. Let $M$ be the flattening of $T$ that rearranges its entries into a matrix of size $n^2 \times n^2$. The entries of $M$ are indexed $((i,j),(k,l))$, where $1 \leq i,j,k,l \leq n$. We compute the approximation (5) by first computing the eigendecomposition of $M$, whose eigenvectors lie in $\mathbb{R}^{n^2}$, and then by reshaping these eigenvectors into $n \times n$ matrices and computing the top eigenvalue and corresponding eigenvector of each. By top eigenvalue we mean the eigenvalue of highest magnitude. To our knowledge this decomposition has not been studied before, but it has connections to the hierarchical tensor representations of [Hac12, Chapter 11] and to the PARATREE model in [SRK09], see Subsection 3.3. The procedure is stated as Algorithm 1.
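The two-level eigendecomposition described above can be sketched in NumPy. This is a minimal illustration rather than the paper's implementation: the function names `htd` and `rank_one_sum` are hypothetical, and the coefficient of each rank one term is assumed to be the flattening eigenvalue times the square of the top eigenvalue of the reshaped eigenvector, which is the choice that makes the output agree with the input for orthogonally decomposable tensors.

```python
import numpy as np

def htd(T, r):
    """Sketch of a rank-r hierarchical decomposition of a symmetric n x n x n x n tensor.

    Returns coefficients c (length r) and unit vectors V (n x r) such that
    T is approximated by sum_i c[i] * V[:, i]^{(x)4}.
    """
    n = T.shape[0]
    M = T.reshape(n * n, n * n)
    M = (M + M.T) / 2                        # guard against round-off asymmetry
    sig, U = np.linalg.eigh(M)
    top_r = np.argsort(-np.abs(sig))[:r]     # eigenvalues of highest magnitude
    coeffs, vecs = [], []
    for k in top_r:
        Uk = U[:, k].reshape(n, n)           # reshape eigenvector into a matrix
        Uk = (Uk + Uk.T) / 2
        mu, W = np.linalg.eigh(Uk)
        j = np.argmax(np.abs(mu))            # top eigenpair of the reshaped matrix
        # Assumption: coefficient = (flattening eigenvalue) * (top eigenvalue)^2.
        coeffs.append(sig[k] * mu[j] ** 2)
        vecs.append(W[:, j])
    return np.array(coeffs), np.column_stack(vecs)

def rank_one_sum(coeffs, V):
    """Assemble sum_i coeffs[i] * v_i^{(x)4} from the columns v_i of V."""
    return np.einsum("i,ai,bi,ci,di->abcd", coeffs, V, V, V, V)

# Usage on an orthogonally decomposable tensor with distinct coefficients.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
true_coeffs = np.array([3.0, -2.0, 1.0])
T = rank_one_sum(true_coeffs, Q[:, :3])
coeffs, V_hat = htd(T, r=3)
T_hat = rank_one_sum(coeffs, V_hat)
```

In this orthogonal example the reconstruction `T_hat` matches `T` up to round-off. The sign of each recovered vector is arbitrary, which does not affect the fourth-order rank one terms.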
We record some observations pertaining to Algorithm 1. The matrix is symmetric since is symmetric. The matrices are also symmetric, because the vectors are in the column space of , whose -th row coincides with its -th row. Although the output vectors are in general not orthogonal, as each is an eigenvector of a distinct matrix, they can be nearly orthogonal in practice, see Section 3.2. This is because they are the leading eigenvectors of matrices that have been reshaped from orthogonal vectors .
Example 3.1 ( example).
Let . Fix
with eigenvalues and associated eigenvectors
Their corresponding matrices are symmetric with top eigenvalues and , respectively, with associated eigenvectors and . The HTD algorithm with input and thus outputs
(6) |
We note the similarity to the input tensor .
3.2. Properties of the decomposition
The HTD algorithm outputs a rank approximation of a tensor. In certain cases, the output closely approximates the input tensor, as in Example 3.1. We bound the distance between the HTD approximation and the input tensor. We give a bound that applies to all tensors in Proposition 3.2. We show that the input and output coincide for orthogonally decomposable tensors in Proposition 3.3. Our main result is Theorem 3.4, which bounds the distance between an input and output tensor for a tensor decomposition involving vectors that are close to orthogonal.
The norm refers to the Frobenius norm for matrices and tensors and the -norm for vectors; i.e., the square root of the sum of the squares of the entries. The -norm of a matrix is denoted by .
Proposition 3.2.
Let be a symmetric tensor of format . Let be the rank HTD approximation of . Then
where is the rank of , is the rank of , the numbers are the eigenvalues of in descending order of magnitude, and is the highest magnitude eigenvalue of with the other eigenvalues.
Proof.
We use the notation from Algorithm 1. We have
from the properties of the eigendecomposition of a symmetric matrix and the Frobenius norm. Let be the tensor obtained from reshaping the truncated eigendecomposition of . Then . Let be the vectorization of . Then
where the penultimate equality follows from and the last equality uses . We conclude by the triangle inequality . ∎
The quantity in Proposition 3.2 is small if is well-approximated by a matrix of rank , and each is well-approximated by a matrix of rank one. Orthogonally decomposable tensors are those with a decomposition into orthogonal rank one terms; that is, a decomposition , where are orthonormal [Rob16]. For orthogonally decomposable tensors, HTD recovers the exact decomposition.
Proposition 3.3.
Let , where the vectors are orthonormal and the coefficients are distinct. Then the rank HTD approximation is the tensor .
Proof.
The flattening has decomposition , where is the vectorization of . We have the orthogonality for all , since the vectors are orthogonal. Hence this expression for is a sum of outer products of orthogonal vectors, so it is the eigendecomposition of . The matrix reshaped from the eigenvector is . It has top eigenvalue with corresponding eigenvector . Hence the output of HTD is . ∎
We extend Proposition 3.3 to decompositions where the vectors are close to orthogonal.
Theorem 3.4.
Let . Fix with distinct and in decreasing order of magnitude and unit vectors with linearly independent. Assume that for all , for small . Suppose the output of rank HTD applied to is with in decreasing order of magnitude and unit vectors. Then
where and .
Note that the quantity arises in Theorem 3.4, because of the sign indeterminacy in the vectors in the decompositions, due to the equality for even.
We prove Theorem 3.4 via two lemmas. The condition that the matrices are linearly independent ensures that has rank . This condition holds for generic vectors , provided .
Lemma 3.5.
Assume the hypotheses of Theorem 3.4. Let be the vectorization of for and define , the flattening of tensor . Then there exists a matrix with eigendecomposition
(7) |
such that for and
Proof.
Our first goal is to construct the vectors . We generate orthogonal vectors via Gram-Schmidt:
The vectors satisfy for all and for all . We will prove by induction on that
The proofs rely on the inequalities
and
where the last inequality uses Cauchy-Schwarz. When , and , as desired. Suppose the statement is true for all . Then,
and
This concludes the induction. We now define . Then are orthonormal, hence (7) is the eigendecomposition of . Moreover,
It remains to bound . We have . The summands satisfy
where the first and third lines follow from the triangle inequality and the second follows from . Collecting the summands proves the claim. ∎
Lemma 3.6.
Proof.
We have the bound
where the first inequality follows from Weyl’s inequality [Wey12] and the second from Lemma 3.5. For the similarity of eigenvectors, we lower bound using the difference . We have
where the second inequality follows from the triangle inequality and the definition of the 2-norm and the third inequality follows from the fact that the 2-norm for matrices is bounded above by the Frobenius norm. If we complete to an orthonormal basis of . Then . Hence
Then , where . Then
Since , we have
Proof of Theorem 3.4.
For , the matrix reshapes the unit vector into a matrix. The output of HTD is as in the statement, where are the eigenvectors of , respectively. The output of HTD has non-zero summands, since the matrix has rank . This follows from the fact that vectors are linearly independent. Then,
We use the distance between and to bound the distance between their eigenvectors and eigenvalues:
by Weyl’s inequality. Note from Algorithm 1 that . Hence
where the first and second inequalities follow from the triangle inequality and the third is obtained via substituting in the bounds and , obtained above.
To bound , we consider . On the one hand, we have
On the other hand, and . So, . Hence,
The last expression simplifies to . ∎
3.3. Comparison of HTD with other hierarchical tensor decompositions
We compare HTD in Algorithm 1 to other hierarchical tensor decompositions. The goal of hierarchical tensor decomposition [Hac12, Chapter 11] is to efficiently represent a tensor that lives in a high-dimensional space. Given a tensor of order , a hierarchical decomposition is based on a hierarchy of vector spaces given by a dimension partition tree on indices , such as those in Figure 1.
Hierarchical tensor representations in [Hac12, Chapter 11] start at the leaves of the tree, which are labelled by single indices. One finds subspaces such that the tensor is well-approximated by a tensor in the lower-dimensional space . Proceeding from leaves to root, when two indices and combine to form the subset , the representation finds a subspace that well-approximates the tensor. This repeats until we have a low-dimensional subspace such that the tensor lies in this subspace to reasonable accuracy. Fixing ranks in the representation fixes the allowable dimension of the subspaces for the subsets in the tree. See [Hac12, Figure 11.1].
The PARATREE model starts at the root of the tree. For example, if the root is the splitting of into (i.e. Figure 1 in the case ) then one computes a decomposition of the flattened tensor in to give a sum , with and . The second step is the splitting of indices . This decomposes each vector , where is viewed as a matrix of size . This results in the decomposition
(8) |
This pattern can be continued for larger , see [SRK09, Equation 9].
Our HTD takes as input a symmetric tensor. We use the dimension partition tree in Figure 1(b). HTD can be viewed as a symmetric analogue of the PARATREE model, but differs in that it uses a different dimension partition tree, and leverages the symmetry of the tensor and decomposition to produce a rank decomposition, rather than the rank (or, more generally, rank ) decomposition obtained from (8). Compared to the hierarchical tensor representations of [Hac12, Chapter 11], it differs in that the tensor is symmetric and it uses the dimension partition tree from root to leaves rather than leaves to root.
4. Tensor decompositions for cICA
The cICA model assumes and , for and , see (3). This leads to the cICA tensor decompositions (4). We present two variants of cICA to compute the decompositions (4). One does not assume a relationship between and ; we call this general cICA, see Section 4.1. The other assumes the proportional relationship for some scalar ; we call this proportional cICA, see Section 4.2. We explain how to use cICA for dimensionality reduction in Section 4.3. This projects data onto a subspace given by certain columns of the foreground mixing matrix .
4.1. General cICA
We present Algorithm 2 for general cICA. Steps 1 and 3 both decompose a symmetric order four tensor. We use the subspace power method [KP19] in Step 1 to prioritize the accuracy of the tensor decomposition. We use Algorithm 1 in Step 3 to prioritize interpretability and efficiency.
We study the identifiability of the algorithm, that is, the uniqueness of the vectors and scalars it outputs, assuming genericity. Our genericity assumption holds almost surely in the space of parameters.
We use the following lemma.
Lemma 4.1.
Let vectors and scalars be generic. Then the decomposition of a symmetric tensor is unique for
Proof.
Proposition 4.2 (Identifiability of the cICA tensor decomposition).
The joint decomposition
(9) |
is unique for generic , where and , when for , for , and when for .
Proof.
The cICA tensor decomposition in the statement is identifiable when the symmetric tensor decomposition of is unique, as follows. The tensor decomposition of , gives vectors up to permutation and scaling. Then we can solve a linear system to find the decomposition . It therefore remains to study the identifiability of the decomposition of . It is a symmetric tensor of rank . Hence the uniqueness follows from Lemma 4.1, setting . ∎
We say that Algorithm 2 is identifiable if, for generic where , , we can uniquely recover the vectors , the coefficients , and the vectors .
Proposition 4.3.
Algorithm 2 is identifiable when for and for .
To prove Proposition 4.3 and, later, Theorem 4.5, we use the following linear algebra result; see [KP19, Lemma B.1] for a proof.
Lemma 4.4.
Let and be full rank matrices with . Let , where denotes the pseudo-inverse, and . Then
with equality if and only if .
Proof of Proposition 4.3.
Tensors and are generic rank and rank tensors, respectively. So, the identifiability of Steps 1 and 3 of Algorithm 2 hold if for or for or for , setting and in Lemma 4.1.
It remains to consider Step 2, learning the coefficients of in . The flattening of has the form , where vectorize and , respectively. The scalar is unique if , by Lemma 4.4. It is , where is the thin eigendecomposition of . In particular, the coefficient is unique when
For generic and , this holds provided is at most , the dimension of the space of symmetric matrices. Inequalities for and for hold. Combining the above conditions, Algorithm 2 is identifiable when for and for . ∎
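The coefficient computation in Step 2 can be illustrated numerically. The sketch below assumes that the coefficient of a rank one term $\alpha\alpha^\top$ in a flattening $M$ is given by the deflation identity $1/(\alpha^\top M^{+} \alpha)$, with the pseudo-inverse built from the thin eigendecomposition of $M$; this identity holds when the vectorized rank one terms are linearly independent. The helper names and the synthetic sizes are hypothetical.

```python
import numpy as np

def vec_outer(v):
    """Vectorize the symmetric matrix v v^T (row-major)."""
    return np.outer(v, v).ravel()

def rank_one_coefficient(M, alpha, rank):
    """Coefficient of alpha alpha^T in the symmetric matrix M.

    Uses c = 1 / (alpha^T M^+ alpha), where M^+ comes from the thin
    eigendecomposition M = V D V^T with `rank` non-zero eigenvalues.
    """
    d, V = np.linalg.eigh((M + M.T) / 2)
    keep = np.argsort(-np.abs(d))[:rank]
    proj = V[:, keep].T @ alpha
    return 1.0 / np.sum(proj ** 2 / d[keep])

# Synthetic flattening of a foreground cumulant with known coefficients.
rng = np.random.default_rng(0)
p, r, ell = 4, 2, 2
A = rng.standard_normal((p, r))      # background patterns (columns)
B = rng.standard_normal((p, ell))    # foreground patterns (columns)
lam_fg = np.array([2.0, -1.5])       # coefficients of background patterns in the foreground
mu = np.array([1.0, 0.5])            # coefficients of foreground patterns

M_y = sum(l * np.outer(vec_outer(a), vec_outer(a)) for l, a in zip(lam_fg, A.T))
M_y = M_y + sum(m * np.outer(vec_outer(b), vec_outer(b)) for m, b in zip(mu, B.T))

recovered = np.array(
    [rank_one_coefficient(M_y, vec_outer(a), r + ell) for a in A.T]
)
```

Here `recovered` agrees with `lam_fg` up to round-off, even though the foreground patterns `B` are never supplied to the function.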
In some settings, we assume that the vectors are orthogonal. In particular, . This assumption is natural for visualization purposes, since the projection onto foreground patterns is orthogonal. In this case, HTD gives an exact decomposition, by Proposition 3.3. The identifiability requirements are the same as in Propositions 4.2 and 4.3, as follows. The identifiability conditions in the two propositions are unchanged under a change of basis by an invertible matrix. When , we can apply a change of basis to so that the vectors become orthogonal. We apply the same change of basis to .
4.2. Proportional cICA
In this subsection, we assume a proportional relationship for some scalar . This assumption also appears in cPCA [AZBZ17]. There, the choice of the hyperparameter is not unique. However, in our setting—which involves the fourth-order cumulants and , under the assumption that —the value of is uniquely determined, with a closed-form expression, see Theorem 4.5. The details of the ensuing algorithm for computing matrix are as follows.
Theorem 4.5.
Consider proportional cICA with , for . For generic and with and , the hyperparameter is the unique value , where is any index between and , is the coefficient of in and is the thin eigendecomposition of .
Proof.
The flattenings of the cumulants and are, respectively,
where vectorize the matrices and , respectively and we use that . We have and , by the assumptions in the statement.
Let be the matrix with columns and define . We have . Suppose that is the thin eigendecomposition of . We have
We have that , the upper bound , and finally that . Hence
by Lemma 4.4. Matrices can be recovered uniquely from tensor decomposition of and the eigendecomposition of . So can also be recovered uniquely, and hence is unique: it is for any . ∎
One can test the proportionality assumption by seeing whether the values from Theorem 4.5 are approximately equal as varies. In practice, exact proportionality may not hold, and learning via Theorem 4.5 could be challenging. An alternative is to use a sweep of values and choose according to visualization plots, a similar method to that used in cPCA [AZBZ17].
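The proportionality check described above can be sketched as follows. The sketch assumes that proportionality means each background pattern has a foreground coefficient equal to a common scalar times its background coefficient, and it computes each coefficient with the pseudo-inverse identity $1/(\alpha^\top M^{+}\alpha)$. The function names, the sizes, and the value `alpha_true` are hypothetical.

```python
import numpy as np

def vec_outer(v):
    """Vectorize the symmetric matrix v v^T (row-major)."""
    return np.outer(v, v).ravel()

def rank_one_coefficient(M, alpha, rank):
    """Coefficient of alpha alpha^T in M via 1 / (alpha^T M^+ alpha)."""
    d, V = np.linalg.eigh((M + M.T) / 2)
    keep = np.argsort(-np.abs(d))[:rank]
    proj = V[:, keep].T @ alpha
    return 1.0 / np.sum(proj ** 2 / d[keep])

def proportionality_ratios(M_x, M_y, A, rank_x, rank_y):
    """Ratio of foreground to background coefficient for each column of A."""
    ratios = []
    for a in A.T:
        v = vec_outer(a)
        ratios.append(
            rank_one_coefficient(M_y, v, rank_y) / rank_one_coefficient(M_x, v, rank_x)
        )
    return np.array(ratios)

def flattening(coeffs, vectors):
    """Flattening of sum_i coeffs[i] * v_i^{(x)4} for the columns v_i."""
    return sum(c * np.outer(vec_outer(v), vec_outer(v)) for c, v in zip(coeffs, vectors.T))

rng = np.random.default_rng(1)
p, r, ell = 4, 3, 2
A = rng.standard_normal((p, r))
B = rng.standard_normal((p, ell))
lam = np.array([2.0, -1.5, 1.0])
mu = np.array([1.0, 0.5])
alpha_true = 0.7                      # hypothetical proportionality constant

M_x = flattening(lam, A)
M_y = flattening(alpha_true * lam, A) + flattening(mu, B)

ratios = proportionality_ratios(M_x, M_y, A, rank_x=r, rank_y=r + ell)
spread = ratios.max() - ratios.min()  # near zero when proportionality holds
```

With sample cumulants the ratios will only agree approximately, so a practical check would compare `spread` against a tolerance chosen for the noise level.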
4.3. cICA for dimensionality reduction
Usual ICA has been used as a tool to project data, see [Dom18, GW20, LM08]. We extend this to cICA. In practice, the input to cICA consists of samples from the foreground and background distributions, which comprise the foreground data and the background data. From these samples we construct the fourth-order sample cumulants of the foreground and background. The construction, together with an optional PCA preprocessing step that reduces the dimension before the cumulants are formed, is described in Section 5.1.
We compute the mixing matrix with columns using Algorithm 2 or 3. When employing cICA for dimensionality reduction, we project the foreground data onto . For a two-dimensional plot, we plot the projections for a pair . To select the most relevant vectors out of our recovered vectors , we order them by the ratio
(11) |
We justify this ranking in Section 5.2. We interpret the axes of a cICA dimensionality reduction plot in Section 5.3.
5. Practicalities and interpretation of cICA
In this section, we discuss the practicalities of cICA: preprocessing the input to speed up the algorithm and how to choose the ranks and . We also discuss how to interpret coordinates when viewing cICA as a dimensionality reduction method.
5.1. Preprocessing with PCA
A dataset of samples in gives a data matrix . Its fourth cumulant is computed as follows. Let denote the mean vector over all observations. The sample covariance matrix for has entries . The fourth-order central sample moment is a tensor with entries Entry of the fourth-order sample cumulant is If the data are samples from a distribution , this sample cumulant approximates . The computation for is similar.
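A minimal NumPy sketch of this construction follows. It assumes that rows of the data matrix are samples, that the covariance and fourth moment are normalized by the number of samples, and that the cumulant is the fourth central moment minus the three pairings of covariance entries, which is the standard relation between fourth moments and fourth cumulants. The function name is hypothetical.

```python
import numpy as np

def fourth_cumulant(X):
    """Fourth-order sample cumulant of a data matrix X of shape (N, p).

    Rows are samples and columns are variables (an assumption of this sketch).
    Returns a symmetric tensor of shape (p, p, p, p).
    """
    N = X.shape[0]
    Z = X - X.mean(axis=0)                       # center each variable
    S = Z.T @ Z / N                              # sample covariance (1/N normalization)
    M4 = np.einsum("na,nb,nc,nd->abcd", Z, Z, Z, Z) / N   # fourth central moment
    return (
        M4
        - np.einsum("ab,cd->abcd", S, S)
        - np.einsum("ac,bd->abcd", S, S)
        - np.einsum("ad,bc->abcd", S, S)
    )

# One-variable check: values -1 and 1 with equal frequency have
# variance 1 and fourth moment 1, so the cumulant is 1 - 3 = -2.
K_small = fourth_cumulant(np.array([[-1.0], [1.0], [-1.0], [1.0]]))

# A small multivariate example; the result is a symmetric tensor.
rng = np.random.default_rng(0)
K = fourth_cumulant(rng.standard_normal((500, 3)))
```

The moment tensor has $p^4$ entries, which is why reducing $p$ before this step matters for larger datasets.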
When is large, forming the fourth cumulants may be prohibitively expensive. To get around this, one can reduce the dimension before forming the cumulants, as follows. We combine the foreground and background datasets together to form a single dataset, a matrix of size . Let have as its columns the top principal components of this combined data. The background and foreground transformed variables then have the form
(12) |
respectively, where and . The recovered foreground patterns from cICA are the columns of . The columns of convert these projected foreground patterns back into the original space.
In practice, for our data visualization in Section 6.2, we choose the number of PCA components to be 30 or the number of components that explains at least variance, whichever comes first.
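The preprocessing step can be sketched as below. The sketch assumes rows are samples, computes the principal components of the stacked data with an SVD, and takes the number of components `k` as an argument rather than choosing it by explained variance. The function name, the random data, and the placeholder `B_red` (standing in for patterns that cICA would return in the reduced space) are hypothetical.

```python
import numpy as np

def joint_pca_basis(X, Y, k):
    """Top-k principal directions of the combined background and foreground data.

    X has shape (N_bg, p) and Y has shape (N_fg, p). Returns W of shape (p, k)
    with orthonormal columns.
    """
    Z = np.vstack([X, Y])
    Z = Z - Z.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:k].T

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))   # background samples
Y = rng.standard_normal((150, 50))   # foreground samples

W = joint_pca_basis(X, Y, k=10)
X_red, Y_red = X @ W, Y @ W          # transformed background and foreground

# Placeholder for foreground patterns found in the reduced space.
B_red = rng.standard_normal((10, 3))
B_full = W @ B_red                   # patterns mapped back to the original space
```

Cumulants of order two and higher are unchanged by shifting the data, so it does not matter here whether the projected data are centered before the cumulants are formed.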
5.2. Choosing the ranks
When computing the tensor decompositions in cICA, a key step is to determine the ranks and . To choose the ranks, we can use the flattenings of the cumulants, the matrices . If the expressions for the cumulant tensors and in (4) hold exactly, and if and the vectors are generic, then
For non-exact cumulants, such as sample cumulants, we do not work with the exact ranks of the flattening matrices, but instead examine plots of the eigenvalues in descending magnitude (see e.g. Figure 7) to choose an appropriate cut-off. We choose the background rank at the point where the decrease of the eigenvalue plot for the flattening of the background cumulant slows down, choose the rank of the flattening of the foreground cumulant in the same way, and calculate the foreground rank from these two values. General cICA has the two ranks as hyperparameters; proportional cICA has one hyperparameter.
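The eigenvalue profile used for rank selection can be computed as in the sketch below. It uses exact synthetic cumulant tensors, for which a relative tolerance separates zero from non-zero eigenvalues. It assumes that, for generic patterns, the flattening rank of the background cumulant is the number of background patterns and that of the foreground cumulant is the total number of patterns, so the foreground rank is taken as the difference. All names and sizes are hypothetical.

```python
import numpy as np

def flattening_spectrum(K):
    """Eigenvalues of the n^2 x n^2 flattening of K, in descending magnitude."""
    n = K.shape[0]
    M = K.reshape(n * n, n * n)
    ev = np.linalg.eigvalsh((M + M.T) / 2)
    return ev[np.argsort(-np.abs(ev))]

def numerical_rank(K, tol=1e-8):
    """Number of eigenvalues above a relative tolerance (exact tensors only)."""
    ev = flattening_spectrum(K)
    return int(np.sum(np.abs(ev) > tol * np.abs(ev[0])))

def rank_one_sum(coeffs, V):
    """Assemble sum_i coeffs[i] * v_i^{(x)4} from the columns v_i of V."""
    return np.einsum("i,ai,bi,ci,di->abcd", coeffs, V, V, V, V)

rng = np.random.default_rng(0)
p, r, ell = 4, 3, 2
A = rng.standard_normal((p, r))
B = rng.standard_normal((p, ell))
K_x = rank_one_sum(np.array([2.0, -1.5, 1.0]), A)
K_y = rank_one_sum(np.array([1.0, 0.8, -0.6]), A) + rank_one_sum(np.array([1.2, -0.9]), B)

rank_x = numerical_rank(K_x)
rank_y = numerical_rank(K_y)
ell_hat = rank_y - rank_x            # assumed relation between the two ranks
```

For sample cumulants no eigenvalue is exactly zero, so one would plot `np.abs(flattening_spectrum(K))` and pick the cut-off by eye rather than rely on `numerical_rank`.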
We discuss how the results may be affected by an incorrect choice of and and justify our proposed way to order the foreground patterns by importance in (11).
Let the true ranks be $r$ and $\ell$, and assume that we have used $\hat{r}$ and $\hat{\ell}$ in the input to Algorithm 2.
• If $\hat{\ell} > \ell$, then $\hat{\ell} - \ell$ foreground patterns are noise.
• If $\hat{\ell} < \ell$, then $\ell - \hat{\ell}$ foreground patterns are not recovered.
• If $\hat{r} < r$, then background patterns are mixed with foreground patterns, as follows. Assuming without loss of generality that we have recovered $a_1, \dots, a_{\hat{r}}$, the third step of Algorithm 2 decomposes the tensor $\sum_{i=\hat{r}+1}^{r} \lambda'_i a_i^{\otimes 4} + \sum_{j=1}^{\ell} \mu_j b_j^{\otimes 4}$ via HTD, as in Algorithm 1. If the orthogonality hypotheses of Proposition 3.3 hold, then the foreground patterns are recovered together with some background patterns that are incorrectly interpreted as foreground patterns. If the approximate orthogonality hypotheses of Theorem 3.4 hold, then the foreground patterns are recovered approximately, together with background patterns that are classed as foreground patterns. Without an orthogonality condition, the recovered foreground patterns will be polluted but still roughly collinear to the true foreground patterns when $r - \hat{r}$ is small or when the dimension of the dataset is large, since random vectors in high dimension are almost orthogonal.
• If $\hat{r} > r$, then foreground patterns are mixed with background noise, as follows. Some background patterns from Algorithm 2 will be noise, say $a_{r+1}, \dots, a_{\hat{r}}$. Step 2 of Algorithm 2 computes the coefficients of the tensors $a_i^{\otimes 4}$ in the foreground cumulant, though for $i > r$ they are not true rank one components of it. In Step 3, the tensor to be decomposed is therefore the foreground part $\sum_{j=1}^{\ell} \mu_j b_j^{\otimes 4}$ together with multiples of the noise terms $a_i^{\otimes 4}$ for $i > r$. As in the case $\hat{r} < r$, the foreground patterns can still be exactly or approximately recovered, under the hypotheses of Proposition 3.3 and Theorem 3.4 respectively, albeit with some background noise recovered as foreground patterns.
The above discussion shows that when the input ranks differ from the true ranks, the vectors obtained from Algorithm 2 could represent foreground patterns, background patterns, or noise. We order the vectors according to (11). The denominator of (11) is the variance of the background dataset after the linear transformation given by the vector. The numerator is the variance of the foreground dataset after the same transformation. Their ratio enables us to select the most relevant foreground patterns, as follows.
• If the vector is a foreground pattern, we expect the denominator to be small relative to the numerator, hence a large ratio.
• If the vector is a background pattern, we expect the numerator to be approximately a constant multiple of the denominator, hence a ratio close to that constant.
• If the vector is foreground noise, we expect a small numerator, hence a small ratio.
• If the vector is background noise, we expect a small denominator, hence a large ratio. To prevent background noise showing up among the recovered foreground patterns, we require that the input background rank does not exceed the true background rank.
In practice, we consider those patterns for which exceeds a certain threshold, or take the patterns with the two highest values of .
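The ordering by variance ratio can be sketched in a few lines. The sketch assumes rows of the data matrices are samples and columns of the pattern matrix are the candidate vectors; the function name and the toy data are hypothetical.

```python
import numpy as np

def rank_patterns(B, X, Y):
    """Order candidate patterns by foreground-to-background variance ratio.

    B has shape (p, m) with candidate patterns as columns; X (N_bg, p) is the
    background data and Y (N_fg, p) the foreground data. Returns the column
    order (largest ratio first) and the ratios themselves.
    """
    ratio = np.var(Y @ B, axis=0) / np.var(X @ B, axis=0)
    order = np.argsort(-ratio)
    return order, ratio

# Toy data: the second coordinate varies far more in the foreground.
X = np.array([[1.0, 0.1], [-1.0, -0.1]])
Y = np.array([[1.0, 2.0], [-1.0, -2.0]])
B = np.eye(2)

order, ratio = rank_patterns(B, X, Y)
top_two = B[:, order[:2]]            # patterns used for a two-dimensional plot
```

In the toy example the first candidate has ratio 1 and the second has ratio 400, so the second is ranked first. A threshold on `ratio` can replace the fixed choice of the top two.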
5.3. Visualization
We discuss how to interpret coordinates when using cICA for dimensionality reduction. The following proposition relates the projections for to the latent variables .
Proposition 5.1.
Consider the cICA model in (3). Suppose for . Assume that for some small that and for , . Then, for each ,
where and are upper bounds on the magnitudes of random variables in and . In particular, approximates the component with an error linear in .
Proof.
The almost orthogonality conditions in Proposition 5.1 are strong requirements. However, they can be relaxed – if for chosen and sources and have wider variance than and , then plotting against still approximates the plot of against .
If and are uncorrelated, we expect the plot of against to show axis-aligned clusters; otherwise, clusters may not be axis-aligned. We specify the condition for and to be uncorrelated, assuming that all variables in the tuple have the same variance.
Proposition 5.2.
Consider the cICA model in (3). Suppose that the latent variables form a tuple of independent random variables with the same variance. Then and are uncorrelated if and only if .
Proof.
Write and . By the bilinearity of the covariance
The last expression is zero if and only if . ∎
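The bilinearity argument can be written out in a generic form. The notation below is illustrative and need not match the notation of the proposition: we assume a linear mixture $x = Ms$ of independent sources $s_1, \ldots, s_k$ with common variance $\sigma^2$, and two projection directions $u$ and $v$.

```latex
\begin{aligned}
\operatorname{Cov}(u^\top x,\, v^\top x)
  &= \operatorname{Cov}\Big( \sum_{j} (M^\top u)_j \, s_j ,\; \sum_{k} (M^\top v)_k \, s_k \Big) \\
  &= \sum_{j,k} (M^\top u)_j \, (M^\top v)_k \, \operatorname{Cov}(s_j, s_k) \\
  &= \sigma^2 \sum_{j} (M^\top u)_j \, (M^\top v)_j
   \;=\; \sigma^2 \, \langle M^\top u ,\, M^\top v \rangle .
\end{aligned}
```

The second equality is bilinearity, and the third uses independence, which gives $\operatorname{Cov}(s_j, s_k) = 0$ for $j \neq k$ and $\operatorname{Cov}(s_j, s_j) = \sigma^2$. Under these assumptions, the two projections are uncorrelated exactly when $M^\top u$ and $M^\top v$ are orthogonal.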
6. Numerical experiments
We investigate the performance of cICA for finding patterns in data (Section 6.1) and for data visualization (Section 6.2). Our code is available on GitHub at https://github.com/QWE123665/cICA.
6.1. Finding patterns
The cICA patterns are the foreground vectors . We show that cICA recovers these vectors accurately for synthetic data, comparing it with cPCA [AZBZ17] and PCPCA [LJE20]. We also apply cICA to gene expression data from [SCJ+23]. Taking monkey gene expression as the background and human gene expression as the foreground, we relate the cICA patterns to existing results to identify genes responsible for human evolution.
6.1.1. Synthetic data
We use synthetic data to assess the accuracy of the patterns recovered by cICA, both for general cICA (Algorithm 2) and proportional cICA (Algorithm 3). We compare against cPCA and PCPCA, illustrating that the cICA algorithms recover the foreground patterns more accurately when the data are generated under the model (3), which assumes independence of the latent variables; see Figure 2. The details of the simulations are in Appendix A.1.
We see from Figure 2 that cICA outperforms cPCA and PCPCA in recovering the foreground patterns. Figure 2(a) shows that the interquartile range for cICA in Algorithm 2 is above the maximum cosine similarity results for cPCA and PCPCA. The best performing cICA has cosine similarity above 0.9 for all tested . Figure 2(b) shows analogous results with accuracy measured via relative Frobenius error. The variability in performance as changes is due to randomness in the matrix . Figures 2(c) and (d) show analogous results for proportional cICA from Algorithm 3, with hyperparameter learned from Theorem 4.5. The method outperforms cPCA and PCPCA, with the added benefit that no selection of hyperparameters is necessary.
6.1.2. Human and monkey gene expression data
We apply cICA to a dataset of human and monkey gene expression from [SCJ+23], in which the authors analyse human, chimp, gorilla, macaque, and marmoset datasets to identify genes that are responsible for evolutionary change. Out of 14131 genes, they identify 3383 genes with extensive differences between human and non-human primates, of which they identify a subset of 139 with deeply conserved co-expression across all non-human animals, and strongly divergent co-expression relationships in humans.
We select the 15 highest-variance genes among the 139 selected genes and the 15 highest-variance genes among the other genes. We combine 10000 chimp and 10000 gorilla data points to form the background dataset and 10000 human gene expression data points for the foreground dataset . Then we apply cICA as in Algorithm 2 and use (11) to order the and extract the first two vectors . We observe that, of the 15 genes with the highest absolute values in (resp. ), 10 (resp. 13) are among the 15 selected genes that come from the subset of 139 in [SCJ+23]. This demonstrates consistency with the results from [SCJ+23]: the vectors assign higher weights to the genes from the subset of 139. For details, see Appendix A.2.
6.2. Data visualization
We use cICA for dimensionality reduction and data visualization, as described in Section 5.3. We investigate the performance on three datasets: mouse protein expression, corrupted MNIST images, and gene expression, the same data studied in the papers [AZBZ17, LJE20]. We quantify the performance of the methods using the silhouette score [Rou87] of the projected data; higher values indicate better clustering of points.
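For concreteness, the silhouette score of a labelled two-dimensional projection can be computed as in the following sketch. This is a plain NumPy implementation of the standard definition from [Rou87] with Euclidean distances; it is not taken from the paper's code, the function name and toy points are hypothetical, and it assumes at least two clusters and no coincident points.

```python
import numpy as np

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance to the other points of the same cluster (dist[i, i] = 0).
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

# Toy projection: two tight groups that are far apart.
projected = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_score(projected, np.array([0, 0, 1, 1]))  # labels match the groups
bad = silhouette_score(projected, np.array([0, 1, 0, 1]))   # labels cut across the groups
```

Labels that agree with the visible groups give a score close to 1, while labels that cut across the groups give a negative score. This is the sense in which higher values indicate better clustering of the projected points.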
6.2.1. Mouse protein data
We study the mouse protein dataset from [HGC15]. The foreground data measure protein expression in the cortex of mice subjected to shock therapy, some of which have Down syndrome. The background dataset consists of protein expression measurements from mice without Down syndrome that did not receive shock therapy. We compare general cICA, proportional cICA, cPCA, and PCPCA. All four algorithms can separate the two clusters in the foreground data, corresponding to mice with Down syndrome and those without, though the projections differ. The general cICA algorithm has the highest silhouette score (0.606), followed by proportional cICA (0.604), then cPCA (0.421), and then PCPCA (0.220), see Figure 3. See Appendix A.3 for details.
6.2.2. Corrupted MNIST data
Next we explore the corrupted MNIST dataset from [AZBZ18]. The foreground data are digits 0, 1 from the MNIST dataset superimposed with strength 0.25 onto 5000 randomly selected grass images from ImageNet. The background data are the 5000 grass images. Each image has size . The projections are shown in Figure 4. All four algorithms separate the foreground data into clusters for the digits 0 and 1. The cPCA algorithm has the highest silhouette score (0.546), followed by proportional cICA (0.508), general cICA (0.451) and PCPCA (0.009). See Appendix A.4.
6.2.3. Transplant gene expression data
We study the single-cell RNA sequencing data from [ZTB+17]. The foreground data are gene expression measurements of bone marrow mononuclear cells from patients with acute myeloid leukemia before and after they received a stem-cell transplant; the background dataset contains gene expression measurements of healthy people. The projection plots of the four algorithms are shown in Figure 5. cPCA has the highest silhouette score (0.451), followed by proportional cICA (0.402), then general cICA (0.344), then PCPCA (0.164). See Appendix A.5 for details.
7. Summary
We have presented contrastive independent component analysis (cICA), a tool to explore patterns and visualize data in one setting relative to another. We designed algorithms for cICA based on a new hierarchical tensor decomposition that we introduce. We studied two variants: general and proportional cICA. The upside to general cICA is its higher expressivity: it is able to model background patterns that each contribute to the foreground in different relative amounts . The advantage of proportional cICA is that it is deterministic, based solely on recursive eigendecompositions. We used our algorithms to find contrastive patterns that describe a foreground dataset relative to a background, testing the results on synthetic and real-world datasets. We saw its potential to extract foreground patterns of interest and its competitiveness with other contrastive methods.
We investigated the identifiability of cICA, via the uniqueness of its associated coupled tensor decomposition, seeing identifiability improvements relative to cPCA and PCPCA. This echoes the improved identifiability of ICA over PCA: a general linear mixing can be recovered uniquely via ICA, whereas PCA requires an orthogonal mixing.
We conclude with two directions for further study. Our cICA model describes observations as a linear mixing of independent latent variables. Dropping the linearity assumption, we may seek patterns that have nonlinear signature across the observed variables. This would combine the nonlinear contrastive methods of [AZ19, SGN19, WBWL22, LHH+24] with approaches to find interpretable patterns, generalizing our vectors . Finally, dropping the independence assumption on the latent variables is also a promising direction for further study, which would connect cICA to other latent variable models such as those arising in causal disentanglement [YLC+21, SSBU23].
Acknowledgements
We thank Salil Bhate for helpful discussions. AM and AS were partially supported by the NSF (DMS-2306672 and DMR-2011754).
References
- [AZ19] Abubakar Abid and James Zou. Contrastive variational autoencoder enhances salient features. arXiv preprint arXiv:1902.04601, 2019.
- [AZBZ17] Abubakar Abid, Martin J Zhang, Vivek K Bagaria, and James Zou. Contrastive principal component analysis. arXiv preprint arXiv:1709.06716, 2017.
- [AZBZ18] Abubakar Abid, Martin J Zhang, Vivek K Bagaria, and James Zou. Exploring patterns enriched in a dataset with contrastive principal component analysis. Nature communications, 9(1):2134, 2018.
- [CJ10] Pierre Comon and Christian Jutten. Handbook of Blind Source Separation: Independent component analysis and applications. Academic press, 2010.
- [Com94] Pierre Comon. Independent component analysis, a new concept? Signal processing, 36(3):287–314, 1994.
- [COV17] Luca Chiantini, Giorgio Ottaviani, and Nick Vannieuwenhoven. On generic identifiability of symmetric tensors of subgeneric rank. Transactions of the American Mathematical Society, 369(6):4021–4042, 2017.
- [CS93] Jean-François Cardoso and Antoine Souloumiac. Blind beamforming for non-Gaussian signals. In IEE proceedings F (radar and signal processing), volume 140, pages 362–370. IET, 1993.
- [DLCC07] Lieven De Lathauwer, Joséphine Castaing, and Jean-François Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, 55:2965–2973, 2007.
- [DLDMV01] Lieven De Lathauwer, Bart De Moor, and Joos Vandewalle. Independent component analysis and (simultaneous) third-order tensor diagonalization. IEEE Transactions on Signal Processing, 49(10):2262–2271, 2001.
- [Dom18] Krzysztof Domino. The use of fourth order cumulant tensors to detect outlier features modelled by a t-student copula. arXiv preprint arXiv:1804.00541, 2018.
- [EK04] J. Eriksson and V. Koivunen. Identifiability, separability, and uniqueness of linear ICA models. IEEE Signal Processing Letters, 11(7):601–604, 2004.
- [GW20] Xiurui Geng and Lei Wang. NPSA: Nonorthogonal principal skewness analysis. IEEE Transactions on Image Processing, 29:6396–6408, 2020.
- [Hac12] Wolfgang Hackbusch. Tensor spaces and numerical tensor calculus, volume 42. Springer, 2012.
- [HGC15] Clara Higuera, Katheleen J Gardiner, and Krzysztof J Cios. Self-organizing feature maps identify proteins critical to learning in a mouse model of Down syndrome. PloS one, 10(6):e0129126, 2015.
- [HM16] Aapo Hyvärinen and Hiroshi Morioka. Unsupervised feature extraction by time-contrastive learning and nonlinear ICA. Advances in neural information processing systems, 29, 2016.
- [HST19] Aapo Hyvärinen, Hiroaki Sasaki, and Richard Turner. Nonlinear ICA using auxiliary variables and generalized contrastive learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 859–868. PMLR, 2019.
- [JA95] J. Alexander and A. Hirschowitz. Polynomial interpolation in several variables. Journal of Algebraic Geometry, 4(4), 1995.
- [KP19] Joe Kileel and Joao M Pereira. Subspace power method for symmetric tensor decomposition and generalized PCA. arXiv preprint arXiv:1912.04007, 2019.
- [LF22] Qi Lyu and Xiao Fu. On finite-sample identifiability of contrastive learning-based nonlinear independent component analysis. In International Conference on Machine Learning, pages 14582–14600. PMLR, 2022.
- [LHH+24] Romain Lopez, Jan-Christian Huetter, Ehsan Hajiramezanali, Jonathan Pritchard, and Aviv Regev. Toward the identifiability of comparative deep generative models. arXiv preprint arXiv:2401.15903, 2024.
- [LJE20] Didong Li, Andrew Jones, and Barbara Engelhardt. Probabilistic contrastive principal component analysis. arXiv preprint arXiv:2012.07977, 2020.
- [LM08] Lek-Heng Lim and Jason Morton. Cumulant component analysis: a simultaneous generalization of PCA and ICA. CASTA2008, 18, 2008.
- [McC18] Peter McCullagh. Tensor methods in statistics: Monographs on statistics and applied probability. Chapman and Hall/CRC, 2018.
- [Rob16] Elina Robeva. Orthogonal decomposition of symmetric tensors. SIAM Journal on Matrix Analysis and Applications, 37(1):86–102, 2016.
- [Rou87] Peter J Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics, 20:53–65, 1987.
- [SCJ+23] Hamsini Suresh, Megan Crow, Nikolas Jorstad, Rebecca Hodge, Ed Lein, Alexander Dobin, Trygve Bakken, and Jesse Gillis. Comparative single-cell transcriptomic analysis of primate brains highlights human-specific regulatory evolution. Nature Ecology & Evolution, 7(11):1930–1943, 2023.
- [SGN19] Kristen A Severson, Soumya Ghosh, and Kenney Ng. Unsupervised learning with contrastive latent variable models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4862–4869, 2019.
- [SRK09] Jussi Salmi, Andreas Richter, and Visa Koivunen. Sequential unfolding SVD for tensors with applications in array signal processing. IEEE Transactions on Signal Processing, 57(12):4719–4733, 2009.
- [SSBU23] Chandler Squires, Anna Seigal, Salil S Bhate, and Caroline Uhler. Linear causal disentanglement via interventions. In International Conference on Machine Learning, pages 32540–32560. PMLR, 2023.
- [SSDU24] Nils Sturma, Chandler Squires, Mathias Drton, and Caroline Uhler. Unpaired multi-domain causal representation learning. Advances in Neural Information Processing Systems, 36, 2024.
- [WBWL22] Ethan Weinberger, Nicasia Beebe-Wang, and Su-In Lee. Moment matching deep contrastive latent variable models. arXiv preprint arXiv:2202.10560, 2022.
- [Wey12] Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441–479, 1912.
- [WS24] Kexin Wang and Anna Seigal. Identifiability of overcomplete independent component analysis. arXiv preprint arXiv:2401.14709, 2024.
- [YLC+21] Mengyue Yang, Furui Liu, Zhitang Chen, Xinwei Shen, Jianye Hao, and Jun Wang. CausalVAE: Disentangled representation learning via neural structural causal models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9593–9602, 2021.
- [ZHPA13] James Y Zou, Daniel J Hsu, David C Parkes, and Ryan P Adams. Contrastive learning using spectral methods. Advances in Neural Information Processing Systems, 26, 2013.
- [ZTB+17] Grace XY Zheng, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, Tobias D Wheeler, Geoff P McDermott, Junjie Zhu, et al. Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1):14049, 2017.
Appendix A Details of numerical experiments
All experiments are run on an Apple M2 Pro with 16 GB memory. Each run of each algorithm takes at most 1 minute.
A.1. Finding patterns from synthetic data
We describe the details of the synthetic data setup in Section 6.1.1 that produced Figure 2. We consider . Our samples come from the distributions (3), where matrices and are random with unit vector columns, and the columns of are assumed to be orthogonal. We assume orthogonality of the columns of to facilitate comparison with the methods cPCA and PCPCA, which require this assumption.
For testing Algorithm 2 in Figure 2(a) and (b), variables are exponential distributions where when is odd and when is even. Variables and are exponential distributions where when is odd and when is even. We generate datapoints for both the foreground and background data and apply cICA to the sample cumulant tensors. cICA has randomness coming from the subspace power method. We apply our algorithm 100 times and get 100 recovered foreground mixing matrices .
For testing Algorithm 3 in Figure 2(c) and (d), we let be exponential distributions where . We learn the hyperparameter via Theorem 4.5. The true is 1 and the recovered are all in the range .
We describe the implementation of the two methods we compare to. For cPCA [AZBZ17], we test 100 log-evenly spaced hyperparameters between 0 and 1000 with components. Each run returns a matrix of size , whose columns are contrastive principal components with norm 1. For PCPCA, we test 100 evenly spaced hyperparameters between 0 and 0.9 and fix components. Each run returns a matrix of size . We normalize the columns to unit norm, to compare PCPCA with the other algorithms.
Since the columns of that are recovered are only unique up to permutation and sign, we describe how to align the outputs. Let be a recovered matrix. Rather than searching over all ways to match the columns of to those of , we use a greedy algorithm to approximate the matching, as follows. We fix the first column of , denoted . We choose the column of whose cosine similarity with has largest absolute value. We set this to be the first column of , changing its sign if the cosine similarity is negative. Then we select, among the remaining columns, the one with the largest absolute cosine similarity with and set this as the second column of (again, changing the sign if the cosine similarity is negative). We continue until we reach the last column. Finally, we compute the relative Frobenius error and the mean cosine similarity between the aligned recovered matrix and the true matrix.
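The greedy alignment and the two accuracy measures can be sketched as follows. This is an illustration rather than the authors' code: the function and variable names are hypothetical, the error measures are written in their standard forms (Frobenius norm of the difference divided by the Frobenius norm of the true matrix, and the average over columns of the cosine between matched columns), and both matrices are assumed to have the same shape.

```python
import numpy as np

def greedy_align(A_true, A_rec):
    """Greedily permute and sign-flip the columns of A_rec to match A_true."""
    def unit(M):
        return M / np.linalg.norm(M, axis=0, keepdims=True)

    # cos[i, j] is the cosine similarity between true column i and recovered column j.
    cos = unit(A_true).T @ unit(A_rec)
    remaining = list(range(A_rec.shape[1]))
    aligned = np.zeros_like(A_true, dtype=float)
    for i in range(A_true.shape[1]):
        # Pick the unused recovered column with largest absolute cosine similarity.
        j = max(remaining, key=lambda c: abs(cos[i, c]))
        sign = -1.0 if cos[i, j] < 0 else 1.0  # flip sign if the similarity is negative
        aligned[:, i] = sign * A_rec[:, j]
        remaining.remove(j)
    return aligned

def relative_frobenius_error(A_true, A_aligned):
    return np.linalg.norm(A_aligned - A_true) / np.linalg.norm(A_true)

def mean_cosine_similarity(A_true, A_aligned):
    num = np.sum(A_true * A_aligned, axis=0)
    den = np.linalg.norm(A_true, axis=0) * np.linalg.norm(A_aligned, axis=0)
    return float(np.mean(num / den))

# Toy check: a permuted and sign-flipped copy of the true matrix.
rng = np.random.default_rng(0)
A_true = rng.normal(size=(6, 3))
A_true /= np.linalg.norm(A_true, axis=0)                     # unit-norm columns
A_rec = A_true[:, [2, 0, 1]] * np.array([-1.0, 1.0, -1.0])   # shuffle and flip signs
A_aligned = greedy_align(A_true, A_rec)
err = relative_frobenius_error(A_true, A_aligned)
cos_sim = mean_cosine_similarity(A_true, A_aligned)
```

The greedy matching costs on the order of the square of the number of columns, rather than the factorial cost of searching over all permutations. It can return a suboptimal matching when two recovered columns are both close to the same true column.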
A.2. Finding patterns from gene expression data
We describe the patterns obtained from the comparison of human and monkey gene expression in Section 6.1.2. The selected 15 highest variance genes among the 139 selected genes in [SCJ+23] are EIF3K, NDUFA13, SARNP, MYL10, TAF9, PRCD, BBS5, MRPS14, RING1, AGPAT5, FLOT1, BTBD7, MASTL, KANK1, BDP1. The 15 highest variance genes among the remaining genes are LUC7L3, RBKS, RBM7, AP4S1, CLCN1, CLASP1, ADTRP, CNNM3, NDUFAF7, CNIH4, RPUSD2, NELFCD, RPP14, ROMO1, RNF181.
We use the plots of the eigenvalues of the flattenings of to choose and . The top two foreground patterns are:
where the coordinates are labelled by the 30 genes in the order listed above. The 15 genes with the largest absolute values of the top foreground pattern include 10 genes among the 139 selected in [SCJ+23]. The 15 genes with the largest absolute values of the second foreground pattern include 13 genes from [SCJ+23]. Therefore, the foreground patterns obtained via cICA demonstrate consistency with the finding in [SCJ+23] that this subset of 139 genes captures human-specific information.
A.3. Mouse protein data
There are 270 foreground samples. These are protein expression measurements from the cortex of mice subjected to shock therapy. Of these samples, 135 come from mice with Down syndrome and 135 do not. There are 135 background samples, protein expression measurements from mice without Down syndrome that did not receive shock therapy. Each sample measures the expression of 77 proteins; that is, .
For cICA, we preprocess using PCA as described in Section 5.1. We take components, which explain of the variance. We then choose and , as described in Section 5.2. That is, we compute the eigenvalues of and , ranking the eigenvalues by magnitude, see Figure 6. Based on these plots, we choose and .
For general cICA, we fix the random seed to be 0. For proportional cICA, we run the algorithm for 100 log-evenly spaced between 0 and . The highest silhouette score is obtained at .
We run cPCA for 100 values between 0 and 1000. These are the default values of in the code of [AZBZ17]. Note that our parameters for proportional cICA are the squares of the cPCA parameters, since if , then and . We plot the choice with the highest silhouette score, which was achieved for .
We run PCPCA for 100 evenly spaced values between 0 and . Here, 270 and 135 are the numbers of samples in the foreground and background datasets, respectively. Such choices of are in accordance with the setup in [LJE20] and are sufficient to find the highest silhouette score. The best score was obtained when . In [LJE20], the authors take a further step to scale the probabilistic contrastive principal components, before calculating the silhouette score. The silhouette score obtained after this additional step is 0.450.
A.4. Corrupted MNIST data
For the hyperparameters of cICA, we choose the number of components to be 30, which explains of the variance. We then choose for general cICA and for proportional cICA. We order the eigenvalues of and according to their absolute values and plot parts of the ordered eigenvalues in Figure 7. Based on these plots, we choose and .
We fix the random seed to be 0 for general cICA. The visualization for the general cICA algorithm is in Figure 4(a) and the silhouette score is 0.451.
For proportional cICA, we run the algorithm for 100 log-evenly spaced values between 0 and . The highest silhouette score is 0.508, obtained when . The plot is Figure 4(b).
For cPCA, we plot the first two cPCA components. As above, we run cPCA for 100 values between 0 and 1000. The highest silhouette score is 0.546, obtained when . The plot is Figure 4(c).
We run PCPCA for 100 evenly spaced between 0 and , in accordance with the setup in [LJE20]. The best silhouette score for the plot of the first two probabilistic contrastive principal components is 0.009, obtained when . If we normalize the probabilistic contrastive principal components before calculating the silhouette score, the silhouette score obtained is 0.386. The plot is Figure 4(d).
A.5. Transplant gene expression data
There are 7525 pre-transplant patients and 4874 post-transplant patients in the foreground dataset. The background dataset consists of 4457 healthy patients. Each sample contains gene expression measurements of bone marrow mononuclear cells. We preprocess the data by log-transforming and subsetting to the 500 most variable genes, in accordance with previous analyses on these data [ZTB+17, AZBZ18, LJE20].
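The preprocessing step can be sketched as follows. This is a hedged illustration, not the pipeline used in [ZTB+17, AZBZ18, LJE20]: it assumes a `log(1 + x)` transform to handle zero counts and assumes that gene variances are computed on the pooled foreground and background data. The function name and the toy count matrices are hypothetical.

```python
import numpy as np

def log_transform_and_select(counts_fg, counts_bg, n_genes=500):
    """Log-transform two count matrices and keep the most variable genes.

    counts_fg, counts_bg: (cells, genes) arrays with the same gene columns.
    Assumptions: log1p transform; variance is computed on the pooled data.
    """
    log_fg = np.log1p(counts_fg)
    log_bg = np.log1p(counts_bg)
    variances = np.var(np.vstack([log_fg, log_bg]), axis=0)
    # Indices of the n_genes highest-variance genes, in original column order.
    keep = np.sort(np.argsort(-variances)[:n_genes])
    return log_fg[:, keep], log_bg[:, keep], keep

# Toy data: 40 genes with different expression rates, 30 foreground and 20 background cells.
rng = np.random.default_rng(0)
rates = rng.uniform(0.1, 20.0, size=40)
fg_counts = rng.poisson(lam=rates, size=(30, 40))
bg_counts = rng.poisson(lam=rates, size=(20, 40))
fg_sel, bg_sel, keep = log_transform_and_select(fg_counts, bg_counts, n_genes=10)
```

Selecting the same gene columns for both datasets keeps the foreground and background in a common coordinate system, which the contrastive methods compared here require.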
For the hyperparameters of cICA and HTD, we choose the number of components to be 15, which explains of the variance. We then choose for cICA and for proportional cICA. We order the eigenvalues of and according to their absolute values and plot parts of the ordered eigenvalues in Figure 8. We choose and .
We fix the random seed to be 0 for cICA. The visualization for the cICA algorithm is in Figure 5(a) and the silhouette score is 0.344.
For proportional cICA, we run the algorithm for 100 log-evenly spaced values between 0 and . The highest silhouette score is 0.402, obtained when . The plot is Figure 5(b).
For cPCA, we plot the first two cPCA components. As above, we run cPCA using 100 values between 0 and 1000, the default values from [AZBZ17]. The highest silhouette score is 0.457, obtained when . The plot is Figure 5(c).
We run PCPCA for 100 evenly spaced between 0 and , in accordance with [LJE20]. The numbers 12399 and 4457 are the sample sizes of the foreground and background datasets, respectively. In accordance with the experiment in [AZBZ17], we run PCPCA with 4 components. The best silhouette score over any and any pair of probabilistic contrastive principal components is 0.164, obtained when using the third and fourth components. If we normalize the probabilistic contrastive principal components and then calculate the silhouette score, the score obtained is 0.184. The plot is Figure 5(d).