1 Introduction

Principal component analysis (PCA) and factor analysis (FA) can be found in any textbook of multivariate statistical analysis; see e.g. Anderson (1984). Both methods have histories of over one hundred years. As basic dimensionality reduction methods, PCA and FA reduce the dimension of the original random vector via eigenvectors of the covariance matrix and interpretable factors, respectively. Their variants and applications have been extensively studied in the literature (Jolliffe 2002; Bartholomew, Knott, and Moustaki 2011).

Generally speaking, PCA is model-free, whereas FA is based on the factor model. There are some known connections between PCA and FA. Tipping and Bishop (1999) used a special factor model to define a probabilistic PCA method. PCA is often used as the starting point in the iterations for computing maximum likelihood estimates under the factor model (Jolliffe 2002). Overall, however, PCA and FA are viewed as two relatively separate topics. There is no work on more general and closer connections between the two fundamental methods. In this paper we introduce a framework that unifies them within a general latent variable model. We show that they can be viewed as solutions that minimize specific loss functions under this model. Specifically, PCA can be derived from a broad class of loss functions including the $\ell_{2}$ norm, while FA corresponds to a modified $\ell_{0}$ norm loss. The other main contributions of this paper include:

: (a) We show that PCA can be derived from a penalized maximum likelihood estimation problem. When treating the penalty as a Bayesian prior, we obtain a Bayesian version of PCA, which can be viewed as a new probabilistic PCA model.
: (b) We propose a penalized least squares method for FA, which performs well in terms of dimensionality reduction under unidentifiable factor models. The corresponding iterative algorithm is also given.
: (c) We present a combination of PCA loss and FA loss, and construct the corresponding path between PCA and FA.
: (d) We introduce a principal component factor model that links FA to PCA more closely.

To the best of our knowledge, the above findings have not appeared in the literature. We expect that these findings can not only help us better understand PCA and FA, but also lead to new data analysis tools and research topics.

The rest of this paper is organized as follows. Section 2 briefly reviews PCA and FA. Section 3 introduces the general latent variable model and related optimization problems. Sections 4 and 5 present the objective functions corresponding to PCA and FA, respectively. Section 6 discusses combinations of PCA and FA. Section 7 provides numerical illustrations. We conclude the paper with some discussion in Section 8. All proofs are given in the Appendix.

2 Review of principal component analysis and factor analysis

In this section we present a brief review of principal component analysis (PCA) and factor analysis (FA). Some notation and definitions are needed. For a vector or matrix $\mathbf{a}$ , let $\mathbf{a}^{\prime}$ denote its transpose. For two symmetric matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, write $\mathbf{A}\leqslant$ $(\geqslant)$ $\mathbf{B}$ if $\mathbf{B}-\mathbf{A}$ ( $\mathbf{A}-\mathbf{B}$ ) is nonnegative definite, and write $\mathbf{A}<$ $(>)$ $\mathbf{B}$ if $\mathbf{B}-\mathbf{A}$ ( $\mathbf{A}-\mathbf{B}$ ) is positive definite. Let $\mathbf{0}$ denote the zero matrix.

Consider a random $p$ -vector $\boldsymbol{X}$ whose covariance matrix $\mathrm{Cov}(\boldsymbol{X})=\mathbf{\Sigma}>\mathbf{0}$ . Without loss of generality, let its expectation $\mathrm{E}(\boldsymbol{X})=\boldsymbol{0}$ . PCA of $\boldsymbol{X}$ is based on the eigenvalue decomposition of $\mathbf{\Sigma}$ , denoted by $\mathbf{\Sigma}=\mathbf{R}\mathbf{\Lambda}\mathbf{R}^{\prime}$ , where $\mathbf{R}=(\boldsymbol{r}_{1},\ldots,\boldsymbol{r}_{p})$ is an orthogonal matrix and $\mathbf{\Lambda}$ is a diagonal matrix $\mathrm{diag}(\lambda_{1},\ldots,\lambda_{p})$ with $\lambda_{1}\geqslant\cdots\geqslant\lambda_{p}>0$ . The $k$ th principal component of $\boldsymbol{X}$ is $\boldsymbol{r}_{k}^{\prime}\boldsymbol{X}$ , whose variance is $\lambda_{k}$ . Partition $\mathbf{R}$ as $\mathbf{R}=(\mathbf{R}_{1}\ \mathbf{R}_{2})$ , where $\mathbf{R}_{1}\in\mathbb{R}^{p\times q}$ . Let $\mathbf{\Lambda}_{1}=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{q})$ . The cumulative proportion of the variance of $\boldsymbol{X}$ explained by the first $q$ principal components $\mathbf{R}_{1}^{\prime}\boldsymbol{X}$ is defined as $(\lambda_{1}+\cdots+\lambda_{q})/(\lambda_{1}+\cdots+\lambda_{p})=\mathrm{trace}\left(\mathbf{\Lambda}_{1}\right)/\mathrm{trace}\left(\mathbf{\Sigma}\right)=\mathrm{trace}\left(\mathrm{Cov}(\mathbf{R}_{1}^{\prime}\boldsymbol{X})\right)/\mathrm{trace}\left(\mathrm{Cov}(\boldsymbol{X})\right)$ . If the cumulative proportion is sufficiently large, say, at least $85\%$ , then we can say that $\boldsymbol{X}$ is well represented by the first $q$ principal components. Hence PCA is a linear dimensionality reduction technique. In practice, PCA can be implemented by decomposing an estimate of $\mathbf{\Sigma}$ , say, the sample covariance matrix computed from observations of $\boldsymbol{X}$ .
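As a small illustration of the quantities just defined, the following NumPy sketch computes the eigenvalue decomposition of a covariance matrix and the cumulative proportion for a given $q$ . The function names `pca_from_covariance` and `cumulative_proportion` are hypothetical and chosen only for this example; the $2\times 2$ matrix is the covariance matrix (I) of Section 7.1.

```python
import numpy as np

def pca_from_covariance(sigma):
    """Return eigenvalues (descending) and matching eigenvectors of a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(sigma)  # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def cumulative_proportion(eigvals, q):
    """(lambda_1 + ... + lambda_q) / (lambda_1 + ... + lambda_p)."""
    return float(eigvals[:q].sum() / eigvals.sum())

# Illustrative covariance matrix: matrix (I) of Section 7.1.
sigma = np.array([[2.0, 1.0], [1.0, 3.0]])
lam, R = pca_from_covariance(sigma)

prop_q1 = cumulative_proportion(lam, 1)                # share of variance kept by the first component
first_pc_variance = float(R[:, 0] @ sigma @ R[:, 0])   # variance of r_1' X, which equals lambda_1
```

For this matrix the eigenvalues are $(5\pm\sqrt{5})/2$ , so a single component stays below the $85\%$ rule of thumb mentioned above.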

FA of $\boldsymbol{X}$ is based on the following factor model,

\boldsymbol{X}=\mathbf{A}\boldsymbol{U}+\boldsymbol{\varepsilon},

(1)

where the factor loading matrix $\mathbf{A}$ is a $p\times q$ constant matrix of full column rank, $1\leqslant q<p$ , the vector of common factors $\boldsymbol{U}\sim N_{q}(\boldsymbol{0},\mathbf{I}_{q})$ is independent of $\boldsymbol{\varepsilon}\sim N_{p}(\boldsymbol{0},\mathbf{V})$ , $\mathbf{I}_{q}$ is the identity matrix of order $q$ , and $\mathbf{V}=\mathrm{diag}(v_{1},\ldots,v_{p})$ with $v_{i}\geqslant 0,\ i=1,\ldots,p$ . We have $\mathrm{Cov}(\boldsymbol{X})=\mathbf{\Sigma}=\mathbf{A}\mathbf{A}^{\prime}+\mathbf{V}>\mathbf{0}$ . The main purpose of FA is to estimate $\mathbf{A}$ , or $\mathbf{A}\mathbf{A}^{\prime}$ , which yields a low-dimensional representation of $\boldsymbol{X}$ . Note that the factor model (1) can be unidentifiable. Usually we need to make the following assumption.

Assumption 1.

Model (1) is identifiable (Shapiro 1985), i.e., there does not exist a nonnegative definite diagonal matrix $\tilde{\mathbf{V}}$ , different from $\mathbf{V}$ , such that $\mathbf{\Sigma}-\tilde{\mathbf{V}}\geqslant\mathbf{0}$ and $\mathrm{rank}(\mathbf{\Sigma}-\tilde{\mathbf{V}})\leqslant\mathrm{rank}(\mathbf{\Sigma}-\mathbf{V})$ .

Next we provide some comparisons between PCA and FA. A major distinction between them is that there is a definite model underlying FA, while PCA does not require a model. Tipping and Bishop (1999) proposed a probabilistic model for PCA, which is actually a special case of (1) with $v_{1}=\cdots=v_{p}$ ; see also Roweis (1997). This model cannot cover the broad range of applications of PCA. An advantage of FA is its interpretability, because the common factors $\boldsymbol{U}$ in (1) usually have real meanings, and this is the main reason why FA is so popular in various fields. In addition, PCA is often used in computations of FA (Jolliffe 2002; Bartholomew, Knott, and Moustaki 2011). For detailed reviews of PCA and FA, we refer readers to several monographs and review papers, e.g., Bartholomew, Knott, and Moustaki (2011), Jolliffe and Cadima (2016), and Ghojogh, Crowley, and Karray et al. (2023).

3 A latent variable model and related matrix optimization problems

Consider the following latent variable model that generalizes (1),

\boldsymbol{X}=\mathbf{A}\boldsymbol{U}+\boldsymbol{\varepsilon}.

(2)

As in (1), $\mathbf{A}$ is a $p\times q$ constant matrix of full column rank, $q<p$ , and $\boldsymbol{U}\sim N_{q}(\boldsymbol{0},\mathbf{I}_{q})$ is an unobserved latent vector independent of $\boldsymbol{\varepsilon}\sim N_{p}(\boldsymbol{0},\mathbf{V}),\ \mathbf{V}\geqslant\mathbf{0}$ . Generally, $\mathbf{V}$ is not restricted to be diagonal. However, we sometimes restrict it to be a diagonal or other special matrix according to the problem. Write $\mathbf{T}=\mathbf{A}\mathbf{A}^{\prime}\geqslant\mathbf{0}$ , and thus $\mathrm{rank}(\mathbf{T})=q$ . By (2), $\mathrm{Cov}(\boldsymbol{X})=\mathbf{\Sigma}=\mathbf{A}\mathbf{A}^{\prime}+\mathbf{V}=\mathbf{T}+\mathbf{V}$ . Unlike under the factor model, any $\mathbf{\Sigma}$ can be written in this form. Clearly the model (2) is unidentifiable, and we can specify $\mathbf{T}$ and $\mathbf{V}$ by minimizing various loss functions. In the following, $\mathbf{\Sigma}$ also denotes an estimate of the population covariance matrix when only observations of $\boldsymbol{X}$ are available.
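The covariance identity $\mathbf{\Sigma}=\mathbf{A}\mathbf{A}^{\prime}+\mathbf{V}$ under model (2) can be checked by simulation. The sketch below is only an illustration under stated assumptions: the loading matrix is borrowed from example (II) of Section 7.1, while the non-diagonal $\mathbf{V}$ , the sample size, and the random seed are arbitrary choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
p, q, n = 4, 2, 200_000

# Loading matrix taken from example (II); any full-column-rank p x q matrix would do.
A = np.array([[1.0, -1.0], [-1.0, 0.0], [0.0, 2.0], [1.0, -1.0]])

# A nonnegative definite V that is NOT diagonal, as model (2) allows.
B = np.sqrt(0.1) * rng.normal(size=(p, p))
V = B @ B.T

U = rng.normal(size=(n, q))                  # latent vectors, U ~ N_q(0, I_q)
eps = rng.normal(size=(n, p)) @ B.T          # noise with Cov(eps) = B B' = V
X = U @ A.T + eps                            # rows are draws of X = A U + eps

sigma_model = A @ A.T + V                    # T + V
sigma_hat = X.T @ X / n                      # sample covariance, using E(X) = 0
max_abs_err = float(np.abs(sigma_hat - sigma_model).max())
```

With this sample size the entries of the sample covariance should agree with $\mathbf{T}+\mathbf{V}$ up to Monte Carlo error.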

We consider two classes of methods for specifying $\mathbf{T}$ and $\mathbf{V}$ . One is to first specify $\mathbf{T}$ by solving the following matrix optimization problem,

		$\displaystyle\min_{\mathbf{T}}\ F_{1}(\mathbf{T}),$		(3)
		$\displaystyle{\text{s.t.}}\ \mathbf{0}\leqslant\mathbf{T}\leqslant\mathbf{\Sigma},\ \mathrm{rank}(\mathbf{T})=q,$		(3)

where $F_{1}$ is an objective function defined on the set of all nonnegative definite matrices. Consequently, we specify $\mathbf{V}$ as $\mathbf{V}^{*}=\mathbf{\Sigma}-\mathbf{T}^{*}$ (or other strategies, since $\mathbf{V}$ is less important), where $\mathbf{T}^{*}$ is the solution to (3). The other is to simultaneously specify $\mathbf{T}$ and $\mathbf{V}$ without the requirement $\mathbf{T}+\mathbf{V}=\mathbf{\Sigma}$ through solving the two-matrix optimization problem,

		$\displaystyle\min_{\mathbf{T},\mathbf{V}}\ F_{2}(\mathbf{T},\mathbf{V}),$		(4)
		$\displaystyle{\text{s.t.}}\ \mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,$		(4)

where $F_{2}$ is an objective function.

The two-matrix problem can yield a one-matrix problem by taking

F_{1}(\mathbf{T})=F_{2}(\mathbf{T},\mathbf{V}^{*}(\mathbf{T}))

(5)

in (3), where $\mathbf{V}^{*}(\mathbf{T})$ is the minimizer of $F_{2}(\mathbf{T},\mathbf{V})$ over $\mathbf{V}$ for given $\mathbf{T}$ . Conversely, one-matrix problems induce two-matrix problems under the following assumption. Let $h$ be a function defined on the set of all nonnegative definite matrices.

Assumption 2.

For all $\mathbf{S}\geqslant\mathbf{0}$ , $h(\mathbf{S})\geqslant h(\mathbf{\Sigma})$ .

Theorem 1.

Let $\mathbf{T}^{*}$ be the solution to (3). If Assumption 2 holds, then $(\mathbf{T}^{*},\mathbf{\Sigma}-\mathbf{T}^{*})$ is the solution to (4) with $F_{2}(\mathbf{T},\mathbf{V})=h(\mathbf{T}+\mathbf{V})+F_{1}(\mathbf{T})$ .

In the following two sections we will show that PCA and FA can be derived by (3) or (4) with specified loss/objective functions.

4 Objective functions corresponding to PCA

4.1 One-matrix optimization

From the viewpoint of dimensionality reduction, it is required that the distribution of $\boldsymbol{X}$ can be well approximated by that of $\mathbf{A}\boldsymbol{U}$ , i.e., $\mathbf{T}$ should be close to $\mathbf{\Sigma}$ . In other words, we need to find a low-rank approximation of $\mathbf{\Sigma}$ . This inspires us to consider the objective function $F_{1}(\mathbf{T})=f(\mathbf{\Sigma}-\mathbf{T})$ , where $f$ is a function defined on the set of all nonnegative definite matrices that evaluates the distance between a nonnegative definite matrix and the zero matrix. To induce PCA, we impose the following assumptions on $f$ .

Assumption 3.

For all $\mathbf{S}\geqslant\mathbf{0}$ , $f(\mathbf{S})\geqslant 0$ and $f(\mathbf{S})=0$ if and only if $\mathbf{S}=\mathbf{0}$ .

Assumption 4.

For all $\mathbf{S}\geqslant\mathbf{0}$ and orthogonal matrix $\mathbf{R}$ of order $p$ , $f(\mathbf{R}\mathbf{S}\mathbf{R}^{\prime})=f(\mathbf{S})$ .

Assumption 5.

For all $\mathbf{S}=(s_{ij})_{i,j=1,\ldots,p}\geqslant\mathbf{0}$ , $f(\mathbf{S})\geqslant f\left(\mathrm{diag}(s_{11},\ldots,s_{pp})\right)$ .

Assumption 6.

The function $g(\boldsymbol{x})=f\left(\mathrm{diag}(\boldsymbol{x})\right),\ \boldsymbol{x}\in[0,+\infty)^{p}$ , is nondecreasing in each variable.

Consider the following $f$ : the trace $f(\mathbf{S})=\mathrm{trace}(\mathbf{S})$ , the $\ell_{2}$ norm (also called spectral norm) $f(\mathbf{S})=\|\mathbf{S}\|_{2}$ which is the largest eigenvalue of $\mathbf{S}$ , and the Frobenius norm $f(\mathbf{S})=\|\mathbf{S}\|_{F}=(\mathrm{trace}(\mathbf{S}^{2}))^{1/2}$ .

Lemma 1.

The $\ell_{2}$ norm $f(\mathbf{S})=\|\mathbf{S}\|_{2}$ satisfies Assumption 5.

By Lemma 1 and simple algebra, $f={\mathrm{trace}}(\mathbf{S}),\ \|\mathbf{S}\|_{2}$ , and $\|\mathbf{S}\|_{F}$ satisfy Assumptions 3-6. Actually, Assumptions 4-6 are technical, designed only for the following proofs. Besides the above $f$ ’s and their combinations such as $f(\mathbf{S})=\|\mathbf{S}\|_{2}+\|\mathbf{S}\|_{F}+{\mathrm{trace}}(\mathbf{S})$ , it is hard to find other $f$ to satisfy the assumptions.

Theorem 2.

Under Assumptions 3-6, the solution to (3) with $F_{1}(\mathbf{T})=f(\mathbf{\Sigma}-\mathbf{T})$ is $\mathbf{T}^{*}=\mathbf{R}_{1}\mathbf{\Lambda}_{1}\mathbf{R}^{\prime}_{1}$ .

Theorem 2 indicates that PCA can be derived from the optimization problem (3) with a broad class of objective functions. Taking $f(\mathbf{S})={\mathrm{trace}}(\mathbf{S})$ corresponds to the common concept of PCA, variance maximization. Besides, other objective functions that do not satisfy the above assumptions can yield PCA. For example, let $f(\mathbf{S})={\mathrm{rank}}(\mathbf{S})$ , which does not satisfy Assumption 5. Note that ${\mathrm{rank}}(\mathbf{\Sigma}-\mathbf{T})\geqslant{\mathrm{rank}}(\mathbf{\Sigma})-{\mathrm{rank}}(\mathbf{T})=p-q={\mathrm{rank}}(\mathbf{\Sigma}-\mathbf{T}^{*})$ . The solution to (3) with $f(\mathbf{S})={\mathrm{rank}}(\mathbf{S})$ is $\mathbf{T}^{*}=\mathbf{R}_{1}\mathbf{\Lambda}_{1}\mathbf{R}^{\prime}_{1}$ .
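Theorem 2 can be illustrated numerically, although a random search is of course no substitute for the proof. The sketch below draws random feasible matrices of the form $\mathbf{T}=\mathbf{\Sigma}^{1/2}\mathbf{P}\mathbf{\Sigma}^{1/2}$ , with $\mathbf{P}$ a rank- $q$ orthogonal projection (so that $\mathbf{0}\leqslant\mathbf{T}\leqslant\mathbf{\Sigma}$ ), and checks that none of them has a smaller loss than $\mathbf{R}_{1}\mathbf{\Lambda}_{1}\mathbf{R}_{1}^{\prime}$ for the trace, spectral, and Frobenius losses. The helper names, the sampling scheme, and the use of matrix (II) from Section 7.1 are assumptions made for this example.

```python
import numpy as np

def pca_solution(sigma, q):
    """T* = R_1 Lambda_1 R_1' built from the q largest eigenpairs of sigma."""
    vals, vecs = np.linalg.eigh(sigma)
    idx = np.argsort(vals)[::-1][:q]
    r1 = vecs[:, idx]
    return r1 @ np.diag(vals[idx]) @ r1.T

def random_feasible(sigma, q, rng):
    """Random T with 0 <= T <= sigma and rank q: T = sigma^{1/2} P sigma^{1/2}."""
    vals, vecs = np.linalg.eigh(sigma)
    root = vecs @ np.diag(np.sqrt(vals)) @ vecs.T
    basis, _ = np.linalg.qr(rng.normal(size=(sigma.shape[0], q)))
    return root @ basis @ basis.T @ root      # basis @ basis.T is a rank-q projection

losses = {
    "trace": lambda s: np.trace(s),
    "spectral": lambda s: np.linalg.norm(s, 2),
    "frobenius": lambda s: np.linalg.norm(s, "fro"),
}

# Positive definite covariance matrix (II) of Section 7.1, with q = 2.
sigma = np.array([[3.0, -1.0, -2.0, 2.0],
                  [-1.0, 2.0, 0.0, -1.0],
                  [-2.0, 0.0, 4.0, -2.0],
                  [2.0, -1.0, -2.0, 2.0]])
q = 2
rng = np.random.default_rng(0)
t_star = pca_solution(sigma, q)

violations = 0
for name, f in losses.items():
    best = f(sigma - t_star)
    for _ in range(200):
        t = random_feasible(sigma, q, rng)
        if f(sigma - t) < best - 1e-9:        # a feasible T beating the PCA solution
            violations += 1
```

The sampling only covers part of the feasible set (those $\mathbf{T}$ generated by projections), so the check is indicative rather than exhaustive.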

Interestingly, if we expand the domain of $f$ to the set of all symmetric matrices, then PCA also corresponds to the solution to the optimization problem (3) without the constraint $\mathbf{T}\leqslant\mathbf{\Sigma}$ , i.e.,

		$\displaystyle\min_{\mathbf{T}}\ f(\mathbf{\Sigma}-\mathbf{T}),$		(6)
		$\displaystyle{\text{s.t.}}\ \mathbf{T}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,$		(6)

for $f$ satisfying the following further assumption.

Assumption 7.

For all symmetric $\mathbf{S}$ , $f(-\mathbf{S})=f(\mathbf{S})$ .

The following theorem can be proven similarly to Theorem 2.

Theorem 3.

Under Assumptions 3-7, the solution to (6) is $\mathbf{T}^{*}=\mathbf{R}_{1}\mathbf{\Lambda}_{1}\mathbf{R}^{\prime}_{1}$ .

It is clear that $f(\mathbf{S})=\|\mathbf{S}\|_{2}$ and $\|\mathbf{S}\|_{F}$ satisfy Assumption 7 while ${\mathrm{trace}}(\mathbf{S})$ does not. The special case with $f(\mathbf{S})=\|\mathbf{S}\|_{F}$ of Theorem 3 was proven by Srebro and Jaakkola (2003) with matrix derivatives.

4.2 Penalized maximum likelihood estimation

In this subsection we derive PCA by the likelihood principle. The negative log-likelihood (up to a constant) under (2) is

l(\mathbf{T},\mathbf{V})=\log(|\mathbf{T}+\mathbf{V}|)+{\mathrm{trace}}\left((\mathbf{T}+\mathbf{V})^{-1}\mathbf{\Sigma}\right),

(7)

and thus we can adopt $l$ as the objective function in the two-matrix optimization problem (4). Clearly, this problem is not well defined since any $(\mathbf{T},\mathbf{V})$ satisfying $\mathbf{T}+\mathbf{V}=\mathbf{\Sigma}$ is a solution. To derive PCA, we add a penalty and let

F_{2}(\mathbf{T},\mathbf{V})=l(\mathbf{T},\mathbf{V})+\lambda\,f(\mathbf{\Sigma}-\mathbf{T}),

(8)

where $\lambda>0$ is a pre-specified parameter.

Taking $h(\mathbf{S})=\log(|\mathbf{S}|)+{\mathrm{trace}}\left(\mathbf{S}^{-1}\mathbf{\Sigma}\right)$ in Theorem 1 and applying Theorem 2, we obtain the following result.

Theorem 4.

Under Assumptions 3-6, for all $\lambda>0$ , the solution to (4) with $F_{2}$ in (8) is $(\mathbf{T}^{*},\mathbf{V}^{*})=(\mathbf{R}_{1}\mathbf{\Lambda}_{1}\mathbf{R}^{\prime}_{1},\mathbf{R}_{2}\mathbf{\Lambda}_{2}\mathbf{R}^{\prime}_{2})$ , where $\mathbf{\Lambda}_{2}=\mathrm{diag}(\lambda_{q+1},\ldots,\lambda_{p})$ .

Theorem 4 indicates that PCA can be viewed as a penalized maximum likelihood estimate. We can also view PCA as a Bayesian posterior mode under the following priors

(\mathbf{T},\mathbf{V})\propto\exp\left[-\lambda\,f(\mathbf{\Sigma}-\mathbf{T})\right]\cdot I(\mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q),

(9)

where $I$ denotes the indicator function. As a result we obtain a Bayesian framework for PCA that allows uncertainty quantification. Unlike the probabilistic PCA of Tipping and Bishop (1999), this probabilistic PCA framework accommodates all covariance matrices.
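To make the objective in (8) concrete, the following sketch evaluates the negative log-likelihood (7) and the penalized objective at the PCA pair $(\mathbf{R}_{1}\mathbf{\Lambda}_{1}\mathbf{R}_{1}^{\prime},\mathbf{R}_{2}\mathbf{\Lambda}_{2}\mathbf{R}_{2}^{\prime})$ for matrix (I) of Section 7.1 with $q=1$ . The function names are hypothetical, and the trace penalty with $\lambda=1$ is just one admissible choice of $f$ and $\lambda$ .

```python
import numpy as np

def neg_loglik(t, v, sigma):
    """l(T, V) in (7): log|T + V| + trace((T + V)^{-1} Sigma)."""
    s = t + v
    _, logdet = np.linalg.slogdet(s)
    return float(logdet + np.trace(np.linalg.solve(s, sigma)))

def penalized_objective(t, v, sigma, lam, f):
    """F_2(T, V) in (8): l(T, V) + lambda * f(Sigma - T)."""
    return neg_loglik(t, v, sigma) + lam * float(f(sigma - t))

sigma = np.array([[2.0, 1.0], [1.0, 3.0]])
vals, vecs = np.linalg.eigh(sigma)            # ascending: vals[0] = lambda_2, vals[1] = lambda_1
r2, r1 = vecs[:, [0]], vecs[:, [1]]
t_star = vals[1] * (r1 @ r1.T)                # R_1 Lambda_1 R_1'
v_star = vals[0] * (r2 @ r2.T)                # R_2 Lambda_2 R_2'

l_star = neg_loglik(t_star, v_star, sigma)    # T* + V* = Sigma, so l = log|Sigma| + p
l_other = neg_loglik(t_star, v_star + 0.1 * np.eye(2), sigma)
f2_star = penalized_objective(t_star, v_star, sigma, 1.0, np.trace)
```

Since $\mathbf{T}^{*}+\mathbf{V}^{*}=\mathbf{\Sigma}$ , the likelihood part attains its minimum $\log|\mathbf{\Sigma}|+p$ , and with the trace penalty the remaining term is $\lambda\lambda_{2}$ .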

Note that the priors in (9) may be improper. We can also consider other types of penalties. For example, with a tuning parameter $\lambda>0$ , let

F_{2}(\mathbf{T},\mathbf{V})=l(\mathbf{T},\mathbf{V})+\lambda\,\|\mathbf{V}\|_{2}^{2},

(10)

which corresponds to the priors,

(\mathbf{T},\mathbf{V})\propto\exp\left[-\lambda\,\|\mathbf{V}\|_{2}^{2}\right]\cdot I(\mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q).

Theorem 5.

For some $\lambda$ , the solution to (4) with $F_{2}$ in (10) is $(\mathbf{T}^{*},\mathbf{V}^{*})=(\mathbf{R}_{1}\mathbf{\Lambda}_{1}\mathbf{R}^{\prime}_{1},\mathbf{R}_{2}\mathbf{\Lambda}_{2}\mathbf{R}^{\prime}_{2})$ .

5 Objective functions corresponding to FA

5.1 One-matrix optimization

Since $\mathbf{V}=\mathbf{\Sigma}-\mathbf{T}$ is a diagonal matrix in the factor model (1), we first consider the $\ell_{0}$ norm that can yield sparsity. For $\mathbf{S}\geqslant\mathbf{0}$ , define its modified $\ell_{0}$ norm as

\tilde{f}_{0}(\mathbf{S})=\left\{\begin{aligned} &\|\mathbf{S}\|_{0},\quad\,|\mathbf{S}|>0,\\ &p,\quad\quad\ \ |\mathbf{S}|=0,\ \mathbf{S}\neq\mathbf{0},\\ &0,\quad\quad\quad\mathbf{S}=\mathbf{0},\end{aligned}\right.

(11)

where $\|\mathbf{S}\|_{0}$ and $|\mathbf{S}|$ denote the number of nonzero elements in $\mathbf{S}$ and the determinant of $\mathbf{S}$ , respectively. We have that $\tilde{f}_{0}$ satisfies Assumption 3, but does not satisfy Assumptions 4-6.
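The three cases in (11) translate directly into code. The sketch below is a minimal illustration; the function name `modified_l0` and the numerical tolerance `tol` (used to decide when an entry or a determinant counts as zero in floating point) are assumptions that are not part of the definition.

```python
import numpy as np

def modified_l0(s, tol=1e-10):
    """Modified l0 norm (11) of a nonnegative definite p x p matrix s."""
    s = np.asarray(s, dtype=float)
    p = s.shape[0]
    if np.all(np.abs(s) <= tol):              # S = 0
        return 0
    if abs(np.linalg.det(s)) <= tol:          # |S| = 0 but S != 0
        return p
    return int(np.count_nonzero(np.abs(s) > tol))   # |S| > 0: number of nonzero entries

value_zero = modified_l0(np.zeros((2, 2)))
value_singular = modified_l0(np.diag([5.0 / 3.0, 0.0]))      # singular diagonal matrix
value_diagonal = modified_l0(np.eye(2))                       # nonsingular diagonal matrix
value_full = modified_l0(np.array([[2.0, 1.0], [1.0, 3.0]]))  # nonsingular, all entries nonzero
```

Among nonzero matrices, the smallest value $p$ is attained both by nonsingular diagonal matrices and by singular ones, which is what makes a diagonal $\mathbf{\Sigma}-\mathbf{T}$ optimal.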

Theorem 6.

Under model (1), $\mathbf{T}^{*}=\mathbf{A}\mathbf{A}^{\prime}$ is a solution to (3) with $F_{1}(\mathbf{S})=\tilde{f}_{0}(\mathbf{\Sigma}-\mathbf{S})$ .

Theorem 7.

Under model (1), if Assumption 1 holds, then (3) with $F_{1}(\mathbf{S})=\tilde{f}_{0}(\mathbf{\Sigma}-\mathbf{S})$ has the unique solution $\mathbf{T}^{*}=\mathbf{A}\mathbf{A}^{\prime}$ .

Besides the above $F_{1}$ constructed by the modified $\ell_{0}$ norm, many other objective functions can induce FA. For instance, let

F_{1}(\mathbf{S})=f_{\tau}(\mathbf{\Sigma}-\mathbf{S})\quad\text{and}\quad f_{\tau}(\mathbf{S})=\sum_{i<j}|s_{ij}|^{\tau},\quad\tau\geqslant 0

(12)

or

F_{1}(\mathbf{S})=f_{\mathrm{d}}(\mathbf{\Sigma}-\mathbf{S})\quad\text{and}\quad f_{\mathrm{d}}(\mathbf{S})=f(\mathbf{S})-f({\mathrm{diag}}(s_{11},\ldots,s_{pp})),

(13)

where $f$ satisfies Assumption 5. Theorems 6 and 7 hold by replacing $\tilde{f}_{0}$ with $f_{\tau}$ or $f_{\mathrm{d}}$ . By (5) and (7), the maximum likelihood estimates of $\mathbf{T}$ can be derived from the objective function

F_{1}(\mathbf{S})=l(\mathbf{S},\mathbf{V}^{*}(\mathbf{S})),

(14)

where $l$ is defined in (7), and $\mathbf{V}^{*}(\mathbf{S})$ is the solution that minimizes $l(\mathbf{S},\mathbf{V})$ under the constraint that $\mathbf{V}\geqslant\mathbf{0}$ is diagonal for given $\mathbf{S}$ .

Algorithm 1 The unidirectional search algorithm for solving (3)

Input: $\mathbf{\Sigma},\ q,\ s,\ \varepsilon$ .

1: Initialization: Select $\mathbf{A}^{(0)}=(a_{ij}^{(0)})_{i=1,\ldots,p,\ j=1,\ldots,q}\in\mathbb{R}^{p\times q}$ .

2: Iteration: For $k=0,1,\ldots,$

For each of the $pq$ entries, indexed by $i=1,\ldots,p$ and $j=1,\ldots,q$ : change $a_{ij}^{(k)}$ in $\mathbf{A}^{(k)}$ to $a_{ij}^{(k)}+s$ , and let $\mathbf{A}_{+}^{(ij)}$ denote the new matrix; change $a_{ij}^{(k)}$ in $\mathbf{A}^{(k)}$ to $a_{ij}^{(k)}-s$ , and let $\mathbf{A}_{-}^{(ij)}$ denote the new matrix.

Find $\mathcal{A}=\left\{\mathbf{A}\in\left\{\mathbf{A}_{+}^{(ij)},\ \mathbf{A}_{-}^{(ij)}\right\}_{i=1,\ldots,p,\ j=1,\ldots,q}:\ F_{1}\left(\mathbf{A}\mathbf{A}^{\prime}\right)<F_{1}\left(\mathbf{A}^{(k)}\mathbf{A}^{(k)^{\prime}}\right),\ \mathbf{A}\mathbf{A}^{\prime}\leqslant\mathbf{\Sigma}\right\}$ .

If $\mathcal{A}=\emptyset$ , let $s\leftarrow s/2$ ; else let $\mathbf{A}^{(k+1)}=\arg\min_{\mathbf{A}\in\mathcal{A}}F_{1}(\mathbf{A}\mathbf{A}^{\prime})$ , and let $k\leftarrow k+1$ .

If $s<\varepsilon$ , then stop the iterations, and output $\mathbf{T}^{*}=\mathbf{A}^{(k)}\mathbf{A}^{(k)^{\prime}}$ .

Even when the factor model (1) does not hold, we can solve (3) with the objective functions in (11)-(14). The corresponding solution can be viewed as an approximate FA approach for $\boldsymbol{X}$ , which extends FA to more general covariance structures such as the graphical factor models (Giudici and Stanghellini 2001; Grzebyk, Wild, and Chouanière 2004).

The matrix optimization problem (3) with the objective functions corresponding to FA is usually hard to solve. Here we provide a unidirectional search algorithm (Kolda, Lewis, and Torczon 2003) for solving (3) with general objective functions; see Algorithm 1. This is a direct search algorithm that does not require an explicit representation of the gradient of the objective function. The initial $\mathbf{A}^{(0)}$ can be given by PCA. In fact, Algorithm 1 is applicable to other objective functions, including those corresponding to PCA. Note that Algorithm 1 may be computationally expensive, especially when $p$ is large. Developing more efficient strategies is left to future work.
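The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the algorithm rather than a reference implementation: the function name, the iteration cap `max_iter`, and the tolerance `tol` used when testing $\mathbf{A}\mathbf{A}^{\prime}\leqslant\mathbf{\Sigma}$ in floating point are assumptions added for the example. The objective used in the demonstration is $f_{\tau}$ from (12) with $\tau=2$ , and the starting point is the PCA solution for matrix (I) of Section 7.1.

```python
import numpy as np

def unidirectional_search(sigma, a0, f1, s=0.5, eps=1e-6, max_iter=10_000, tol=1e-10):
    """Sketch of Algorithm 1: entrywise +/- s moves on A, with T = A A'."""
    a = np.array(a0, dtype=float)
    best = f1(a @ a.T)
    for _ in range(max_iter):                 # max_iter is a safeguard, not part of Algorithm 1
        if s < eps:
            break
        cand, cand_val = None, best
        for i in range(a.shape[0]):
            for j in range(a.shape[1]):
                for step in (s, -s):
                    trial = a.copy()
                    trial[i, j] += step
                    t = trial @ trial.T
                    # Feasibility A A' <= Sigma: Sigma - T must be nonnegative definite.
                    if np.linalg.eigvalsh(sigma - t).min() < -tol:
                        continue
                    val = f1(t)
                    if val < cand_val:        # keep the best strictly improving candidate
                        cand, cand_val = trial, val
        if cand is None:
            s /= 2.0                          # no improving feasible move: halve the step
        else:
            a, best = cand, cand_val
    return a @ a.T

def f_tau(s_mat, tau=2.0):
    """f_tau in (12): sum over i < j of |s_ij|^tau."""
    iu = np.triu_indices(s_mat.shape[0], k=1)
    return float(np.sum(np.abs(s_mat[iu]) ** tau))

sigma = np.array([[2.0, 1.0], [1.0, 3.0]])
vals, vecs = np.linalg.eigh(sigma)
a0 = np.sqrt(vals[-1]) * vecs[:, [-1]]        # PCA start: A0 A0' = R_1 Lambda_1 R_1'
f1 = lambda t: f_tau(sigma - t, 2.0)

initial_value = f1(a0 @ a0.T)
t_hat = unidirectional_search(sigma, a0, f1)
final_value = f1(t_hat)
```

Each pass evaluates $2pq$ candidates and one eigenvalue computation per candidate, which illustrates why the cost grows quickly with $p$ .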

5.2 Two-matrix optimization

The most popular method for estimating the parameters of the factor model (1) is maximum likelihood estimation (Anderson 1984). It is a two-matrix optimization with the likelihood objective function in (7) and an additional constraint that $\mathbf{V}$ is diagonal. Here we focus on another important estimation method, least squares estimation (Bartholomew, Knott, and Moustaki 2011), which solves the two-matrix optimization problem,

		$\displaystyle\min_{\mathbf{T},\mathbf{V}}\ \left\|\mathbf{\Sigma}-(\mathbf{T}+\mathbf{V})\right\|_{F}^{2},$		(15)
		$\displaystyle{\text{s.t.}}\ \mathbf{T}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,\ \mathbf{V}\geqslant\mathbf{0}\ \text{is diagonal}.$		(15)

This method does not require the normality assumption.

With matrix derivatives, iterative formulas for solving (15) can be derived (Bartholomew, Knott, and Moustaki 2011). We first show that these formulas can also be given by a coordinate descent algorithm (Tseng 2001). Specifically, given $\mathbf{V}$ , (15) reduces to the PCA problem (6); given $\mathbf{T}$ , the solution is trivial by taking diagonal elements. Compared with existing algorithms, this coordinate descent algorithm is easy to understand and implement. Furthermore, from the coordinate descent viewpoint, the iterative formulas possess the monotonicity property that the objective value is non-increasing in each iteration.

Algorithm 2 The coordinate descent algorithm for solving (16)

Input: $\mathbf{\Sigma},\ q,\ \lambda,\ \varepsilon$ .

1: Initialization: Select $\mathbf{V}^{(0)}\leqslant\mathbf{\Sigma}$ .

2: Iteration: For each $k=0,1,\ldots$ ,

Compute $\mathbf{T}^{(k+1)}=\bar{\mathbf{R}}_{1}\bar{\mathbf{\Lambda}}_{1}\bar{\mathbf{R}}^{\prime}_{1}$ , where $\bar{\mathbf{R}}_{1}$ and $\bar{\mathbf{\Lambda}}_{1}$ correspond to the first $q$ eigenvectors and eigenvalues of $\mathbf{\Sigma}-\mathbf{V}^{(k)}$ , respectively.

Compute $\mathbf{V}^{(k+1)}=\mathrm{diag}(\mathbf{\Sigma}-\mathbf{T}^{(k+1)})/(1+\lambda)$ , where $\mathrm{diag}(\mathbf{\Sigma}-\mathbf{T}^{(k+1)})$ denotes the diagonal matrix whose diagonal is the same as that of $\mathbf{\Sigma}-\mathbf{T}^{(k+1)}$ .

If $\mathbf{V}^{(k+1)}\leqslant\mathbf{\Sigma}$ does not hold, then stop the iterations, and output $(\mathbf{T}^{*},\mathbf{V}^{*})=(\mathbf{T}^{(k)},\mathbf{V}^{(k)})$ . Else if $\|\mathbf{\Sigma}-\mathbf{T}^{(k)}-\mathbf{V}^{(k)}\|_{F}-\|\mathbf{\Sigma}-\mathbf{T}^{(k+1)}-\mathbf{V}^{(k+1)}\|_{F}<\varepsilon$ , then stop the iterations, and output $(\mathbf{T}^{*},\mathbf{V}^{*})=(\mathbf{T}^{(k+1)},\mathbf{V}^{(k+1)})$ . Otherwise, let $k\leftarrow k+1$ .

When Assumption 1 does not hold, there is more than one pair $(\mathbf{T},\mathbf{V})$ satisfying the factor model (1). Note that $\mathbf{V}$ should be close to the zero matrix from the viewpoint of dimensionality reduction. We next propose a penalized least squares estimation approach to produce the pair in which $\mathbf{V}$ is closest to the zero matrix. With a tuning parameter $\lambda\geqslant 0$ , the approach solves

		$\displaystyle\min_{\mathbf{T},\mathbf{V}}\ \left\|\mathbf{\Sigma}-(\mathbf{T}+\mathbf{V})\right\|_{F}^{2}+\lambda\|\mathbf{V}\|_{F}^{2},$		(16)
		$\displaystyle{\text{s.t.}}\ \mathbf{T}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,\ \mathbf{V}\geqslant\mathbf{0}\ \text{is diagonal}.$		(16)

Suppose that $\mathbf{\Sigma}$ is a $\sqrt{n}$ -consistent estimate of the true covariance matrix, where $n$ denotes the sample size. If $\lambda=\lambda_{n}\to 0,\ n\lambda_{n}\to+\infty$ , then the solution to (16) converges to the pair corresponding to the smallest value of $\|\mathbf{V}\|_{F}$ . Recall that $\|\mathbf{\cdot}\|_{F}$ is one of the objective functions from which PCA can be derived. The penalized least squares approach can therefore be viewed as a combination of PCA and FA. The coordinate descent algorithm can also be used to solve (16), and we present the detailed steps in Algorithm 2.
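A possible NumPy rendering of Algorithm 2 is sketched below. Several details are assumptions made for the example and are not specified in the algorithm: the default start $\mathbf{V}^{(0)}=\mathbf{0}$ , the iteration cap `max_iter`, the floating-point tolerance in the check $\mathbf{V}^{(k+1)}\leqslant\mathbf{\Sigma}$ , and the convention that the stopping rule is skipped in the first iteration because $\mathbf{T}^{(0)}$ is not defined. The function also records the penalized objective (16) after each sweep so that its monotonicity can be inspected.

```python
import numpy as np

def penalized_ls_fa(sigma, q, lam, eps=1e-10, max_iter=5000, v0=None):
    """Sketch of Algorithm 2: alternate a rank-q eigen step for T and a shrunken diagonal step for V."""
    p = sigma.shape[0]
    v = np.zeros((p, p)) if v0 is None else np.array(v0, dtype=float)
    t, prev_loss, history = None, np.inf, []
    for _ in range(max_iter):                               # max_iter is a safeguard (assumption)
        vals, vecs = np.linalg.eigh(sigma - v)
        idx = np.argsort(vals)[::-1][:q]                    # q largest eigenpairs of Sigma - V
        t_new = vecs[:, idx] @ np.diag(vals[idx]) @ vecs[:, idx].T
        v_new = np.diag(np.diag(sigma - t_new)) / (1.0 + lam)
        if np.linalg.eigvalsh(sigma - v_new).min() < -1e-12:
            break                                           # V <= Sigma fails: keep previous pair
        loss = np.linalg.norm(sigma - t_new - v_new, "fro")
        history.append(loss ** 2 + lam * np.linalg.norm(v_new, "fro") ** 2)
        t, v = t_new, v_new
        if prev_loss - loss < eps:                          # stopping rule of Algorithm 2
            break
        prev_loss = loss
    return t, v, history

sigma = np.array([[2.0, 1.0], [1.0, 3.0]])                  # matrix (I) of Section 7.1
t_hat, v_hat, history = penalized_ls_fa(sigma, q=1, lam=0.01)
```

Because each of the two steps exactly minimizes the objective in (16) over one block with the other held fixed, the recorded objective values should be non-increasing.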

As in Section 4.2, we can derive penalized maximum likelihood estimation for FA and the corresponding Bayesian approach. Note that in these methods, $\mathbf{V}$ is not restricted to be a diagonal matrix. Therefore, we actually get a Bayesian version of the approximate FA approach for $\boldsymbol{X}$ , which can quantify the difference between the underlying model and a factor model.

6 Combinations of PCA and FA

6.1 Path optimization between PCA and FA

In this subsection we present a path optimization problem to connect PCA and FA. Consider the one-matrix optimization problem (3). Let $F_{1,\mathrm{PCA}}$ and $F_{1,\mathrm{FA}}$ denote two objective functions corresponding to PCA and FA, respectively. For $w\in[0,1]$ , let $\mathbf{T}_{w}^{*}$ denote the solution to

		$\displaystyle\min_{\mathbf{T}}\ wF_{1,\mathrm{PCA}}(\mathbf{T})+(1-w)F_{1,\mathrm{FA}}(\mathbf{T}),$		(17)
		$\displaystyle{\text{s.t.}}\ \mathbf{0}\leqslant\mathbf{T}\leqslant\mathbf{\Sigma},\ \mathrm{rank}(\mathbf{T})=q.$		(17)

Therefore $\{\mathbf{T}_{w}^{*}\}_{w\in[0,1]}$ can be viewed as a path between PCA and FA. From the path we can see how the direction of dimensionality reduction varies from PCA to FA. This may help us better understand the difference between PCA and FA. The problem (17) can be solved by Algorithm 1. A simple choice of the objective functions in (17) is $(F_{1,\mathrm{PCA}}(\mathbf{T}),F_{1,\mathrm{FA}}(\mathbf{T}))=(\|\mathbf{\Sigma}-\mathbf{T}\|_{F}^{2},f_{2}(\mathbf{\Sigma}-\mathbf{T}))$ , where $f_{2}$ is $f_{\tau}$ in (12) with $\tau=2$ .
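For the simple choice of objective functions just mentioned, the combined objective in (17) can be written in a few lines. The sketch below only evaluates the objective at the two endpoints of the path for matrix (I) of Section 7.1; it does not trace the path itself. The function names and the use of the decomposition of (I) from Section 7.1 as the FA solution are assumptions made for the illustration.

```python
import numpy as np

def f_tau(s, tau=2.0):
    """f_tau in (12): sum over i < j of |s_ij|^tau."""
    iu = np.triu_indices(s.shape[0], k=1)
    return float(np.sum(np.abs(s[iu]) ** tau))

def path_objective(t, sigma, w):
    """w * ||Sigma - T||_F^2 + (1 - w) * f_2(Sigma - T), the objective in (17)."""
    resid = sigma - t
    return w * float(np.linalg.norm(resid, "fro") ** 2) + (1.0 - w) * f_tau(resid, 2.0)

sigma = np.array([[2.0, 1.0], [1.0, 3.0]])

# PCA endpoint: T = lambda_1 r_1 r_1'.
vals, vecs = np.linalg.eigh(sigma)
t_pca = vals[-1] * np.outer(vecs[:, -1], vecs[:, -1])

# FA endpoint: T = a a' with the loading vector of (I) in Section 7.1.
a = np.array([1.0 / np.sqrt(3.0), np.sqrt(3.0)])
t_fa = np.outer(a, a)

pca_loss_at_pca = path_objective(t_pca, sigma, 1.0)   # equals lambda_2 squared
pca_loss_at_fa = path_objective(t_fa, sigma, 1.0)     # equals ||diag(5/3, 0)||_F^2
fa_loss_at_fa = path_objective(t_fa, sigma, 0.0)      # zero: Sigma - T is diagonal
fa_loss_at_pca = path_objective(t_pca, sigma, 0.0)
```

Each endpoint is better than the other under its own loss, which is what gives the intermediate values of $w$ their role as a trade-off.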

Besides, (17) provides a method to fit the factor model starting from the relatively simple PCA. For a sequence $1=w_{0}>w_{1}>\cdots>w_{m}=0$ , we use PCA as the starting point to compute the solution $\mathbf{T}_{w_{1}}^{*}$ , and then use $\mathbf{T}_{w_{1}}^{*}$ as the starting point to compute the solution $\mathbf{T}_{w_{2}}^{*}$ , and so on. Finally, $\mathbf{T}_{w_{m}}^{*}$ obtained in this way is the FA solution. Note that the optimization problem with the FA objective function may have many local solutions. The solution from the path optimization is relatively close to PCA, which achieves some consistency between PCA and FA.

We can also consider two-matrix optimization for combining PCA and FA. The penalized least squares estimation approach in (16) is an example. Similar to (16), (17) can yield approximate estimates corresponding to the smallest PCA loss for unidentifiable factor models by taking $w\to 0$ .

6.2 A principal component factor model

As two dimensionality reduction methods, PCA and FA are sometimes required to have identical directions of dimensionality reduction (Tipping and Bishop 1999). In this subsection we introduce a special factor model, called principal component factor model, which adds a constraint

\mathcal{L}(\mathbf{A})=\mathcal{L}(\mathbf{R}_{1})

(18)

to the original factor model (1), where $\mathcal{L}(\mathbf{A})$ denotes the column space of $\mathbf{A}$ . Tipping and Bishop (1999)’s probabilistic PCA is a special case of the principal component factor model.

We now discuss estimation for $\mathbf{T}$ and $\mathbf{V}$ under (18). Given an objective function $G(\mathbf{T},\mathbf{V})$ , we estimate them by solving the optimization problem,

		$\displaystyle\min_{\mathbf{T},\mathbf{V}}\ G(\mathbf{T},\mathbf{V}),$		(19)
		$\displaystyle{\text{s.t.}}\ \mathbf{T}=\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]\mathbf{\Gamma}\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]^{\prime},\ \mathbf{\Gamma}\geqslant\mathbf{0}\ \text{is diagonal},\ \mathbf{V}\geqslant\mathbf{0}\ \text{is diagonal},$		(19)

where $\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]$ denotes the $p\times q$ matrix corresponding to the first $q$ principal components of $\mathbf{T}+\mathbf{V}$ . Note that the variables in $\mathbf{\Gamma}$ , which only appear in the constraint, also need to be optimized. The objective functions in (19) corresponding to maximum likelihood estimation and least squares estimation for the principal component factor model can be found in (7) and (15), respectively.

We also use the idea of coordinate descent to solve (19). Let $\boldsymbol{\gamma}=(\gamma_{1},\ldots,\gamma_{q})^{\prime}$ and $\boldsymbol{v}=(v_{1},\ldots,v_{p})^{\prime}$ be the diagonals of $\mathbf{\Gamma}$ and $\mathbf{V}$ , respectively. Note that if $\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]=\mathbf{R}_{1}$ is given, then (19) reduces to a $(p+q)$ -dimensional optimization problem

		$\displaystyle\min_{\boldsymbol{\gamma},\boldsymbol{v}}\ G(\mathbf{R}_{1}{\mathrm{diag}}(\boldsymbol{\gamma})\mathbf{R}_{1}^{\prime},{\mathrm{diag}}(\boldsymbol{v})),$		(20)
		$\displaystyle{\text{s.t.}}\ \gamma_{1},\ldots,\gamma_{q},v_{1},\ldots,v_{p}\geqslant 0.$		(20)

Therefore, we have the following iterative strategy. Given $\mathbf{T}^{(k)},\mathbf{V}^{(k)}$ , compute $\mathbf{R}_{1}^{(k)}=\mathcal{R}_{1}[\mathbf{T}^{(k)}+\mathbf{V}^{(k)}]$ ; use (20) to get $\mathbf{\Gamma}^{(k)},\mathbf{V}^{(k+1)}$ ; compute $\mathbf{T}^{(k+1)}=\mathbf{R}_{1}^{(k)}\mathbf{\Gamma}^{(k)}\mathbf{R}_{1}^{(k)^{\prime}}$ .

For the objective function in least squares estimation, (20) becomes the standard quadratic programming problem,

		$\displaystyle\min_{\boldsymbol{z}}\ \boldsymbol{z}^{\prime}\mathbf{H}\boldsymbol{z}-2\boldsymbol{b}^{\prime}\boldsymbol{z},$		(21)
		$\displaystyle{\text{s.t.}}\ z_{i}\geqslant{0},\ i=1,\ldots,p+q,$		(21)

where $\boldsymbol{z}=(\boldsymbol{\gamma}^{\prime},\boldsymbol{v}^{\prime})^{\prime}=(z_{1},\ldots,z_{p+q})^{\prime}$ , $\mathbf{H}=\begin{pmatrix}q\boldsymbol{1}_{q}\boldsymbol{1}_{q}^{\prime}&\mathbf{R}_{1}^{\prime}\odot\mathbf{R}_{1}^{\prime}\\ \mathbf{R}_{1}\odot\mathbf{R}_{1}&\mathbf{I}_{p}\end{pmatrix}$ ,
$\boldsymbol{b}=\left(\sum_{i=1}^{p}\sum_{j=1}^{p}\sigma_{ij}\boldsymbol{s}_{ij}^{\prime},\,\sigma_{11},\ldots,\sigma_{pp}\right)^{\prime}$ , $\sigma_{ij}$ is the $(i,j)$ -element of the current $\mathbf{\Sigma}=\mathbf{T}+\mathbf{V}$ , $\boldsymbol{s}_{ij}=\boldsymbol{r}^{(i)}\odot\boldsymbol{r}^{(j)}$ , $\boldsymbol{r}^{(i)}$ is the $i$ th row vector of $\mathbf{R}_{1}$ , and $\odot$ denotes element-wise multiplication. The solution to (21) can be used as initial points in computing the maximum likelihood estimates.

The above principal component factor model can be extended to cover general factor models. Let $\mathbf{R}_{(i)}$ denote the $p\times i$ sub-matrix of $\mathbf{R}$ corresponding to the first $i$ eigenvectors, $1\leqslant i\leqslant p$ . Consider the factor model (1) with a factor loading matrix $\mathbf{A}$ . We define the principal component index of this model as the smallest integer $q^{*}$ such that $\mathcal{L}(\mathbf{A})\subset\mathcal{L}(\mathbf{R}_{(q^{*})})$ . Therefore, any factor model, with a principal component index at most $p$ , is connected to PCA. In addition, for an unidentifiable factor model, we may improve its identifiability by reducing its principal component index.
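The principal component index can be computed by testing, for increasing $i$ , whether the columns of $\mathbf{A}$ lie in the span of the first $i$ eigenvectors of $\mathbf{\Sigma}$ . The sketch below is one way to do this; the function name and the relative tolerance `tol` are assumptions, and the result is only well defined when ties among the eigenvalues do not affect the relevant subspaces.

```python
import numpy as np

def principal_component_index(a, sigma, tol=1e-8):
    """Smallest i such that the column space of a lies in the span of the first i eigenvectors."""
    vals, vecs = np.linalg.eigh(sigma)
    r = vecs[:, np.argsort(vals)[::-1]]          # eigenvectors ordered by decreasing eigenvalue
    scale = max(1.0, float(np.linalg.norm(a)))
    for i in range(1, sigma.shape[0] + 1):
        r_i = r[:, :i]
        resid = a - r_i @ (r_i.T @ a)            # component of A outside span(R_(i))
        if np.linalg.norm(resid) <= tol * scale:
            return i
    return sigma.shape[0]

# A probabilistic-PCA-type model: V is a multiple of the identity, so L(A) = L(R_1).
a_ppca = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]])
sigma_ppca = a_ppca @ a_ppca.T + 0.5 * np.eye(4)
index_ppca = principal_component_index(a_ppca, sigma_ppca)

# Decomposition (I) of Section 7.1: the loading vector is not an eigenvector of Sigma.
a_1 = np.array([[1.0 / np.sqrt(3.0)], [np.sqrt(3.0)]])
sigma_1 = np.array([[2.0, 1.0], [1.0, 3.0]])
index_1 = principal_component_index(a_1, sigma_1)
```

In the first example the index equals $q=2$ , as expected for a principal component factor model, while in the second it equals $p=2>q=1$ .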

7 Numerical examples

7.1 Penalized least squares estimation

We use numerical examples to illustrate the penalized least squares method for unidentifiable factor models in Section 5.2. Consider the following two covariance matrices,

\displaystyle\mathrm{(I)}\ \mathbf{\Sigma}=\begin{pmatrix}2&1\\ 1&3\end{pmatrix}=\begin{pmatrix}1/\sqrt{3}\\ \sqrt{3}\end{pmatrix}\begin{pmatrix}1/\sqrt{3}&\sqrt{3}\end{pmatrix}+\begin{pmatrix}5/3&0\\ 0&0\end{pmatrix},

\displaystyle\mathrm{(II)}\ \mathbf{\Sigma}=\begin{pmatrix}3&-1&-2&2\\ -1&2&0&-1\\ -2&0&4&-2\\ 2&-1&-2&2\end{pmatrix}=\begin{pmatrix}1&-1\\ -1&0\\ 0&2\\ 1&-1\end{pmatrix}\begin{pmatrix}1&-1&0&1\\ -1&0&2&-1\end{pmatrix}+\begin{pmatrix}1&0&0&0\\ 0&1&0&0\\ 0&0&0&0\\ 0&0&0&0\end{pmatrix}.

Since all factor models with $(p,q)=(2,1)$ or $(p,q)=(4,2)$ are unidentifiable (Anderson and Rubin 1956), $\mathbf{\Sigma}$ can be written in the form $\mathbf{A}\mathbf{A}^{\prime}+\mathbf{V}$ with different pairs $(\mathbf{A},\mathbf{V})$. The above equations show the ones with the smallest $\|\mathbf{V}\|_{F}$. For (I) and (II), we solve the penalized least squares problem (16) with a varying tuning parameter $\lambda$ by Algorithm 2, and show the results in Figure 1. As $1/\lambda$ increases, the $\|\mathbf{V}\|_{F}$ obtained from the penalized least squares estimates converges to the minimal $\|\mathbf{V}\|_{F}$. Meanwhile, the least squares loss converges to zero, which indicates that the estimates are consistent with the structure of factor models for large $1/\lambda$. Therefore, the proposed penalized least squares estimation can effectively yield estimates with good dimensionality reduction performance under unidentifiable factor models.
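The displayed decompositions, and the non-uniqueness of $(\mathbf{A},\mathbf{V})$, can be checked numerically. The sketch below verifies (I) and (II) and, for (I), a second pair $(\boldsymbol{a},\mathbf{V})$ that reproduces the same $\mathbf{\Sigma}$. The second pair is a hypothetical choice made here for illustration and does not appear in the text above.

```python
import numpy as np

sigma_I = np.array([[2.0, 1.0], [1.0, 3.0]])

# Decomposition displayed in (I).
a1 = np.array([[1 / np.sqrt(3)], [np.sqrt(3)]])
V1 = np.diag([5 / 3, 0.0])

# A second decomposition of the same matrix (illustrative assumption).
a2 = np.array([[1.0], [1.0]])
V2 = np.diag([1.0, 2.0])

fits_1 = np.allclose(a1 @ a1.T + V1, sigma_I)
fits_2 = np.allclose(a2 @ a2.T + V2, sigma_I)

# Decomposition displayed in (II).
sigma_II = np.array(
    [[3, -1, -2, 2], [-1, 2, 0, -1], [-2, 0, 4, -2], [2, -1, -2, 2]], dtype=float
)
A_II = np.array([[1, -1], [-1, 0], [0, 2], [1, -1]], dtype=float)
V_II = np.diag([1.0, 1.0, 0.0, 0.0])
fits_II = np.allclose(A_II @ A_II.T + V_II, sigma_II)

# np.linalg.norm on a matrix defaults to the Frobenius norm.
norm_V1 = np.linalg.norm(V1)
norm_V2 = np.linalg.norm(V2)
print(fits_1, fits_2, fits_II, norm_V1, norm_V2)
```

Both pairs for (I) reproduce $\mathbf{\Sigma}$ exactly, with $\|\mathbf{V}\|_{F}=5/3$ for the displayed pair and $\|\mathbf{V}\|_{F}=\sqrt{5}$ for the alternative, which illustrates why an additional criterion is needed to select among equivalent decompositions.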

Figure 1: Results of penalized least squares estimation for (I) and (II) in Section 7.1.

7.2 Path optimization

We show two examples of the path between PCA and FA introduced in Section 6.1. Consider the following two covariance matrices,

	$\displaystyle\mathrm{(i)}\ \mathbf{\Sigma}=\begin{pmatrix}2&1&1\\ 1&3&1\\ 1&1&3\end{pmatrix}=\begin{pmatrix}1\\ 1\\ 1\end{pmatrix}\begin{pmatrix}1&1&1\end{pmatrix}+\begin{pmatrix}1&0&0\\ 0&2&0\\ 0&0&2\end{pmatrix},$
	$\displaystyle\mathrm{(ii)}\ \mathbf{\Sigma}=\begin{pmatrix}3&-2&-1&1&-2\\ -2&5&0&-2&2\\ -1&0&4&0&1\\ 1&-2&0&4&-1\\ -2&2&1&-1&5\end{pmatrix}=\begin{pmatrix}1&-1\\ -2&0\\ 0&1\\ 1&0\\ -1&1\end{pmatrix}\begin{pmatrix}1&-2&0&1&-1\\ -1&0&1&0&1\end{pmatrix}+\begin{pmatrix}1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&3&0&0\\ 0&0&0&3&0\\ 0&0&0&0&3\end{pmatrix}.$

We solve the path optimization problem (17) with objective functions $(F_{1,\mathrm{PCA}}(\mathbf{T}),F_{1,\mathrm{FA}}(\mathbf{T}))=(\|\mathbf{\Sigma}-\mathbf{T}\|_{F}^{2},f_{2}(\mathbf{\Sigma}-\mathbf{T}))$. For each $w$, we use Algorithm 1 to compute the solution. The results are displayed in Figure 2. As $w$ varies from 1 to 0, the objective value of PCA increases, while that of FA decreases to 0. This indicates that the fitted model moves closer to the factor model and eventually reaches it.

7.3 Examples of fitting the principal component factor model

We now use two real datasets to fit the principal component factor model in Section 6.2. The first dataset covers 48 Chinese cities and contains three indices of urban development for the year 2008. These indices are as follows:

	$\displaystyle X_{1}:$		per-capita investment in the fixed assets (thousand RMB)
	$\displaystyle X_{2}:$		per-capita net income of rural people (thousand RMB)
	$\displaystyle X_{3}:$		birthrate (‰)

All data used in this example can be found on the internet (available from the author).

Here $p=3$ , and thus we only need to consider $q=1$ for fitting a factor model. We first compute the maximum likelihood estimate of the population covariance matrix as

\mathbf{\Sigma}=\begin{pmatrix}82.5524&4.6990&-5.6177\\ 4.6990&4.6262&-1.5502\\ -5.6177&-1.5502&4.7571\end{pmatrix}.

The first principal component is $\boldsymbol{r}_{1}=(0.9971,0.0441,-0.0614)^{\prime}$. We then compute the maximum likelihood estimates of the parameters in the principal component factor model by the iterative method in Section 6.2, and obtain the fitted model,

\left\{\begin{aligned} &X_{1}=8.8792\,U+\varepsilon_{1},\\ &X_{2}=0.3928\,U+\varepsilon_{2},\\ &X_{3}=-0.5466\,U+\varepsilon_{3},\end{aligned}\right.

(22)

where $U\sim N(0,1),\ \varepsilon_{1}\sim N(0,2.0827^{2}),\ \varepsilon_{2}\sim N(0,2.0789^{2}),\ \varepsilon_{3}\sim N(0,2.0819^{2})$ are independent. The factor loading vector in (22) is $\boldsymbol{a}=(8.8792,0.3928,-0.5466)^{\prime}=\sqrt{79.2938}\,\boldsymbol{r}_{1}\approx 8.9047\,\boldsymbol{r}_{1}$, which indicates that $\boldsymbol{a}$ has the same direction as the first principal component. We also conduct the likelihood ratio test of whether the covariance structure of the principal component factor model holds, and get $p$-value $=0.1538$.
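The claim that $\boldsymbol{a}$ and $\boldsymbol{r}_{1}$ share a direction can be checked directly from the reported four-decimal values. The sketch below computes the cosine of the angle between the two vectors and the squared norm of $\boldsymbol{a}$; the variable names are illustrative.

```python
import numpy as np

a = np.array([8.8792, 0.3928, -0.5466])   # loading vector reported in (22)
r1 = np.array([0.9971, 0.0441, -0.0614])  # first principal component as reported

# Cosine of the angle between a and r1; a value of 1 means identical direction.
cosine = a @ r1 / (np.linalg.norm(a) * np.linalg.norm(r1))

# Squared Euclidean norm of the loading vector.
squared_norm = a @ a
print(cosine, squared_norm)
```

Up to rounding in the reported digits, the cosine equals 1 and the squared norm of $\boldsymbol{a}$ is about 79.29, so the scale of the loading vector relative to the unit vector $\boldsymbol{r}_{1}$ is about 8.90.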

In (22), the common factor $U$ can be interpreted as the level of development. With a higher level of development, the two economic indicators, $X_{1}$ and $X_{2}$ , are expected to be high, but the birthrate has a decreasing trend.

The second dataset contains data on air pollution emissions. Air pollution is one of the world’s largest health and environmental problems, and analyzing the sources of emissions is a practical step toward reducing it. We now study the relationships among the concentrations (unit: $\mathrm{mg}/\mathrm{m}^{3}$) of five ions in the air. The ions are as follows:

	$\displaystyle X_{1}:$		$\displaystyle\text{chloride ion}\quad(\text{Cl}^{-})$
	$\displaystyle X_{2}:$		$\displaystyle\text{ammonium ion}\quad({\text{NH}_{4}}^{+})$
	$\displaystyle X_{3}:$		$\displaystyle\text{calcium ion}\quad(\text{Ca}^{2+})$
	$\displaystyle X_{4}:$		$\displaystyle\text{sodium ion}\quad(\text{Na}^{+})$
	$\displaystyle X_{5}:$		$\displaystyle\text{potassium ion}\quad(\text{K}^{+})$

The data used in this example are sampled from 56 pollution events in 2001-2007 in a Chinese city, which has suffered from serious air pollution.

Consider $q=2$ for fitting a principal component factor model in Section 6.2. We first compute the maximum likelihood estimate of the population covariance matrix as

\mathbf{\Sigma}=\begin{pmatrix}134.7&848.1&460.4&118.3&62.0\\ 848.1&7706.9&2941.1&885.2&296.7\\ 460.4&2941.1&2189.4&576.9&302.9\\ 118.3&885.2&576.9&264.8&70.4\\ 62.0&296.7&302.9&70.4&61.9\end{pmatrix}.

The first two principal components are $\boldsymbol{r}_{1}=(-0.1036,-0.9033,-0.3944,-0.1259,-0.0426)^{\prime}$ and $\boldsymbol{r}_{2}=(-0.0662,0.4193,-0.8006,-0.4053,-0.1207)^{\prime}$. We then compute the maximum likelihood estimates of the parameters in the principal component factor model, and obtain the fitted model,

\left\{\begin{aligned} &X_{1}=-9.9518\,U_{1}-2.1146\,U_{2}+\varepsilon_{1},\\ &X_{2}=-86.7484\,U_{1}+13.3896\,U_{2}+\varepsilon_{2},\\ &X_{3}=-37.8793\,U_{1}-25.5632\,U_{2}+\varepsilon_{3},\\ &X_{4}=-12.0859\,U_{1}-12.9416\,U_{2}+\varepsilon_{4},\\ &X_{5}=-4.0875\,U_{1}-3.8551\,U_{2}+\varepsilon_{5},\end{aligned}\right.

(23)

where $U_{1}\sim N(0,1),\ U_{2}\sim N(0,1),\ \varepsilon_{1}\sim N(0,4.7523^{2}),\ \varepsilon_{2}\sim N(0,10.1974^{2}),\ \varepsilon_{3}\sim N(0,7.5504^{2}),\ \varepsilon_{4}\sim N(0,10.2346^{2}),\ \varepsilon_{5}\sim N(0,3.9732^{2})$ are independent. The factor loading vectors in (23) are $\sqrt{9221.9}\,\boldsymbol{r}_{1}\approx 96.03\,\boldsymbol{r}_{1}$ and $\sqrt{1019.6}\,\boldsymbol{r}_{2}\approx 31.93\,\boldsymbol{r}_{2}$. We also conduct the likelihood ratio test of whether the covariance structure of the principal component factor model holds, and get $p$-value $=0.3036$. The common factors $U_{1}$ and $U_{2}$ can be interpreted as two sources of air pollution.
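To see what covariance structure the fitted model (23) implies, one can form $\mathbf{A}\mathbf{A}^{\prime}+\mathbf{V}$ from the reported loadings and error standard deviations. The sketch below does this and also reports the squared norm of each loading column; the variable names are illustrative, and the numbers are copied from (23) and the sentence following it.

```python
import numpy as np

# Loadings (columns correspond to U_1 and U_2) as reported in (23).
A = np.array([
    [-9.9518, -2.1146],
    [-86.7484, 13.3896],
    [-37.8793, -25.5632],
    [-12.0859, -12.9416],
    [-4.0875, -3.8551],
])
# Standard deviations of the errors epsilon_1, ..., epsilon_5.
error_sd = np.array([4.7523, 10.1974, 7.5504, 10.2346, 3.9732])

# Covariance implied by X = A U + eps with Cov(U) = I and independent errors.
implied = A @ A.T + np.diag(error_sd ** 2)

# Squared Euclidean norm of each loading column.
column_sq_norms = (A ** 2).sum(axis=0)
print(np.round(implied, 1))
print(np.round(column_sq_norms, 1))
```

The squared column norms come out close to 9221.9 and 1019.6, and the implied matrix can be compared entry by entry with the estimated $\mathbf{\Sigma}$ above to judge the fit informally, alongside the likelihood ratio test.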

8 Discussion

In this paper we have proposed a unified way to handle PCA and FA under a general latent variable model. With specific loss functions, the two methods can be viewed as solutions to matrix optimization problems. Along this way, we have discussed a number of topics, including effective algorithms, penalized methods, and some combinations of PCA and FA.

The value of this paper is twofold. First, many students, including myself in my student days, are confused about the connections and differences between PCA and FA when taking a course in multivariate statistical analysis, because the two methods differ considerably from other textbook topics such as statistical inference. The proposed framework for handling PCA and FA is relatively simple and can complement existing textbooks. We hope it will help students better understand both methods.

Second, we have provided new viewpoints from which to reexamine the two classical methods, and have presented new methods for data analysis. These suggest several directions for future research. For example, there are numerous other loss functions for fitting the model (2). A natural class of objective functions in (3) is $F_{1}(\mathbf{T})=d\left(N(\boldsymbol{0},\mathbf{\Sigma}),N(\boldsymbol{0},\mathbf{T})\right)$, where $d$ measures some distance between two distributions. Note that the Wasserstein distance, which is related to the optimal transport problem, has been widely applied in machine learning and statistics (Panaretos and Zemel 2020). With $d$ being the Wasserstein distance, $F_{1}$ has the closed form $F_{1}(\mathbf{T})={\mathrm{trace}}\left(\mathbf{\Sigma}+\mathbf{T}-2\left(\mathbf{\Sigma}^{1/2}\mathbf{T}\mathbf{\Sigma}^{1/2}\right)^{1/2}\right)$ (Knott and Smith 1984). We conjecture that the solution to (3) with this objective function yields PCA. Based on our numerous numerical experiments, the conjecture appears to hold even when the constraint $\mathbf{T}\leqslant\mathbf{\Sigma}$ is removed, as in (6). This conjecture may establish a connection between PCA and optimal transport. Note that eigenvalue decomposition is computationally expensive for large datasets. We also hope to find efficient iterative algorithms for specific objective functions that induce PCA, as alternatives to eigenvalue decomposition (Roweis 1997; Hippert-Ferrer et al. 2023). Another direction is to extend our methods to more complex problems, including supervised PCA (Bair et al. 2006), sparse PCA (Zou, Hastie, and Tibshirani 2006; Johnstone and Lu 2009), nonlinear dimensionality reduction (Ghojogh et al. 2023), categorical data analysis (Bhattacharya and Dunson 2012), and estimation of intrinsic dimension (Levina and Bickel 2004).
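The closed form of the Wasserstein objective is easy to evaluate numerically. The sketch below is only an illustration and does not verify the conjecture: it evaluates $F_{1}$ at the rank-$q$ PCA truncation of $\mathbf{\Sigma}$, where a direct calculation shows that $F_{1}$ equals the sum of the discarded eigenvalues. The function names `psd_sqrt`, `wasserstein_loss`, and `pca_truncation` are hypothetical, and the test matrix is matrix (i) of Section 7.2.

```python
import numpy as np


def psd_sqrt(M):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    M = (M + M.T) / 2.0  # remove rounding asymmetry
    w, Q = np.linalg.eigh(M)
    # Clip tiny negative eigenvalues caused by rounding before taking roots.
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T


def wasserstein_loss(sigma, T):
    """trace(Sigma + T - 2 (Sigma^{1/2} T Sigma^{1/2})^{1/2})."""
    root = psd_sqrt(sigma)
    cross = psd_sqrt(root @ T @ root)
    return float(np.trace(sigma + T - 2.0 * cross))


def pca_truncation(sigma, q):
    """Rank-q matrix built from the q leading eigenpairs of sigma."""
    w, Q = np.linalg.eigh(sigma)
    order = np.argsort(w)[::-1][:q]
    return (Q[:, order] * w[order]) @ Q[:, order].T


sigma = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])
T_pca = pca_truncation(sigma, q=1)
loss_at_pca = wasserstein_loss(sigma, T_pca)

# Sum of the eigenvalues not retained by the rank-1 truncation.
discarded = np.sort(np.linalg.eigvalsh(sigma))[:-1].sum()
print(loss_at_pca, discarded)
```

A numerical study of the conjecture would compare `loss_at_pca` with the value of `wasserstein_loss` at other feasible rank-$q$ matrices $\mathbf{T}$; that search is not attempted here.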

Appendix

Proof of Theorem 1 Under Assumption 2, for fixed $\mathbf{T}$ , the solution to (4) with $F_{2}(\mathbf{T},\mathbf{V})=h(\mathbf{T}+\mathbf{V})+F_{1}(\mathbf{T})$ is $\mathbf{V}^{*}(\mathbf{T})=\mathbf{\Sigma}-\mathbf{T}$ . Then the optimal $\mathbf{T}^{*}$ is the solution to (3), and this completes the proof. ∎

Proof of Lemma 1 It suffices to show that $\|\mathbf{S}\|_{2}\geqslant s_{11}$, which is obtained by noting $\|\mathbf{S}\|_{2}=\max_{\boldsymbol{x},\ \|\boldsymbol{x}\|_{2}=1}\boldsymbol{x}^{\prime}\mathbf{S}\boldsymbol{x}$ and taking $\boldsymbol{x}=(1,0,\ldots,0)^{\prime}$. ∎

Proof of Theorem 2 Let $\tilde{\mathbf{T}}=\mathbf{R}^{\prime}\mathbf{T}\mathbf{R}$ . By Assumption 4, (3) is equivalent to

		$\displaystyle\min_{\mathbf{T}}\ f(\mathbf{\Lambda}-\tilde{\mathbf{T}}),\quad{\text{s.t.}}\ \mathbf{0}\leqslant\tilde{\mathbf{T}}\leqslant\mathbf{\Lambda},\ \mathrm{rank}(\tilde{\mathbf{T}})=q.$		(24)

By Assumption 5, we only need to consider $\tilde{\mathbf{T}}$ to be a diagonal matrix. By Assumption 6, the solution to (24) is $\tilde{\mathbf{T}}^{*}=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{q},0,\ldots,0)$ , which completes the proof. ∎

Proof of Theorem 5 Note that $(\mathbf{T}^{*},\mathbf{V}^{*})$ is the solution to

		$\displaystyle\min_{\mathbf{T},\mathbf{V}}\ l(\mathbf{T},\mathbf{V}),$
		$\displaystyle{\text{s.t.}}\ \mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,\ \|\mathbf{V}\|_{2}\leqslant\kappa,$

where $\kappa\geqslant\|\mathbf{V}^{*}\|_{2}$ is a tuning parameter. This completes the proof. ∎

Proof of Theorem 6 Under model (1), $\tilde{f}_{0}(\mathbf{\Sigma}-\mathbf{T}^{*})=\tilde{f}_{0}(\mathbf{V})\leqslant p$. By (11), for any matrix $\mathbf{S}$ satisfying the constraints in (3), $\tilde{f}_{0}(\mathbf{\Sigma}-\mathbf{S})<\tilde{f}_{0}(\mathbf{\Sigma}-\mathbf{T}^{*})$ if and only if $\mathbf{\Sigma}-\mathbf{S}=\mathbf{0}$, which is a contradiction since $\mathrm{rank}(\mathbf{S})=q<\mathrm{rank}(\mathbf{\Sigma})$. ∎

Acknowledgements

This work is partially supported by National Key R&D Program of China (Grant nos. 2021YFA1000300, 2021YFA1000301, and 2021YFA1000303) and the National Natural Science Foundation of China (Grant no. 12171462).

References

: Anderson, T. W. (1984), An Introduction to Multivariate Statistical Analysis, Second Edition, John Wiley & Sons.
: Anderson, T. and Rubin, H. (1956), Statistical inference in factor analysis. In: Neyman, J., ed., Proc. 3rd Berkeley Symp. Mathematical Statistics and Probability, vol. V, Berkeley, CA: University of California Press, pp. 111-150.
: Bhattacharya, A. and Dunson, D. B. (2012), Simplex factor models for multivariate unordered categorical data. Journal of the American Statistical Association, 107, 362-377.
: Bair, E., Hastie, T., Paul, D., and Tibshirani, R. (2006), Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119-137.
: Bartholomew, D., Knott, M., and Moustaki, I. (2011), Latent Variable Models and Factor Analysis: A Unified Approach, Third Edition, John Wiley & Sons.
: Ghojogh, B., Crowley, M., Karray, F., and Ghodsi, A. (2023), Elements of Dimensionality Reduction and Manifold Learning, Springer Nature.
: Giudici, P. and Stanghellini, E. (2001), Bayesian inference for graphical factor analysis models. Psychometrika, 66, 577-592.
: Grzebyk, M., Wild, P., and Chouanière, D. (2004), On identification of multi-factor models with correlated residuals. Biometrika, 91, 141-151.
: Hippert-Ferrer, A., Bouchard, F., Mian, A., Vayer, T., and Breloy, A. (2023), Learning graphical factor models with Riemannian optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 349-366, Cham: Springer Nature Switzerland. arXiv:2210.11950v2
: Levina, E., and Bickel, P. (2004), Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems, 17.
: Johnstone, I. M. and Lu, A. Y. (2009), On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104, 682-693.
: Jolliffe, I. T. (2002), Principal Component Analysis, Second Edition, New York: Springer.
: Jolliffe, I. T. and Cadima, J. (2016), Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374: 20150202.
: Knott, M. and Smith, C. S. (1984), On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43, 39-49.
: Kolda, T. G., Lewis, R. M., and Torczon, V. (2003), Optimization by direct search: New perspectives on some classical and modern methods. SIAM review, 45: 385-482.
: Panaretos, V. M. and Zemel, Y. (2020), An Invitation to Statistics in Wasserstein Space, Springer.
: Roweis, S. (1997), EM algorithms for PCA and SPCA. Neural Inf. Proc. Syst., 10, 626-632.
: Shapiro, A. (1985), Identifiability of factor analysis: some results and open problems. Linear Algebra and Its Applications, 70: 1-7.
: Srebro, N. and Jaakkola, T. (2003), Weighted low-rank approximations. In ICML, Vol. 3, 720-727.
: Tipping, M. E. and Bishop, C. M. (1999), Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61, 611-622.
: Tseng, P. (2001), Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109, 475-494.
: Zou, H., Hastie, T., and Tibshirani, R. (2006), Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265-286.