A unified framework of principal component analysis and factor analysis

Shifeng Xiong 1,2 (corresponding author, email: [email protected])

1. NCMIS, KLSC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences
Beijing, China
2. School of Mathematical Sciences, University of Chinese Academy of Sciences
Beijing, China

Abstract Principal component analysis and factor analysis are fundamental multivariate analysis methods. This paper introduces a unified framework that connects them. Under a general latent variable model, we present matrix optimization problems from the viewpoint of loss function minimization, and show that the two methods can be viewed as solutions to these optimization problems with specific loss functions. Specifically, principal component analysis can be derived from a broad class of loss functions including the $\ell_2$ norm, while factor analysis corresponds to a modified $\ell_0$ norm problem. Related problems are discussed, including algorithms, penalized maximum likelihood estimation under the latent variable model, and a principal component factor model. These results can lead to new tools for data analysis and new research topics.

KEY WORDS: Dimensionality reduction; Latent variable model; Low-rank approximation; Matrix optimization; Probabilistic principal component analysis.

1 Introduction

Principal component analysis (PCA) and factor analysis (FA) can be found in any textbook of multivariate statistical analysis; see e.g. Anderson (1984). Both methods have histories of over one hundred years. As basic dimensionality reduction methods, PCA and FA reduce the dimension of the original random vector via eigenvectors of the covariance matrix and interpretable factors, respectively. Their variants and applications have been extensively studied in the literature (Jolliffe 2002; Bartholomew, Knott, and Moustaki 2011).

Generally speaking, PCA is model free, but FA is based on the factor model. There are some connections between PCA and FA. Tipping and Bishop (1999) used a special factor model to define a probabilistic PCA method. PCA is often used as the starting point in the iterations for computing maximum likelihood estimates under the factor model (Jolliffe 2002). Overall, PCA and FA are viewed as two relatively separate topics. There is no work on more general and closer connections between the two fundamental methods. In this paper we introduce a framework to unify them within a general latent variable model. We show that they can be viewed as solutions that minimize specific loss functions under this model. Specifically, PCA can be derived from a broad class of loss functions including the $\ell_2$ norm, while FA corresponds to a modified $\ell_0$ norm loss. The other main contributions of this paper include:

(a) We show that PCA can be derived from a penalized maximum likelihood estimation problem. When treating the penalty as a Bayesian prior, we obtain a Bayesian version of PCA, which can be viewed as a new probabilistic PCA model.

(b) We propose a penalized least squares method for FA, which performs well in terms of dimensionality reduction under unidentifiable factor models. The corresponding iterative algorithm is also given.

(c) We present a combination of PCA loss and FA loss, and construct the corresponding path between PCA and FA.

(d) We introduce a principal component factor model that links FA to PCA more closely.

To the best of our knowledge, the above findings have not appeared in the literature. We expect that these findings will not only help us better understand PCA and FA, but also lead to new tools for data analysis and new research topics.

The rest of this paper is organized as follows. Section 2 briefly reviews PCA and FA. Section 3 introduces the general latent variable model and related optimization problems. Sections 4 and 5 present the objective functions corresponding to PCA and FA, respectively. Section 6 discusses combinations of PCA and FA. Section 7 provides numerical illustrations. We conclude the paper with some discussion in Section 8. All proofs are given in the Appendix.

2 Review of principal component analysis and factor analysis

In this section we present a brief review of principal component analysis (PCA) and factor analysis (FA). Some notation and definitions are needed. For a vector or matrix $\mathbf{a}$, let $\mathbf{a}'$ denote its transpose. For two symmetric matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, write $\mathbf{A}\leqslant\mathbf{B}$ ($\mathbf{A}\geqslant\mathbf{B}$) if $\mathbf{B}-\mathbf{A}$ ($\mathbf{A}-\mathbf{B}$) is nonnegative definite, and write $\mathbf{A}<\mathbf{B}$ ($\mathbf{A}>\mathbf{B}$) if $\mathbf{B}-\mathbf{A}$ ($\mathbf{A}-\mathbf{B}$) is positive definite. Let $\mathbf{0}$ denote the zero matrix.

Consider a random $p$-vector $\boldsymbol{X}$ whose covariance matrix is $\mathrm{Cov}(\boldsymbol{X})=\mathbf{\Sigma}>\mathbf{0}$. Without loss of generality, let its expectation be $\mathrm{E}(\boldsymbol{X})=\boldsymbol{0}$. PCA of $\boldsymbol{X}$ is based on the eigenvalue decomposition of $\mathbf{\Sigma}$, denoted by $\mathbf{\Sigma}=\mathbf{R}\mathbf{\Lambda}\mathbf{R}'$, where $\mathbf{R}=(\boldsymbol{r}_1,\ldots,\boldsymbol{r}_p)$ is an orthogonal matrix and $\mathbf{\Lambda}$ is the diagonal matrix $\mathrm{diag}(\lambda_1,\ldots,\lambda_p)$ with $\lambda_1\geqslant\cdots\geqslant\lambda_p>0$. The $k$th principal component of $\boldsymbol{X}$ is $\boldsymbol{r}_k'\boldsymbol{X}$, whose variance is $\lambda_k$.
Partition $\mathbf{R}$ as $\mathbf{R}=(\mathbf{R}_1\ \mathbf{R}_2)$, where $\mathbf{R}_1\in\mathbb{R}^{p\times q}$. Let $\mathbf{\Lambda}_1=\mathrm{diag}(\lambda_1,\ldots,\lambda_q)$. The cumulative proportion of the first $q$ principal components $\mathbf{R}_1'\boldsymbol{X}$ to $\boldsymbol{X}$ is defined as $(\lambda_1+\cdots+\lambda_q)/(\lambda_1+\cdots+\lambda_p)=\mathrm{trace}(\mathbf{\Lambda}_1)/\mathrm{trace}(\mathbf{\Sigma})=\mathrm{trace}\left(\mathrm{Cov}(\mathbf{R}_1'\boldsymbol{X})\right)/\mathrm{trace}\left(\mathrm{Cov}(\boldsymbol{X})\right)$.
If the cumulative proportion is sufficiently large, say, at least $85\%$, then we can say that $\boldsymbol{X}$ is well represented by the first $q$ principal components. Hence PCA is a linear dimensionality reduction technique. In practice, PCA can be implemented by decomposing an estimate of $\mathbf{\Sigma}$, say, the sample covariance matrix, computed from observations of $\boldsymbol{X}$.
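The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca_from_covariance` and the 3 x 3 matrix `sigma` are hypothetical, chosen only to make the eigendecomposition and the cumulative proportion concrete.

```python
import numpy as np

def pca_from_covariance(sigma, q):
    """Return (R1, diagonal of Lambda1, cumulative proportion) for the first q components."""
    # eigh handles symmetric matrices and returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(sigma)
    order = np.argsort(eigvals)[::-1]          # reorder so lambda_1 >= ... >= lambda_p
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    r1 = eigvecs[:, :q]                        # p x q matrix R_1
    lam1 = eigvals[:q]                         # diagonal entries of Lambda_1
    cum_prop = lam1.sum() / eigvals.sum()      # trace(Lambda_1) / trace(Sigma)
    return r1, lam1, cum_prop

# Hypothetical covariance matrix, used only for illustration
sigma = np.array([[4.0, 2.0, 0.0],
                  [2.0, 3.0, 0.0],
                  [0.0, 0.0, 0.5]])
r1, lam1, cum_prop = pca_from_covariance(sigma, q=2)
```

Here the two leading eigenvalues sum to 7 and the trace of `sigma` is 7.5, so the cumulative proportion is 7/7.5, above the 85% threshold mentioned in the text.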

FA of $\boldsymbol{X}$ is based on the following factor model,

$$\boldsymbol{X}=\mathbf{A}\boldsymbol{U}+\boldsymbol{\varepsilon}, \tag{1}$$

where the factor loading matrix $\mathbf{A}$ is a $p\times q$ constant matrix of full column rank, $1\leqslant q<p$, the vector of common factors $\boldsymbol{U}\sim N_q(\boldsymbol{0},\mathbf{I}_q)$ is independent of $\boldsymbol{\varepsilon}\sim N_p(\boldsymbol{0},\mathbf{V})$, $\mathbf{I}_q$ is the identity matrix of order $q$, and $\mathbf{V}=\mathrm{diag}(v_1,\ldots,v_p)$ with $v_i\geqslant 0,\ i=1,\ldots,p$. We have $\mathrm{Cov}(\boldsymbol{X})=\mathbf{\Sigma}=\mathbf{A}\mathbf{A}'+\mathbf{V}>\mathbf{0}$. The main purpose of FA is to estimate $\mathbf{A}$, or $\mathbf{A}\mathbf{A}'$, which yields a low-dimensional representation of $\boldsymbol{X}$. Note that the factor model (1) can be unidentifiable. Usually we need to make the following assumption.

Assumption 1.

Model (1) is identifiable (Shapiro 1985), i.e., there does not exist a nonnegative definite diagonal matrix $\tilde{\mathbf{V}}$, different from $\mathbf{V}$, such that $\mathbf{\Sigma}-\tilde{\mathbf{V}}\geqslant\mathbf{0}$ and $\mathrm{rank}(\mathbf{\Sigma}-\tilde{\mathbf{V}})\leqslant\mathrm{rank}(\mathbf{\Sigma}-\mathbf{V})$.
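As a sanity check on the covariance structure $\mathbf{\Sigma}=\mathbf{A}\mathbf{A}'+\mathbf{V}$ implied by model (1), the following sketch simulates the factor model and compares the sample covariance with the population value. The loading matrix `A`, the diagonal `v`, the seed, and the sample size are all illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 5, 2, 200_000

# Hypothetical loading matrix A (p x q) and diagonal of V, chosen only for illustration
A = rng.normal(size=(p, q))
v = np.array([0.5, 1.0, 0.3, 0.8, 0.6])
sigma = A @ A.T + np.diag(v)                 # Cov(X) = A A' + V under model (1)

U = rng.normal(size=(n, q))                  # common factors, U ~ N_q(0, I_q)
eps = rng.normal(size=(n, p)) * np.sqrt(v)   # errors, eps ~ N_p(0, V) with V diagonal
X = U @ A.T + eps                            # each row is one draw of X = A U + eps

sample_cov = X.T @ X / n                     # E(X) = 0, so no centering is needed
max_abs_err = np.abs(sample_cov - sigma).max()
rank_common = np.linalg.matrix_rank(A @ A.T) # rank of the common part A A'
```

With this many draws the entrywise error `max_abs_err` should be small, and `rank_common` equals `q`, which is the low-dimensional structure FA aims to recover.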

Next we provide some comparisons between PCA and FA. A major distinction between them is that there is a definite model underlying FA, while PCA does not require a model. Tipping and Bishop (1999) proposed a probabilistic model for PCA, which is actually a special case of (1) with $v_1=\cdots=v_p$; see also Roweis (1997). This model cannot cover the broad range of applications of PCA. An advantage of FA is its interpretability, because the common factors $\boldsymbol{U}$ in (1) usually have real meanings; this is the main reason why FA is so popular in various fields. In addition, PCA is often used in computations of FA (Jolliffe 2002; Bartholomew, Knott, and Moustaki 2011). For detailed reviews of PCA and FA, we refer readers to several monographs and review papers, e.g., Bartholomew, Knott, and Moustaki (2011), Jolliffe and Cadima (2016), and Ghojogh, Crowley, Karray, et al. (2023).

3 A latent variable model and related matrix optimization problems

Consider the following latent variable model that generalizes (1),

$$\boldsymbol{X}=\mathbf{A}\boldsymbol{U}+\boldsymbol{\varepsilon}. \tag{2}$$

As in (1), $\mathbf{A}$ is a $p\times q$ constant matrix of full column rank, $q<p$, and $\boldsymbol{U}\sim N_q(\boldsymbol{0},\mathbf{I}_q)$ is an unobserved latent vector independent of $\boldsymbol{\varepsilon}\sim N_p(\boldsymbol{0},\mathbf{V}),\ \mathbf{V}\geqslant\mathbf{0}$. Generally, $\mathbf{V}$ is not restricted to be diagonal. However, we sometimes restrict it to be a diagonal or other special matrix according to the problem. Write $\mathbf{T}=\mathbf{A}\mathbf{A}'\geqslant\mathbf{0}$, and thus $\mathrm{rank}(\mathbf{T})=q$. By (2), $\mathrm{Cov}(\boldsymbol{X})=\mathbf{\Sigma}=\mathbf{A}\mathbf{A}'+\mathbf{V}=\mathbf{T}+\mathbf{V}$. Unlike in the factor model, any $\mathbf{\Sigma}$ can be written in this form. Clearly the model (2) is unidentifiable, and we can specify $\mathbf{T}$ and $\mathbf{V}$ by minimizing various loss functions. In the following, $\mathbf{\Sigma}$ also denotes an estimate of the population covariance matrix when only observations of $\boldsymbol{X}$ are available.
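One way to see that every $\mathbf{\Sigma}>\mathbf{0}$ admits such a decomposition is to split the eigendecomposition from Section 2. In the display below, $\mathbf{\Lambda}_2=\mathrm{diag}(\lambda_{q+1},\ldots,\lambda_p)$ is notation assumed here for the trailing eigenvalues, and the choice of $\mathbf{A}$ is only one of many possibilities.

```latex
\[
\mathbf{\Sigma}
= \mathbf{R}\mathbf{\Lambda}\mathbf{R}'
= \underbrace{\mathbf{R}_1\mathbf{\Lambda}_1\mathbf{R}_1'}_{\mathbf{T},\ \mathrm{rank}(\mathbf{T})=q}
+ \underbrace{\mathbf{R}_2\mathbf{\Lambda}_2\mathbf{R}_2'}_{\mathbf{V}\,\geqslant\,\mathbf{0}},
\qquad
\mathbf{A}=\mathbf{R}_1\mathbf{\Lambda}_1^{1/2},
\quad
\mathbf{A}\mathbf{A}'=\mathbf{T}.
\]
```

The matrix $\mathbf{V}$ in this particular split is generally not diagonal, which model (2) allows but the factor model (1) does not.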

We consider two classes of methods for specifying $\mathbf{T}$ and $\mathbf{V}$. One is to first specify $\mathbf{T}$ by solving the following matrix optimization problem,

$$\begin{aligned} &\min_{\mathbf{T}}\ F_1(\mathbf{T}), \\ &\text{s.t.}\ \mathbf{0}\leqslant\mathbf{T}\leqslant\mathbf{\Sigma},\ \mathrm{rank}(\mathbf{T})=q, \end{aligned} \tag{3}$$

where $F_1$ is an objective function defined on the set of all nonnegative definite matrices. Consequently, we specify $\mathbf{V}$ as $\mathbf{V}^*=\mathbf{\Sigma}-\mathbf{T}^*$ (or by another strategy, since $\mathbf{V}$ is less important), where $\mathbf{T}^*$ is the solution to (3). The other is to simultaneously specify $\mathbf{T}$ and $\mathbf{V}$, without the requirement $\mathbf{T}+\mathbf{V}=\mathbf{\Sigma}$, by solving the two-matrix optimization problem,

$$\begin{aligned} &\min_{\mathbf{T},\mathbf{V}}\ F_2(\mathbf{T},\mathbf{V}), \\ &\text{s.t.}\ \mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q, \end{aligned} \tag{4}$$

where $F_2$ is an objective function.

The two-matrix problem can yield a one-matrix problem by taking

$$F_1(\mathbf{T})=F_2(\mathbf{T},\mathbf{V}^*(\mathbf{T})) \tag{5}$$

in (3), where $\mathbf{V}^*(\mathbf{T})$ is the minimizer of $F_2(\mathbf{T},\mathbf{V})$ over $\mathbf{V}$ for given $\mathbf{T}$. Conversely, one-matrix problems induce two-matrix problems under the following assumption. Let $h$ be a function defined on the set of all nonnegative definite matrices.

Assumption 2.

For all $\mathbf{S}\geqslant\mathbf{0}$, $h(\mathbf{S})\geqslant h(\mathbf{\Sigma})$.

Theorem 1.

Let $\mathbf{T}^*$ be the solution to (3). If Assumption 2 holds, then $(\mathbf{T}^*,\mathbf{\Sigma}-\mathbf{T}^*)$ is the solution to (4) with $F_2(\mathbf{T},\mathbf{V})=h(\mathbf{T}+\mathbf{V})+F_1(\mathbf{T})$.

In the following two sections we will show that PCA and FA can be derived by (3) or (4) with specified loss/objective functions.

4 Objective functions corresponding to PCA

4.1 One-matrix optimization

From the viewpoint of dimensionality reduction, it is required that the distribution of $\boldsymbol{X}$ can be well approximated by that of $\mathbf{A}\boldsymbol{U}$, i.e., $\mathbf{T}$ should be close to $\mathbf{\Sigma}$. In other words, we need to find a low-rank approximation of $\mathbf{\Sigma}$. This inspires us to consider the objective function $F_1(\mathbf{T})=f(\mathbf{\Sigma}-\mathbf{T})$, where $f$ is a function defined on the set of all nonnegative definite matrices that evaluates the distance between a nonnegative definite matrix and the zero matrix. To induce PCA, we present the following assumptions on $f$.

Assumption 3.

For all $\mathbf{S}\geqslant\mathbf{0}$, $f(\mathbf{S})\geqslant 0$, and $f(\mathbf{S})=0$ if and only if $\mathbf{S}=\mathbf{0}$.

Assumption 4.

For all $\mathbf{S}\geqslant\mathbf{0}$ and every orthogonal matrix $\mathbf{R}$ of order $p$, $f(\mathbf{R}\mathbf{S}\mathbf{R}')=f(\mathbf{S})$.

Assumption 5.

For all $\mathbf{S}=(s_{ij})_{i,j=1,\ldots,p}\geqslant\mathbf{0}$, $f(\mathbf{S})\geqslant f\left(\mathrm{diag}(s_{11},\ldots,s_{pp})\right)$.

Assumption 6.

The function $g(\boldsymbol{x})=f\left(\mathrm{diag}(\boldsymbol{x})\right),\ \boldsymbol{x}\in[0,+\infty)^p$, is nondecreasing in each variable.

Consider the following choices of $f$: the trace $f(\mathbf{S})=\mathrm{trace}(\mathbf{S})$, the $\ell_2$ norm (also called the spectral norm) $f(\mathbf{S})=\|\mathbf{S}\|_2$, which is the largest eigenvalue of $\mathbf{S}$, and the Frobenius norm $f(\mathbf{S})=\|\mathbf{S}\|_F=(\mathrm{trace}(\mathbf{S}^2))^{1/2}$.

Lemma 1.

The $\ell_2$ norm $f(\mathbf{S})=\|\mathbf{S}\|_2$ satisfies Assumption 5.

By Lemma 1 and simple algebra, $f(\mathbf{S})=\mathrm{trace}(\mathbf{S})$, $\|\mathbf{S}\|_2$, and $\|\mathbf{S}\|_F$ satisfy Assumptions 3-6. Assumptions 4-6 are technical and are designed only for the proofs that follow. Apart from the above choices of $f$ and their combinations, such as $f(\mathbf{S})=\|\mathbf{S}\|_2+\|\mathbf{S}\|_F+\mathrm{trace}(\mathbf{S})$, it is hard to find other $f$ satisfying the assumptions.
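A numerical spot-check (not a proof) of Assumptions 4 and 5 for the three functions above can be sketched as follows. The random matrices, the seed, and the helper names `f_trace`, `f_spectral`, and `f_frobenius` are illustrative assumptions.

```python
import numpy as np

# The three candidate functions f discussed in the text
def f_trace(s):
    return np.trace(s)

def f_spectral(s):
    # largest singular value, which equals the largest eigenvalue when S >= 0
    return np.linalg.norm(s, 2)

def f_frobenius(s):
    return np.linalg.norm(s, "fro")

rng = np.random.default_rng(1)
p = 4
b = rng.normal(size=(p, p))
s = b @ b.T                                      # a random nonnegative definite matrix
r, _ = np.linalg.qr(rng.normal(size=(p, p)))     # a random orthogonal matrix

checks = {}
for name, f in [("trace", f_trace), ("spectral", f_spectral), ("frobenius", f_frobenius)]:
    invariant = np.isclose(f(r @ s @ r.T), f(s))               # Assumption 4
    diag_dominated = f(s) >= f(np.diag(np.diag(s))) - 1e-12    # Assumption 5
    checks[name] = (bool(invariant), bool(diag_dominated))
```

For the trace, the inequality in Assumption 5 holds with equality, since removing off-diagonal entries does not change the trace.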

Theorem 2.

Under Assumptions 3-6, the solution to (3) with $F_1(\mathbf{T})=f(\mathbf{\Sigma}-\mathbf{T})$ is $\mathbf{T}^*=\mathbf{R}_1\mathbf{\Lambda}_1\mathbf{R}_1'$.

Theorem 2 indicates that PCA can be derived from the optimization problem (3) with a broad class of objective functions. Taking $f(\mathbf{S})=\mathrm{trace}(\mathbf{S})$ corresponds to the common concept of PCA, variance maximization. Moreover, some objective functions that do not satisfy the above assumptions can also yield PCA. For example, let $f(\mathbf{S})=\mathrm{rank}(\mathbf{S})$, which does not satisfy Assumption 5. Note that $\mathrm{rank}(\mathbf{\Sigma}-\mathbf{T})\geqslant\mathrm{rank}(\mathbf{\Sigma})-\mathrm{rank}(\mathbf{T})=p-q=\mathrm{rank}(\mathbf{\Sigma}-\mathbf{T}^*)$ for $\mathbf{T}^*=\mathbf{R}_1\mathbf{\Lambda}_1\mathbf{R}_1'$. Hence $\mathbf{T}^*$ solves (3) with $f(\mathbf{S})=\mathrm{rank}(\mathbf{S})$.
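The statement of Theorem 2 can be probed numerically. The sketch below is an illustration under assumptions, not part of the paper's argument: it draws random feasible matrices of the form $\mathbf{T}=\mathbf{\Sigma}^{1/2}\mathbf{P}\mathbf{\Sigma}^{1/2}$, with $\mathbf{P}$ a rank-$q$ orthogonal projection (so that $\mathbf{0}\leqslant\mathbf{T}\leqslant\mathbf{\Sigma}$ and $\mathrm{rank}(\mathbf{T})=q$), and checks that none of them beats $\mathbf{T}^*$ under the three losses. The covariance matrix, seed, and number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 5, 2
b = rng.normal(size=(p, p))
sigma = b @ b.T + np.eye(p)                    # illustrative covariance with Sigma > 0

lam, r = np.linalg.eigh(sigma)                 # ascending eigenvalues
lam, r = lam[::-1], r[:, ::-1]                 # reorder to lambda_1 >= ... >= lambda_p
t_star = r[:, :q] @ np.diag(lam[:q]) @ r[:, :q].T   # T* = R_1 Lambda_1 R_1'
sigma_half = r @ np.diag(np.sqrt(lam)) @ r.T   # symmetric square root of Sigma

losses = {
    "trace": np.trace,
    "spectral": lambda s: np.linalg.norm(s, 2),
    "frobenius": lambda s: np.linalg.norm(s, "fro"),
}

def random_feasible_t():
    # T = Sigma^{1/2} P Sigma^{1/2} with P = basis basis' a rank-q projection
    basis, _ = np.linalg.qr(rng.normal(size=(p, q)))
    return sigma_half @ basis @ basis.T @ sigma_half

never_beaten = all(
    f(sigma - random_feasible_t()) >= f(sigma - t_star) - 1e-9
    for _ in range(1000)
    for f in losses.values()
)
# Explicit tolerance so that rounding noise is not counted as rank
rank_of_residual = np.linalg.matrix_rank(sigma - t_star, tol=1e-8)
```

The last line also illustrates the rank remark: the residual $\mathbf{\Sigma}-\mathbf{T}^*$ has rank $p-q$.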

Interestingly, if we expand the domain of $f$ to the set of all symmetric matrices, then PCA also corresponds to the solution to the optimization problem (3) without the constraint $\mathbf{T}\leqslant\mathbf{\Sigma}$, i.e.,

$$\begin{aligned} &\min_{\mathbf{T}}\ f(\mathbf{\Sigma}-\mathbf{T}), \\ &\text{s.t.}\ \mathbf{T}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q, \end{aligned} \tag{6}$$

for $f$ satisfying the following further assumption.

Assumption 7.

For all symmetric $\mathbf{S}$, $f(-\mathbf{S})=f(\mathbf{S})$.

The following theorem can be proven similarly to Theorem 2.

Theorem 3.

Under Assumptions 3-7, the solution to (6) is $\mathbf{T}^*=\mathbf{R}_1\mathbf{\Lambda}_1\mathbf{R}_1'$.

It is clear that $f(\mathbf{S})=\|\mathbf{S}\|_2$ and $\|\mathbf{S}\|_F$ satisfy Assumption 7 while $\mathrm{trace}(\mathbf{S})$ does not. The special case of Theorem 3 with $f(\mathbf{S})=\|\mathbf{S}\|_F$ was proven by Srebro and Jaakkola (2003) using matrix derivatives.
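The role of Assumption 7 in problem (6) can be illustrated with a small experiment. This is a sketch under assumed inputs (random covariance, seed, and scaling range are arbitrary): random rank-$q$ matrices $\mathbf{T}=\mathbf{C}\mathbf{C}'$ that need not satisfy $\mathbf{T}\leqslant\mathbf{\Sigma}$ never improve on $\mathbf{T}^*$ under the Frobenius norm, whereas the trace, extended naively to symmetric matrices, can be driven arbitrarily low by inflating $\mathbf{T}$.

```python
import numpy as np

rng = np.random.default_rng(3)
p, q = 5, 2
b = rng.normal(size=(p, p))
sigma = b @ b.T + np.eye(p)                    # illustrative Sigma > 0

lam, r = np.linalg.eigh(sigma)
lam, r = lam[::-1], r[:, ::-1]                 # descending eigenvalues
t_star = r[:, :q] @ np.diag(lam[:q]) @ r[:, :q].T

def random_rank_q_psd():
    # T = C C' with C of size p x q: T >= 0 and rank(T) = q; T <= Sigma is NOT enforced
    c = rng.normal(size=(p, q)) * rng.uniform(0.1, 3.0)
    return c @ c.T

frob_ok = all(
    np.linalg.norm(sigma - random_rank_q_psd(), "fro")
    >= np.linalg.norm(sigma - t_star, "fro") - 1e-9
    for _ in range(1000)
)

# The trace is not invariant under S -> -S, so it violates Assumption 7:
# an inflated T makes trace(Sigma - T) smaller than trace(Sigma - T*).
big_t = 100.0 * t_star
trace_counterexample = np.trace(sigma - big_t) < np.trace(sigma - t_star)
```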

4.2 Penalized maximum likelihood estimation

In this subsection we derive PCA by the likelihood principle. The negative log-likelihood (up to a constant) under (2) is

$$l(\mathbf{T},\mathbf{V})=\log(|\mathbf{T}+\mathbf{V}|)+\mathrm{trace}\left((\mathbf{T}+\mathbf{V})^{-1}\mathbf{\Sigma}\right), \tag{7}$$

and thus we can adopt $l$ as the objective function in the two-matrix optimization problem (4). Clearly, this problem is not well defined since any $(\mathbf{T},\mathbf{V})$ satisfying $\mathbf{T}+\mathbf{V}=\mathbf{\Sigma}$ is a solution. To derive PCA, we add a penalty and let

$$F_2(\mathbf{T},\mathbf{V})=l(\mathbf{T},\mathbf{V})+\lambda\,f(\mathbf{\Sigma}-\mathbf{T}), \tag{8}$$

where $\lambda>0$ is a pre-specified parameter.

Taking $h(\mathbf{S})=\log(|\mathbf{S}|)+\mathrm{trace}\left(\mathbf{S}^{-1}\mathbf{\Sigma}\right)$ in Theorem 1 and applying Theorem 2, we obtain the following result.

Theorem 4.

Under Assumptions 3-6, for all $\lambda$, the solution to (4) with $F_2$ in (8) is $(\mathbf{T}^*,\mathbf{V}^*)=(\mathbf{R}_1\mathbf{\Lambda}_1\mathbf{R}_1',\ \mathbf{R}_2\mathbf{\Lambda}_2\mathbf{R}_2')$.
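The likelihood part of this result can be checked numerically. The sketch below assumes $\mathbf{\Lambda}_2=\mathrm{diag}(\lambda_{q+1},\ldots,\lambda_p)$ and uses an arbitrary random covariance. It verifies that the PCA pair satisfies $\mathbf{T}^*+\mathbf{V}^*=\mathbf{\Sigma}$, that $l(\mathbf{T}^*,\mathbf{V}^*)=\log|\mathbf{\Sigma}|+p$, and that moving $\mathbf{T}+\mathbf{V}$ away from $\mathbf{\Sigma}$ increases $l$. The helper name `neg_loglik` is hypothetical.

```python
import numpy as np

def neg_loglik(t, v, sigma):
    """l(T, V) = log|T + V| + trace((T + V)^{-1} Sigma), as in (7)."""
    s = t + v
    _, logdet = np.linalg.slogdet(s)           # log-determinant, numerically stable
    return logdet + np.trace(np.linalg.solve(s, sigma))

rng = np.random.default_rng(4)
p, q = 4, 2
b = rng.normal(size=(p, p))
sigma = b @ b.T + np.eye(p)                    # illustrative Sigma > 0

lam, r = np.linalg.eigh(sigma)
lam, r = lam[::-1], r[:, ::-1]                 # descending eigenvalues
t_star = r[:, :q] @ np.diag(lam[:q]) @ r[:, :q].T   # R_1 Lambda_1 R_1'
v_star = r[:, q:] @ np.diag(lam[q:]) @ r[:, q:].T   # R_2 Lambda_2 R_2'

l_star = neg_loglik(t_star, v_star, sigma)
l_floor = np.linalg.slogdet(sigma)[1] + p      # value of h at S = Sigma

# Perturbing V so that T + V != Sigma should strictly increase l
worse = [neg_loglik(t_star, v_star + eps * np.eye(p), sigma) for eps in (0.1, 0.5, 2.0)]
```

Because $l$ depends on $(\mathbf{T},\mathbf{V})$ only through the sum $\mathbf{T}+\mathbf{V}$, it is the penalty term in (8) that singles out the PCA pair among all decompositions of $\mathbf{\Sigma}$.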

Theorem 4 indicates that PCA can be viewed as a penalized maximum likelihood estimate. We can also view PCA as a Bayesian posterior mode under the following priors

$$(\mathbf{T},\mathbf{V})\propto\exp\left[-\lambda\,f(\mathbf{\Sigma}-\mathbf{T})\right]\cdot I(\mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q), \tag{9}$$

where $I$ denotes the indicator function. As a result we obtain a Bayesian framework for PCA that allows uncertainty quantification. Unlike the probabilistic PCA of Tipping and Bishop (1999), this probabilistic PCA framework accommodates all covariance matrices.

Note that the priors in (9) may be improper. We can also consider other types of penalties. For example, with a tuning parameter $\lambda>0$, let

$$F_2(\mathbf{T},\mathbf{V})=l(\mathbf{T},\mathbf{V})+\lambda\,\|\mathbf{V}\|_2^2, \tag{10}$$

which corresponds to the priors,

$$(\mathbf{T},\mathbf{V})\propto\exp\left[-\lambda\,\|\mathbf{V}\|_2^2\right]\cdot I(\mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q).$$

Theorem 5.

For some $\lambda$, the solution to (4) with $F_2$ in (10) is $(\mathbf{T}^*,\mathbf{V}^*)=(\mathbf{R}_1\mathbf{\Lambda}_1\mathbf{R}_1',\ \mathbf{R}_2\mathbf{\Lambda}_2\mathbf{R}_2')$.

5 Objective functions corresponding to FA

5.1 One-matrix optimization

Since $\mathbf{V}=\mathbf{\Sigma}-\mathbf{T}$ is a diagonal matrix in the factor model (1), we first consider the $\ell_{0}$ norm that can yield sparsity. For $\mathbf{S}\geqslant\mathbf{0}$, define its modified $\ell_{0}$ norm as

$$\tilde{f}_{0}(\mathbf{S})=\begin{cases}\|\mathbf{S}\|_{0}, & |\mathbf{S}|>0,\\ p, & |\mathbf{S}|=0,\ \mathbf{S}\neq\mathbf{0},\\ 0, & \mathbf{S}=\mathbf{0},\end{cases} \tag{11}$$

where $\|\mathbf{S}\|_{0}$ and $|\mathbf{S}|$ denote the number of nonzero elements in $\mathbf{S}$ and the determinant of $\mathbf{S}$, respectively. We have that $\tilde{f}_{0}$ satisfies Assumption 3, but does not satisfy Assumptions 4-6.
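The definition in (11) can be illustrated with a minimal NumPy sketch. The function name `modified_l0` and the numerical tolerance `tol` are assumptions introduced here for illustration; the input is assumed to be symmetric positive semidefinite, so that a positive determinant is equivalent to a positive smallest eigenvalue.

```python
import numpy as np

def modified_l0(S, tol=1e-10):
    """Modified l0 norm (11) of a symmetric positive semidefinite matrix S.

    `tol` is an assumed numerical threshold for treating a value as zero.
    """
    S = np.asarray(S, dtype=float)
    if np.all(np.abs(S) <= tol):
        return 0                                   # case S = 0
    # For a PSD matrix, |S| > 0 holds iff the smallest eigenvalue is positive.
    if np.linalg.eigvalsh(S).min() <= tol:
        return S.shape[0]                          # singular but nonzero: value p
    return int(np.count_nonzero(np.abs(S) > tol))  # nonsingular: number of nonzeros
```

A nonsingular positive semidefinite matrix has strictly positive diagonal entries, so in the first case the value is at least $p$, with equality exactly when the matrix is diagonal. This is what makes a diagonal residual $\mathbf{\Sigma}-\mathbf{S}$ attractive under this objective.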

Theorem 6.

Under model (1), $\mathbf{T}^{*}=\mathbf{A}\mathbf{A}^{\prime}$ is a solution to (3) with $F_{1}(\mathbf{S})=\tilde{f}_{0}(\mathbf{\Sigma}-\mathbf{S})$.

Theorem 7.

Under model (1), if Assumption 1 holds, then (3) with $F_{1}(\mathbf{S})=\tilde{f}_{0}(\mathbf{\Sigma}-\mathbf{S})$ has the unique solution $\mathbf{T}^{*}=\mathbf{A}\mathbf{A}^{\prime}$.

Besides the above $F_{1}$ constructed by the modified $\ell_{0}$ norm, many other objective functions can induce FA. For instance, let

$$F_{1}(\mathbf{S})=f_{\tau}(\mathbf{\Sigma}-\mathbf{S})\quad\text{and}\quad f_{\tau}(\mathbf{S})=\sum_{i<j}|s_{ij}|^{\tau},\quad\tau\geqslant 0 \tag{12}$$

or

$$F_{1}(\mathbf{S})=f_{\mathrm{d}}(\mathbf{\Sigma}-\mathbf{S})\quad\text{and}\quad f_{\mathrm{d}}(\mathbf{S})=f(\mathbf{S})-f(\mathrm{diag}(s_{11},\ldots,s_{pp})), \tag{13}$$

where $f$ satisfies Assumption 5. Theorems 6 and 7 hold by replacing $\tilde{f}_{0}$ with $f_{\tau}$ or $f_{\mathrm{d}}$. By (5) and (7), the maximum likelihood estimates of $\mathbf{T}$ can be derived from the objective function

$$F_{1}(\mathbf{S})=l(\mathbf{S},\mathbf{V}^{*}(\mathbf{S})), \tag{14}$$

where $l$ is defined in (7), and $\mathbf{V}^{*}(\mathbf{S})$ is the solution that minimizes $l(\mathbf{S},\mathbf{V})$ under the constraint that $\mathbf{V}\geqslant\mathbf{0}$ is diagonal for given $\mathbf{S}$.
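The off-diagonal objectives in (12) and (13) can be sketched in a few lines of NumPy. The names `f_tau` and `f_d` are chosen here for illustration. Two assumptions are made: for $\tau=0$ the convention $0^{0}=0$ is used, so that $f_{0}$ counts the nonzero off-diagonal entries, and the squared Frobenius norm in the usage note is only a convenient stand-in for $f$, which should be checked against Assumption 5 before use.

```python
import numpy as np

def f_tau(S, tau):
    """Sum of |s_ij|^tau over i < j, as in (12)."""
    S = np.asarray(S, dtype=float)
    off = np.abs(S[np.triu_indices(S.shape[0], k=1)])  # strictly upper-triangular entries
    if tau == 0:
        # Assumed convention 0^0 = 0: count the nonzero off-diagonal entries.
        return float(np.count_nonzero(off))
    return float(np.sum(off ** tau))

def f_d(S, f):
    """f(S) - f(diag(s_11, ..., s_pp)), as in (13), for a user-supplied f."""
    S = np.asarray(S, dtype=float)
    return f(S) - f(np.diag(np.diag(S)))
```

With the illustrative choice `f = lambda M: float(np.sum(M ** 2))`, `f_d(S, f)` equals `2 * f_tau(S, 2)` for any symmetric `S`, because each off-diagonal entry appears twice in the squared Frobenius norm. Both functions vanish when the residual is diagonal.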

Algorithm 1  The unidirectional search algorithm for solving (3)
Input: $\mathbf{\Sigma},\ q,\ s,\ \varepsilon$.
Output: $\mathbf{T}^{*}$.
1:  Initialization: Select $\mathbf{A}^{(0)}=(a_{ij}^{(0)})_{i=1,\ldots,p,\ j=1,\ldots,q}\in\mathbb{R}^{p\times q}$.
2:  Iteration: For $k=0,1,\ldots,$
      For each of the $pq$ entries $(i,j)$, $i=1,\ldots,p,\ j=1,\ldots,q$,
         Change $a_{ij}^{(k)}$ in $\mathbf{A}^{(k)}$ to $a_{ij}^{(k)}+s$, and let $\mathbf{A}_{+}^{(ij)}$ denote the new matrix;
         Change $a_{ij}^{(k)}$ in $\mathbf{A}^{(k)}$ to $a_{ij}^{(k)}-s$, and let $\mathbf{A}_{-}^{(ij)}$ denote the new matrix.
      Find $\mathcal{A}=\left\{\mathbf{A}\in\left\{\mathbf{A}_{+}^{(ij)},\ \mathbf{A}_{-}^{(ij)}\right\}_{i=1,\ldots,p,\ j=1,\ldots,q}:\ F_{1}\left(\mathbf{A}\mathbf{A}^{\prime}\right)<F_{1}\left(\mathbf{A}^{(k)}\mathbf{A}^{(k)\prime}\right),\ \mathbf{A}\mathbf{A}^{\prime}\leqslant\mathbf{\Sigma}\right\}$.
      If $\mathcal{A}=\emptyset$, let $s\leftarrow s/2$;
      Else let $\mathbf{A}^{(k+1)}=\arg\min_{\mathbf{A}\in\mathcal{A}}F_{1}(\mathbf{A}\mathbf{A}^{\prime})$, and let $k\leftarrow k+1$.
      If $s<\varepsilon$, then stop the iterations, and output $\mathbf{T}^{*}=\mathbf{A}^{(k)}\mathbf{A}^{(k)\prime}$.

Even when the factor model (1) does not hold, we can solve (3) with the objective functions in (11)-(14). The corresponding solution can be viewed as an approximate FA approach for $\boldsymbol{X}$, which extends FA to more general covariance structures such as the graphical factor models (Giudici and Stanghellini 2001; Grzebyk, Wild, and Chouanière 2004).

The matrix optimization problem (3) with the objective functions corresponding to FA is usually hard to solve. Here we provide a unidirectional search algorithm (Kolda, Lewis, and Torczon 2003) for solving (3) with general objective functions; see Algorithm 1. This is a direct search algorithm that does not require an explicit representation of the gradient of the objective function. The initial $\mathbf{A}^{(0)}$ can be given by PCA. In fact, Algorithm 1 is applicable to other objective functions, including those corresponding to PCA. Note that Algorithm 1 may be computationally expensive, especially when $p$ is large, since each iteration evaluates the objective at up to $2pq$ candidate matrices. More efficient strategies are left for future work.
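The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading of the algorithm rather than a reference implementation: the names `unidirectional_search` and `is_loewner_leq`, the tolerance `tol`, and the safeguard `max_iter` are assumptions added here, and the ordering $\mathbf{A}\mathbf{A}^{\prime}\leqslant\mathbf{\Sigma}$ is checked through the smallest eigenvalue of $\mathbf{\Sigma}-\mathbf{A}\mathbf{A}^{\prime}$.

```python
import numpy as np

def is_loewner_leq(T, Sigma, tol=1e-10):
    """True if Sigma - T is positive semidefinite up to the assumed tolerance."""
    return np.linalg.eigvalsh(Sigma - T).min() >= -tol

def unidirectional_search(Sigma, A0, F1, s=0.5, eps=1e-6, max_iter=10000):
    """Sketch of Algorithm 1: coordinate-wise direct search over the entries of A."""
    A = np.array(A0, dtype=float)
    best = F1(A @ A.T)
    for _ in range(max_iter):              # max_iter is an added safeguard
        if s < eps:
            break
        cand, cand_val = None, best
        for idx in np.ndindex(A.shape):    # all p*q entries (i, j)
            for step in (s, -s):           # the matrices A_+^(ij) and A_-^(ij)
                B = A.copy()
                B[idx] += step
                T = B @ B.T
                if not is_loewner_leq(T, Sigma):
                    continue               # infeasible: B B' <= Sigma fails
                val = F1(T)
                if val < cand_val:         # keep the best strictly improving candidate
                    cand, cand_val = B, val
        if cand is None:
            s /= 2.0                       # no improving feasible neighbour: halve the step
        else:
            A, best = cand, cand_val
    return A @ A.T
```

A typical call starts from the PCA loading matrix, for example `A0 = R[:, -q:] * np.sqrt(w[-q:])` with `w, R = np.linalg.eigh(Sigma)`, and passes the objective as a function of $\mathbf{T}$, such as `F1 = lambda T: float(np.sum(np.triu(Sigma - T, k=1) ** 2))` for $f_{2}$ in (12). Since only improving feasible moves are accepted, the returned objective value never exceeds the starting one.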

5.2 Two-matrix optimization

The most popular method for estimating the parameters of the factor model (1) is maximum likelihood estimation (Anderson 1984). It is a two-matrix optimization problem with the likelihood objective function in (7) and an additional constraint that $\mathbf{V}$ is diagonal. Here we focus on another important estimation method, least squares estimation (Bartholomew, Knott, and Moustaki 2011), which solves the two-matrix optimization problem,

$$\begin{aligned}&\min_{\mathbf{T},\mathbf{V}}\ \left\|\mathbf{\Sigma}-(\mathbf{T}+\mathbf{V})\right\|_{F}^{2},\\ &\text{s.t.}\ \mathbf{T}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,\ \mathbf{V}\geqslant\mathbf{0}\ \text{is diagonal}.\end{aligned} \tag{15}$$

This method does not require the normality assumption.

With matrix derivatives, iterative formulas for solving (15) can be derived (Bartholomew, Knott, and Moustaki 2011). We first show that these formulas can also be given by a coordinate descent algorithm (Tseng 2001). Specifically, given $\mathbf{V}$, (15) reduces to the PCA problem (6); given $\mathbf{T}$, the minimizer is obtained directly by taking the diagonal elements of $\mathbf{\Sigma}-\mathbf{T}$. Compared with existing algorithms, this coordinate descent algorithm is easy to understand and implement. Furthermore, from the coordinate descent viewpoint, the iterative formulas possess the monotonicity property that the objective value is non-increasing in each iteration.

Algorithm 2  The coordinate descent algorithm for solving (16)
Input: $\mathbf{\Sigma},\ q,\ \lambda,\ \varepsilon$.
Output: $(\mathbf{T}^{*},\mathbf{V}^{*})$.
1:  Initialization: Select $\mathbf{V}^{(0)}\leqslant\mathbf{\Sigma}$.
2:  Iteration: For each $k=0,1,\ldots$,
      Compute $\mathbf{T}^{(k+1)}=\bar{\mathbf{R}}_{1}\bar{\mathbf{\Lambda}}_{1}\bar{\mathbf{R}}^{\prime}_{1}$, where $\bar{\mathbf{R}}_{1}$ and $\bar{\mathbf{\Lambda}}_{1}$ correspond to the first $q$ eigenvectors and eigenvalues of $\mathbf{\Sigma}-\mathbf{V}^{(k)}$, respectively.
      Compute $\mathbf{V}^{(k+1)}=\mathrm{diag}(\mathbf{\Sigma}-\mathbf{T}^{(k+1)})/(1+\lambda)$, where $\mathrm{diag}(\mathbf{\Sigma}-\mathbf{T}^{(k+1)})$ denotes the diagonal matrix whose diagonal is the same as $\mathbf{\Sigma}-\mathbf{T}^{(k+1)}$.
      If $\mathbf{V}^{(k+1)}\leqslant\mathbf{\Sigma}$ does not hold, then stop the iterations, and output $(\mathbf{T}^{*},\mathbf{V}^{*})=(\mathbf{T}^{(k)},\mathbf{V}^{(k)})$.
      Else if $\|\mathbf{\Sigma}-\mathbf{T}^{(k)}-\mathbf{V}^{(k)}\|_{F}-\|\mathbf{\Sigma}-\mathbf{T}^{(k+1)}-\mathbf{V}^{(k+1)}\|_{F}<\varepsilon$, then stop the iterations, and output $(\mathbf{T}^{*},\mathbf{V}^{*})=(\mathbf{T}^{(k+1)},\mathbf{V}^{(k+1)})$.
      Otherwise, let $k\leftarrow k+1$.

When Assumption 1 does not hold, there is more than one pair $(\mathbf{T},\mathbf{V})$ satisfying the factor model (1). Note that $\mathbf{V}$ should be close to the zero matrix from the viewpoint of dimensionality reduction. We next propose a penalized least squares estimation approach to produce the pair in which $\mathbf{V}$ is closest to the zero matrix. With a tuning parameter $\lambda\geqslant 0$, the approach solves

$$\begin{aligned}&\min_{\mathbf{T},\mathbf{V}}\ \left\|\mathbf{\Sigma}-(\mathbf{T}+\mathbf{V})\right\|_{F}^{2}+\lambda\|\mathbf{V}\|_{F}^{2},\\ &\text{s.t.}\ \mathbf{T}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,\ \mathbf{V}\geqslant\mathbf{0}\ \text{is diagonal}.\end{aligned} \tag{16}$$

Suppose that $\mathbf{\Sigma}$ is a $\sqrt{n}$-consistent estimate of the true covariance matrix, where $n$ denotes the sample size. If $\lambda=\lambda_{n}\to 0$ and $n\lambda_{n}\to+\infty$, then the solution to (16) converges to the pair corresponding to the smallest value of $\|\mathbf{V}\|_{F}$. Recall that $\|\cdot\|_{F}$ is an objective function from which PCA can be derived. The penalized least squares approach can therefore be viewed as a combination of PCA and FA. The coordinate descent algorithm can also be used to solve (16), and we present the detailed steps in Algorithm 2.
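A minimal NumPy sketch of Algorithm 2 follows. It is an illustration under stated assumptions, not a definitive implementation: the names `coordinate_descent_fa`, `top_q_part` and `loewner_leq`, the default $\mathbf{V}^{(0)}=\mathbf{0}$, the tolerance `tol`, and the cap `max_iter` are introduced here. Algorithm 2 does not specify $\mathbf{T}^{(0)}$, so the sketch assumes it is the rank-$q$ eigen-part of $\mathbf{\Sigma}-\mathbf{V}^{(0)}$.

```python
import numpy as np

def top_q_part(M, q):
    """Rank-q matrix built from the q largest eigenvalues and eigenvectors of symmetric M."""
    w, R = np.linalg.eigh(M)                 # eigenvalues in ascending order
    w_q = np.clip(w[-q:], 0.0, None)         # guard against tiny negative round-off
    R_q = R[:, -q:]
    return (R_q * w_q) @ R_q.T

def loewner_leq(V, Sigma, tol=1e-10):
    """True if Sigma - V is positive semidefinite up to the assumed tolerance."""
    return np.linalg.eigvalsh(Sigma - V).min() >= -tol

def coordinate_descent_fa(Sigma, q, lam=0.0, eps=1e-8, V0=None, max_iter=5000):
    """Sketch of Algorithm 2 for the penalized least squares problem (16)."""
    Sigma = np.asarray(Sigma, dtype=float)
    V = np.zeros_like(Sigma) if V0 is None else np.asarray(V0, dtype=float)
    T = top_q_part(Sigma - V, q)             # assumed T^(0); not specified in Algorithm 2
    for _ in range(max_iter):
        T_new = top_q_part(Sigma - V, q)                         # PCA step for fixed V
        V_new = np.diag(np.diag(Sigma - T_new)) / (1.0 + lam)    # diagonal step for fixed T
        if not loewner_leq(V_new, Sigma):
            return T, V                      # keep the last feasible pair
        # np.linalg.norm of a 2-D array defaults to the Frobenius norm.
        drop = np.linalg.norm(Sigma - T - V) - np.linalg.norm(Sigma - T_new - V_new)
        T, V = T_new, V_new
        if drop < eps:
            break
    return T, V
```

Setting `lam=0.0` gives the coordinate descent iteration for the unpenalized problem (15), in which each step is an exact minimization over one block, so the residual norm does not increase. A positive `lam` shrinks every diagonal update of $\mathbf{V}$ by the factor $1/(1+\lambda)$.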

As in Section 4.2, we can derive penalized maximum likelihood estimation for FA and the corresponding Bayesian approach. Note that in these methods, $\mathbf{V}$ is not restricted to be a diagonal matrix. Therefore, we actually get a Bayesian version of the approximate FA approach for $\boldsymbol{X}$, which can quantify the difference between the underlying model and a factor model.

6 Combinations of PCA and FA

6.1 Path optimization between PCA and FA

In this subsection we present a path optimization problem to connect PCA and FA. Consider the one-matrix optimization problem (3). Let $F_{1,\mathrm{PCA}}$ and $F_{1,\mathrm{FA}}$ denote two objective functions corresponding to PCA and FA, respectively. For $w\in[0,1]$, let $\mathbf{T}_{w}^{*}$ denote the solution to

$$\begin{aligned}&\min_{\mathbf{T}}\ wF_{1,\mathrm{PCA}}(\mathbf{T})+(1-w)F_{1,\mathrm{FA}}(\mathbf{T}),\\ &\text{s.t.}\ \mathbf{0}\leqslant\mathbf{T}\leqslant\mathbf{\Sigma},\ \mathrm{rank}(\mathbf{T})=q.\end{aligned} \tag{17}$$

Therefore $\{\mathbf{T}_{w}^{*}\}_{w\in[0,1]}$ can be viewed as a path between PCA and FA. From the path we can see how the direction of dimensionality reduction varies from PCA to FA. This may help us better understand the difference between PCA and FA. Problem (17) can be solved by Algorithm 1. A simple choice of the objective functions in (17) is $(F_{1,\mathrm{PCA}}(\mathbf{T}),F_{1,\mathrm{FA}}(\mathbf{T}))=(\|\mathbf{\Sigma}-\mathbf{T}\|_{F}^{2},f_{2}(\mathbf{\Sigma}-\mathbf{T}))$, where $f_{2}$ is $f_{\tau}$ in (12) with $\tau=2$.
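To see how the weight $w$ acts under this simple choice, write $\mathbf{D}=\mathbf{\Sigma}-\mathbf{T}$ with entries $d_{ij}$; the symbols $\mathbf{D}$ and $d_{ij}$ are introduced here only for this illustration. Since $\mathbf{D}$ is symmetric, every off-diagonal entry appears twice in the squared Frobenius norm, and the objective in (17) can be expanded as follows.

```latex
\begin{aligned}
w\,\|\mathbf{D}\|_{F}^{2}+(1-w)\,f_{2}(\mathbf{D})
&= w\Big(\sum_{i=1}^{p} d_{ii}^{2}+2\sum_{i<j} d_{ij}^{2}\Big)+(1-w)\sum_{i<j} d_{ij}^{2}\\
&= w\sum_{i=1}^{p} d_{ii}^{2}+(1+w)\sum_{i<j} d_{ij}^{2}.
\end{aligned}
```

The off-diagonal residuals are penalized for every $w$, while the weight on the diagonal residuals equals $w$. At $w=1$ the objective is the Frobenius loss, and as $w$ decreases to $0$ the diagonal of $\mathbf{\Sigma}-\mathbf{T}$ is left increasingly free, as in FA.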

In addition, (17) provides a method to fit the factor model starting from the relatively simple PCA. For a sequence $1=w_{0}>w_{1}>\cdots>w_{m}=0$, we use PCA as the starting point to compute the solution $\mathbf{T}_{w_{1}}^{*}$, then use $\mathbf{T}_{w_{1}}^{*}$ as the starting point to compute the solution $\mathbf{T}_{w_{2}}^{*}$, and so on. Finally, $\mathbf{T}_{w_{m}}^{*}$ obtained in this way is the FA solution. Note that the optimization problem with the FA objective function may have many local solutions. The solution from the path optimization is relatively close to PCA, which achieves some consistency between PCA and FA.

We can also consider two-matrix optimization for combining PCA and FA. The penalized least squares estimation approach in (16) is an example. Similar to (16), (17) can yield approximate estimates corresponding to the smallest PCA loss for unidentifiable factor models by taking $w\to 0$.

6.2 A principal component factor model

As two dimensionality reduction methods, PCA and FA are sometimes required to have identical directions of dimensionality reduction (Tipping and Bishop 1999). In this subsection we introduce a special factor model, called the principal component factor model, which adds the constraint

$$\mathcal{L}(\mathbf{A})=\mathcal{L}(\mathbf{R}_{1}) \tag{18}$$

to the original factor model (1), where $\mathcal{L}(\mathbf{A})$ denotes the column space of $\mathbf{A}$. The probabilistic PCA of Tipping and Bishop (1999) is a special case of the principal component factor model.

We now discuss estimation for $\mathbf{T}$ and $\mathbf{V}$ under (18). Given an objective function $G(\mathbf{T},\mathbf{V})$, we estimate them by solving the optimization problem,

$$\begin{aligned}&\min_{\mathbf{T},\mathbf{V}}\ G(\mathbf{T},\mathbf{V}),\\ &\text{s.t.}\ \mathbf{T}=\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]\,\mathbf{\Gamma}\,\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]^{\prime},\ \mathbf{\Gamma}\geqslant\mathbf{0}\ \text{is diagonal},\ \mathbf{V}\geqslant\mathbf{0}\ \text{is diagonal},\end{aligned} \tag{19}$$

where $\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]$ denotes the $p\times q$ matrix corresponding to the first $q$ principal components of $\mathbf{T}+\mathbf{V}$. Note that the variables in $\mathbf{\Gamma}$, which appear only in the constraint, also need to be optimized. The objective functions in (19) corresponding to maximum likelihood estimation and least squares estimation for the principal component factor model can be found in (7) and (15), respectively.

We also use the idea of coordinate descent to solve (19). Let $\boldsymbol{\gamma}=(\gamma_{1},\ldots,\gamma_{q})^{\prime}$ and $\boldsymbol{v}=(v_{1},\ldots,v_{p})^{\prime}$ be the diagonals of $\mathbf{\Gamma}$ and $\mathbf{V}$, respectively. Note that if $\mathcal{R}_{1}[\mathbf{T}+\mathbf{V}]=\mathbf{R}_{1}$ is given, then (19) reduces to a $(p+q)$-dimensional optimization problem

$$\begin{aligned}&\min_{\boldsymbol{\gamma},\boldsymbol{v}}\ G(\mathbf{R}_{1}\mathrm{diag}(\boldsymbol{\gamma})\mathbf{R}_{1}^{\prime},\mathrm{diag}(\boldsymbol{v})),\\ &\text{s.t.}\ \gamma_{1},\ldots,\gamma_{q},v_{1},\ldots,v_{p}\geqslant 0.\end{aligned} \tag{20}$$

Therefore, we have the following iterative strategy. Given $\mathbf{T}^{(k)},\mathbf{V}^{(k)}$, compute $\mathbf{R}_{1}^{(k)}=\mathcal{R}_{1}[\mathbf{T}^{(k)}+\mathbf{V}^{(k)}]$; use (20) to get $\mathbf{\Gamma}^{(k)},\mathbf{V}^{(k+1)}$; compute $\mathbf{T}^{(k+1)}=\mathbf{R}_{1}^{(k)}\mathbf{\Gamma}^{(k)}\mathbf{R}_{1}^{(k)\prime}$.

For the objective function in least squares estimation, (20) becomes the standard quadratic programming problem,

$$\begin{aligned}&\min_{\boldsymbol{z}}\ \boldsymbol{z}^{\prime}\mathbf{H}\boldsymbol{z}-2\boldsymbol{b}^{\prime}\boldsymbol{z},\\ &\text{s.t.}\ z_{i}\geqslant 0,\ i=1,\ldots,p+q,\end{aligned} \tag{21}$$

where $\boldsymbol{z}=(\boldsymbol{\gamma}^{\prime},\boldsymbol{v}^{\prime})^{\prime}=(z_{1},\ldots,z_{p+q})^{\prime}$,

$$\mathbf{H}=\begin{pmatrix}q\boldsymbol{1}_{q}\boldsymbol{1}_{q}^{\prime}&\mathbf{R}_{1}^{\prime}\odot\mathbf{R}_{1}^{\prime}\\ \mathbf{R}_{1}\odot\mathbf{R}_{1}&\mathbf{I}_{p}\end{pmatrix},\qquad \boldsymbol{b}=\left(\sum_{i=1}^{p}\sum_{j=1}^{p}\sigma_{ij}\boldsymbol{s}_{ij}^{\prime},\,\sigma_{11},\ldots,\sigma_{pp}\right)^{\prime},$$

$\sigma_{ij}$ is the $(i,j)$-element of the current $\mathbf{\Sigma}=\mathbf{T}+\mathbf{V}$, $\boldsymbol{s}_{ij}=\boldsymbol{r}^{(i)}\odot\boldsymbol{r}^{(j)}$, $\boldsymbol{r}^{(i)}$ is the $i$th row vector of $\mathbf{R}_{1}$, and $\odot$ denotes element-wise multiplication. The solution to (21) can be used as an initial point in computing the maximum likelihood estimates.
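For the least squares objective, the reduced problem (20) with a given $\mathbf{R}_{1}$ can be sketched numerically as below. This is only an illustration: the function name `solve_ls_step` and the iteration count `n_iter` are assumptions, the quadratic form is assembled directly from the matrices $\boldsymbol{r}_{k}\boldsymbol{r}_{k}^{\prime}$ (with $\boldsymbol{r}_{k}$ the $k$th column of $\mathbf{R}_{1}$) and $\boldsymbol{e}_{i}\boldsymbol{e}_{i}^{\prime}$ as an implementation choice, and plain projected gradient descent stands in for a dedicated quadratic programming solver.

```python
import numpy as np

def solve_ls_step(Sigma, R1, n_iter=5000):
    """Minimise ||Sigma - R1 diag(gamma) R1' - diag(v)||_F^2 over gamma >= 0, v >= 0."""
    Sigma = np.asarray(Sigma, dtype=float)
    R1 = np.asarray(R1, dtype=float)
    p, q = R1.shape
    # Columns of M: vec(r_k r_k') for k = 1..q, then vec(e_i e_i') for i = 1..p,
    # so that M @ z = vec(R1 diag(gamma) R1' + diag(v)) for z = (gamma, v).
    cols = [np.outer(R1[:, k], R1[:, k]).ravel() for k in range(q)]
    cols += [np.outer(e, e).ravel() for e in np.eye(p)]
    M = np.column_stack(cols)
    H = M.T @ M                        # quadratic term of z'Hz - 2b'z
    b = M.T @ Sigma.ravel()            # linear term
    step = 1.0 / (2.0 * np.linalg.eigvalsh(H).max())  # 1 / Lipschitz constant of the gradient
    z = np.zeros(p + q)
    for _ in range(n_iter):
        grad = 2.0 * H @ z - 2.0 * b
        z = np.maximum(z - step * grad, 0.0)           # projected gradient step onto z >= 0
    return z[:q], z[q:]
```

For $p=2$, $q=1$ and $\mathbf{\Sigma}$ with rows $(2,1)$ and $(1,3)$, taking $\mathbf{R}_{1}$ as the leading eigenvector gives a perfect fit: $\mathbf{\Sigma}=(\lambda_{1}-\lambda_{2})\boldsymbol{r}_{1}\boldsymbol{r}_{1}^{\prime}+\lambda_{2}\mathbf{I}_{2}$ with $\lambda_{1}-\lambda_{2}=\sqrt{5}$ and $\lambda_{2}=(5-\sqrt{5})/2$, and the sketch recovers these values.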

The above principal component factor model can be extended to cover general factor models. Let $\mathbf{R}_{(i)}$ denote the $p\times i$ sub-matrix of $\mathbf{R}$ corresponding to the first $i$ eigenvectors, $1\leqslant i\leqslant p$. Consider the factor model (1) with a factor loading matrix $\mathbf{A}$. We define the principal component index of this model as the smallest integer $q^*$ such that $\mathcal{L}(\mathbf{A})\subset\mathcal{L}(\mathbf{R}_{(q^*)})$. Therefore any factor model, whose principal component index is at most $p$, is connected to PCA. In addition, for an unidentifiable factor model, we may improve its identifiability by reducing its principal component index.
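The definition of the principal component index can be sketched numerically as follows. This is only an illustration under stated assumptions: the columns of $\mathbf{R}$ are taken to be the eigenvectors of $\mathbf{\Sigma}$ ordered by decreasing eigenvalue, the function name `principal_component_index`, the tolerance `tol`, and the toy covariance matrix are hypothetical and not part of the paper, and the inclusion $\mathcal{L}(\mathbf{A})\subset\mathcal{L}(\mathbf{R}_{(i)})$ is checked through the residual of an orthogonal projection.

```python
import numpy as np


def principal_component_index(sigma, loadings, tol=1e-8):
    """Return the smallest i with span(loadings) inside span(first i eigenvectors).

    Assumption: the columns of R are eigenvectors of `sigma`, ordered by
    decreasing eigenvalue (ties make R, and hence the index, non-unique).
    """
    eigvals, eigvecs = np.linalg.eigh(sigma)      # eigenvalues in ascending order
    R = eigvecs[:, np.argsort(eigvals)[::-1]]     # reorder to descending
    p = sigma.shape[0]
    scale = max(1.0, np.linalg.norm(loadings))
    for i in range(1, p + 1):
        R_i = R[:, :i]                            # p x i sub-matrix R_(i)
        # Residual of projecting the loading matrix onto span(R_(i)).
        residual = loadings - R_i @ (R_i.T @ loadings)
        if np.linalg.norm(residual) <= tol * scale:
            return i
    return p  # R_(p) spans the whole space; this is only a numerical fallback


# Toy covariance with distinct eigenvalues; its eigenvectors are the coordinate axes.
sigma_toy = np.diag([3.0, 2.0, 1.0])
index_first_axis = principal_component_index(sigma_toy, np.array([[2.0], [0.0], [0.0]]))
index_two_axes = principal_component_index(sigma_toy, np.array([[1.0], [1.0], [0.0]]))
index_last_axis = principal_component_index(sigma_toy, np.array([[0.0], [0.0], [1.0]]))
print(index_first_axis, index_two_axes, index_last_axis)
```

A loading vector along the leading eigenvector has index 1, while a loading vector that mixes the first two eigenvectors has index 2.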

7 Numerical examples

7.1 Penalized least squares estimation

We use numerical examples to illustrate the penalized least squares method for unidentifiable factor models in Section 5.2. Consider the following two covariance matrices,

$$\begin{aligned}
\mathrm{(I)}\ \mathbf{\Sigma}&=\begin{pmatrix}2&1\\ 1&3\end{pmatrix}=\begin{pmatrix}1/\sqrt{3}\\ \sqrt{3}\end{pmatrix}\begin{pmatrix}1/\sqrt{3}&\sqrt{3}\end{pmatrix}+\begin{pmatrix}5/3&0\\ 0&0\end{pmatrix},\\
\mathrm{(II)}\ \mathbf{\Sigma}&=\begin{pmatrix}3&-1&-2&2\\ -1&2&0&-1\\ -2&0&4&-2\\ 2&-1&-2&2\end{pmatrix}=\begin{pmatrix}1&-1\\ -1&0\\ 0&2\\ 1&-1\end{pmatrix}\begin{pmatrix}1&-1&0&1\\ -1&0&2&-1\end{pmatrix}+\begin{pmatrix}1&0&0&0\\ 0&1&0&0\\ 0&0&0&0\\ 0&0&0&0\end{pmatrix}.
\end{aligned}$$

Since all factor models with $(p,q)=(2,1)$ or $(p,q)=(4,2)$ are unidentifiable (Anderson and Rubin 1956), $\mathbf{\Sigma}$ can be written in the form $\mathbf{A}\mathbf{A}'+\mathbf{V}$ with different $(\mathbf{A},\mathbf{V})$. The above equations show the ones with the smallest $\|\mathbf{V}\|_F$. For (I) and (II), we solve the penalized least squares problem (16) with varying tuning parameter $\lambda$ by Algorithm 2, and show the results in Figure 1. It can be seen that, as $1/\lambda$ increases, the $\|\mathbf{V}\|_F$ resulting from the penalized least squares estimates converges to the minimal $\|\mathbf{V}\|_F$. Meanwhile, the least squares loss converges to zero, which indicates that the estimates are consistent with the structure of factor models for large $1/\lambda$. Therefore, the proposed penalized least squares estimation can effectively yield estimates with good dimensionality reduction performance under unidentifiable factor models.
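The non-uniqueness of $(\mathbf{A},\mathbf{V})$ can be checked directly. The following is a minimal sketch, not the penalized least squares procedure itself: it verifies the two displayed decompositions and compares (I) with a second decomposition, $\boldsymbol{a}=(1,1)'$ and $\mathbf{V}=\mathrm{diag}(1,2)$, which is a hypothetical alternative chosen here for illustration and does not appear in the paper. The helper name `reconstructs` is likewise an assumption.

```python
import numpy as np

# Covariance matrix (I) and the decomposition displayed in the text.
sigma_I = np.array([[2.0, 1.0], [1.0, 3.0]])
a_text = np.array([[1.0 / np.sqrt(3.0)], [np.sqrt(3.0)]])
v_text = np.diag([5.0 / 3.0, 0.0])

# A second decomposition of the same matrix (hypothetical, for illustration only).
a_alt = np.array([[1.0], [1.0]])
v_alt = np.diag([1.0, 2.0])

# Covariance matrix (II) and the decomposition displayed in the text.
sigma_II = np.array([
    [3.0, -1.0, -2.0, 2.0],
    [-1.0, 2.0, 0.0, -1.0],
    [-2.0, 0.0, 4.0, -2.0],
    [2.0, -1.0, -2.0, 2.0],
])
A_II = np.array([[1.0, -1.0], [-1.0, 0.0], [0.0, 2.0], [1.0, -1.0]])
V_II = np.diag([1.0, 1.0, 0.0, 0.0])


def reconstructs(sigma, A, V):
    """Check whether A A' + V reproduces sigma up to rounding error."""
    return bool(np.allclose(A @ A.T + V, sigma))


ok_text = reconstructs(sigma_I, a_text, v_text)
ok_alt = reconstructs(sigma_I, a_alt, v_alt)
ok_II = reconstructs(sigma_II, A_II, V_II)

# Frobenius norms of the two diagonal parts for (I).
norm_text = np.linalg.norm(v_text)
norm_alt = np.linalg.norm(v_alt)
print(ok_text, ok_alt, ok_II, norm_text, norm_alt)
```

Both pairs reproduce the covariance matrix (I) exactly, which is what unidentifiability means here, and the alternative pair has the larger $\|\mathbf{V}\|_F$ of the two.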


Figure 1: Results of penalized least squares estimation for (I) and (II) in Section 7.1.

7.2 Path optimization

We show two examples of the path between PCA and FA introduced in Section 6.1. Consider the following two covariance matrices,

$$\begin{aligned}
\mathrm{(i)}\ \mathbf{\Sigma}&=\begin{pmatrix}2&1&1\\ 1&3&1\\ 1&1&3\end{pmatrix}=\begin{pmatrix}1\\ 1\\ 1\end{pmatrix}\begin{pmatrix}1&1&1\end{pmatrix}+\begin{pmatrix}1&0&0\\ 0&2&0\\ 0&0&2\end{pmatrix},\\
\mathrm{(ii)}\ \mathbf{\Sigma}&=\begin{pmatrix}3&-2&-1&1&-2\\ -2&5&0&-2&2\\ -1&0&4&0&1\\ 1&-2&0&4&-1\\ -2&2&1&-1&5\end{pmatrix}=\begin{pmatrix}1&-1\\ -2&0\\ 0&1\\ 1&0\\ -1&1\end{pmatrix}\begin{pmatrix}1&-2&0&1&-1\\ -1&0&1&0&1\end{pmatrix}+\begin{pmatrix}1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&3&0&0\\ 0&0&0&3&0\\ 0&0&0&0&3\end{pmatrix}.
\end{aligned}$$

We solve the path optimization problem (17) with objective functions $(F_{1,\mathrm{PCA}}(\mathbf{T}),F_{1,\mathrm{FA}}(\mathbf{T}))=(\|\mathbf{\Sigma}-\mathbf{T}\|_F^2,f_2(\mathbf{\Sigma}-\mathbf{T}))$. For each $w$, we use Algorithm 1 to compute the solution. The results are displayed in Figure 2. It can be seen that, as $w$ varies from 1 to 0, the objective value of PCA increases, while that of FA decreases to 0. This indicates that the fitted model moves closer to the factor model and eventually reaches it.
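The behaviour of the PCA objective at the two ends of the path can be illustrated for example (i). The sketch below does not implement problem (17) or Algorithm 1. It only evaluates $\|\mathbf{\Sigma}-\mathbf{T}\|_F^2$ at the rank-one eigenvalue truncation, taken here as the $w=1$ end, and at $\mathbf{T}=\boldsymbol{1}\boldsymbol{1}'$ from the displayed factor decomposition, which is assumed (not shown in the text) to be the point reached at $w=0$. The function name `pca_objective` is hypothetical.

```python
import numpy as np

# Covariance matrix of example (i) and the number of factors.
sigma = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])
q = 1

# PCA end of the path: keep the q leading eigenpairs of sigma.
eigvals, eigvecs = np.linalg.eigh(sigma)          # ascending eigenvalues
top = np.argsort(eigvals)[::-1][:q]               # indices of the q largest
T_pca = (eigvecs[:, top] * eigvals[top]) @ eigvecs[:, top].T

# FA end of the path (assumption): T = a a' from the displayed decomposition of (i).
a = np.ones((3, 1))
T_fa = a @ a.T


def pca_objective(sigma, T):
    """Squared Frobenius loss ||sigma - T||_F^2 used as the PCA objective."""
    return float(np.linalg.norm(sigma - T, "fro") ** 2)


loss_at_pca = pca_objective(sigma, T_pca)
loss_at_fa = pca_objective(sigma, T_fa)
print(loss_at_pca, loss_at_fa)
```

The loss at the factor-model end equals $\|\mathbf{V}\|_F^2=1+4+4=9$ and is larger than the loss at the eigenvalue truncation, in line with the increase of the PCA objective described above.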


Figure 2: Path from PCA to FA for (i) and (ii) in Section 7.2.

7.3 Examples of fitting the principal component factor model

We now use two real datasets to fit the principal component factor model in Section 6.2. The first dataset comes from 48 Chinese cities, and contains data of three indices of urban development in the year 2008. These indices are as follows:

$X_1$: per-capita investment in the fixed assets (thousand RMB)
$X_2$: per-capita net income of rural people (thousand RMB)
$X_3$: birthrate (‰)

All data used in this example can be found on the internet (available from the author).

Here $p=3$, and thus we only need to consider $q=1$ for fitting a factor model. We first compute the maximum likelihood estimate of the population covariance matrix as

$$\mathbf{\Sigma}=\begin{pmatrix}82.5524&4.6990&-5.6177\\ 4.6990&4.6262&-1.5502\\ -5.6177&-1.5502&4.7571\end{pmatrix}.$$

The first principal component is $\boldsymbol{r}_1=(0.9971,0.0441,-0.0614)'$. We then compute the maximum likelihood estimates of the parameters in the principal component factor model by the iterative method in Section 6.2, and obtain the fitted model,

$$\left\{\begin{aligned} &X_1=8.8792\,U+\varepsilon_1,\\ &X_2=0.3928\,U+\varepsilon_2,\\ &X_3=-0.5466\,U+\varepsilon_3,\end{aligned}\right. \tag{22}$$

where $U\sim N(0,1)$, $\varepsilon_1\sim N(0,2.0827^2)$, $\varepsilon_2\sim N(0,2.0789^2)$, $\varepsilon_3\sim N(0,2.0819^2)$ are independent. The factor loading vector in (22) is $\boldsymbol{a}=(8.8792,0.3928,-0.5466)'=\sqrt{79.2938}\,\boldsymbol{r}_1\approx 8.9047\,\boldsymbol{r}_1$, which indicates that $\boldsymbol{a}$ has the same direction as the first principal component. We also conduct the likelihood ratio test for testing whether the covariance structure of the principal component factor model holds, and get $p$-value $=0.1538$.
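The proportionality between the loading vector and the first principal component can be checked from the reported numbers alone. The following is a small sketch using only the rounded values printed in (22) and in the text; the variable names are hypothetical, and agreement is expected only up to the four reported decimals.

```python
import numpy as np

# Rounded values reported in the text for the first dataset.
a = np.array([8.8792, 0.3928, -0.5466])     # factor loading vector in (22)
r1 = np.array([0.9971, 0.0441, -0.0614])    # first principal component

# Component-wise ratios a_i / r_{1i}; a common value means a is a multiple of r1.
ratios = a / r1

# Squared length of a, and the cosine of the angle between a and r1.
squared_norm = float(a @ a)
cosine = float(a @ r1 / (np.linalg.norm(a) * np.linalg.norm(r1)))
print(ratios, squared_norm, cosine)
```

The three ratios agree to about two decimals at roughly 8.90, and the squared length of $\boldsymbol{a}$ is close to 79.29, so the scalar multiplying the unit vector $\boldsymbol{r}_1$ is the square root of that quantity.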

In (22), the common factor $U$ can be interpreted as the level of development. With a higher level of development, the two economic indicators, $X_1$ and $X_2$, are expected to be high, but the birthrate has a decreasing trend.

The second dataset contains data of air pollution emissions. It is known that air pollution is one of the world’s largest health and environmental problems. A feasible method to prevent air pollution is to analyze the sources of air pollution emissions. We now study the relationship among concentrations (unit: $\mathrm{mg}/\mathrm{m}^3$) of five ions in the air. The ions are as follows:

$X_1$: chloride ion ($\text{Cl}^-$)
$X_2$: ammonium ion ($\text{NH}_4^+$)
$X_3$: calcium ion ($\text{Ca}^{2+}$)
$X_4$: sodium ion ($\text{Na}^+$)
$X_5$: potassium ion ($\text{K}^+$)

The data used in this example are sampled from 56 pollution events in 2001-2007 in a Chinese city, which has suffered from serious air pollution.

Consider $q=2$ for fitting a principal component factor model in Section 6.2. We first compute the maximum likelihood estimate of the population covariance matrix as

$$\mathbf{\Sigma}=\begin{pmatrix}134.7&848.1&460.4&118.3&62.0\\ 848.1&7706.9&2941.1&885.2&296.7\\ 460.4&2941.1&2189.4&576.9&302.9\\ 118.3&885.2&576.9&264.8&70.4\\ 62.0&296.7&302.9&70.4&61.9\end{pmatrix}.$$

The first two principal components are $\boldsymbol{r}_1=(-0.1036,-0.9033,-0.3944,-0.1259,-0.0426)'$ and $\boldsymbol{r}_2=(-0.0662,0.4193,-0.8006,-0.4053,-0.1207)'$. We then compute the maximum likelihood estimates of the parameters in the principal component factor model, and obtain the fitted model,

$$\left\{\begin{aligned} &X_1=-9.9518\,U_1-2.1146\,U_2+\varepsilon_1,\\ &X_2=-86.7484\,U_1+13.3896\,U_2+\varepsilon_2,\\ &X_3=-37.8793\,U_1-25.5632\,U_2+\varepsilon_3,\\ &X_4=-12.0859\,U_1-12.9416\,U_2+\varepsilon_4,\\ &X_5=-4.0875\,U_1-3.8551\,U_2+\varepsilon_5,\end{aligned}\right. \tag{23}$$

where $U_1\sim N(0,1)$, $U_2\sim N(0,1)$, $\varepsilon_1\sim N(0,4.7523^2)$, $\varepsilon_2\sim N(0,10.1974^2)$, $\varepsilon_3\sim N(0,7.5504^2)$, $\varepsilon_4\sim N(0,10.2346^2)$, $\varepsilon_5\sim N(0,3.9732^2)$ are independent. The factor loading vectors in (23) are $\sqrt{9221.9}\,\boldsymbol{r}_1\approx 96.03\,\boldsymbol{r}_1$ and $\sqrt{1019.6}\,\boldsymbol{r}_2\approx 31.93\,\boldsymbol{r}_2$. We also conduct the likelihood ratio test for testing whether the covariance structure of the principal component factor model holds, and get $p$-value $=0.3036$. The common factors $U_1$ and $U_2$ can be interpreted as two sources of air pollution.
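The covariance matrix implied by the fitted model (23) can be compared with the estimate it was fitted to. The sketch below is an illustration built only from the rounded numbers reported above; it forms $\mathbf{A}\mathbf{A}'+\mathbf{V}$, the squared lengths of the two loading columns, and their inner product. All variable names are hypothetical, and the relative Frobenius gap is a descriptive measure chosen here, not the likelihood ratio statistic used in the text.

```python
import numpy as np

# Loading matrix read off from (23): column 1 multiplies U1, column 2 multiplies U2.
A = np.array([
    [-9.9518, -2.1146],
    [-86.7484, 13.3896],
    [-37.8793, -25.5632],
    [-12.0859, -12.9416],
    [-4.0875, -3.8551],
])
# Standard deviations of the specific errors reported after (23).
noise_sd = np.array([4.7523, 10.1974, 7.5504, 10.2346, 3.9732])

# Maximum likelihood estimate of the covariance matrix reported in the text.
sample_cov = np.array([
    [134.7, 848.1, 460.4, 118.3, 62.0],
    [848.1, 7706.9, 2941.1, 885.2, 296.7],
    [460.4, 2941.1, 2189.4, 576.9, 302.9],
    [118.3, 885.2, 576.9, 264.8, 70.4],
    [62.0, 296.7, 302.9, 70.4, 61.9],
])

# Covariance implied by the fitted factor model: A A' + diag(noise variances).
fitted_cov = A @ A.T + np.diag(noise_sd ** 2)

# Squared lengths of the loading columns and their inner product.
squared_norms = (A ** 2).sum(axis=0)
column_inner = float(A[:, 0] @ A[:, 1])

# Relative Frobenius distance between the fitted and the estimated covariance.
rel_gap = float(np.linalg.norm(fitted_cov - sample_cov) / np.linalg.norm(sample_cov))
print(squared_norms, column_inner, rel_gap)
```

The two loading columns are nearly orthogonal, as multiples of distinct eigenvectors should be, and their squared lengths are close to 9221.9 and 1019.6.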

8 Discussion

In this paper we have proposed a unified way to handle PCA and FA under a general latent variable model. With specific loss functions, the two methods can be viewed as solutions to matrix optimization problems. Along this way, we have discussed a number of topics, including effective algorithms, penalized methods, and some combinations of PCA and FA.

The use of this paper is twofold. First, many students, including myself as a student, are confused about the connections and differences between PCA and FA when taking a course in multivariate statistical analysis, because the two methods differ considerably from other textbook topics such as statistical inference. The proposed framework for handling PCA and FA is relatively simple, and can complement current textbooks. Hopefully this will help students understand the two methods better.

Second, we have provided new viewpoints for reexamining the two classical methods, and have presented new methods for data analysis. Researchers may find future directions in them. For example, there are numerous other loss functions for fitting the model (2). A natural class of objective functions in (3) is $F_1(\mathbf{T})=d\left(N(\boldsymbol{0},\mathbf{\Sigma}),N(\boldsymbol{0},\mathbf{T})\right)$, where $d$ evaluates some distance between two distributions. Note that the Wasserstein distance, which is related to the optimal transport problem, has been widely applied in machine learning and statistics (Panaretos and Zemel 2020). With $d$ being the Wasserstein distance, $F_1$ has the closed form $F_1(\mathbf{T})=\mathrm{trace}\left(\mathbf{\Sigma}+\mathbf{T}-2\left(\mathbf{\Sigma}^{1/2}\mathbf{T}\mathbf{\Sigma}^{1/2}\right)^{1/2}\right)$ (Knott and Smith 1984). We conjecture that the solution to (3) with this objective function yields PCA. Our numerous numerical experiments suggest that the conjecture holds even when the constraint $\mathbf{T}\leqslant\mathbf{\Sigma}$ is removed, as in (6). This conjecture may establish a connection between PCA and optimal transport. Note that eigenvalue decomposition is computationally expensive for large datasets. We also hope to find efficient iterative algorithms for specific objective functions that induce PCA, as alternatives to eigenvalue decomposition (Roweis 1997; Hippert-Ferrer, Bouchard, and Mian et al. 2023). Another direction is to extend our methods to more complex problems, including supervised PCA (Bair, Hastie, and Paul et al. 2006), sparse PCA (Zou, Hastie, and Tibshirani 2006; Johnstone and Lu 2009), nonlinear dimensionality reduction (Ghojogh, Crowley, and Karray et al. 2023), categorical data analysis (Bhattacharya and Dunson 2012), and estimation of intrinsic dimension (Levina and Bickel 2004).
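The closed form quoted above is easy to evaluate numerically. The sketch below is only an evaluation of the objective, not a solver for (3) and not evidence for the conjecture. It computes $F_1(\mathbf{T})$ for the covariance matrix (i) of Section 7.2 at the rank-one eigenvalue truncation, where the formula reduces to the sum of the discarded eigenvalues, and at $\mathbf{T}=\mathbf{\Sigma}$, where it vanishes. The function names `psd_sqrt`, `wasserstein_objective`, and `pca_truncation` are hypothetical, and the matrix square root is computed by eigendecomposition, which assumes symmetric positive semidefinite input.

```python
import numpy as np


def psd_sqrt(M):
    """Symmetric square root of a positive semidefinite matrix via eigh."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)   # symmetrize against rounding error
    w = np.clip(w, 0.0, None)                # remove tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T


def wasserstein_objective(sigma, T):
    """trace(sigma + T - 2 (sigma^{1/2} T sigma^{1/2})^{1/2})."""
    root = psd_sqrt(sigma)
    cross = psd_sqrt(root @ T @ root)
    return float(np.trace(sigma + T - 2.0 * cross))


def pca_truncation(sigma, q):
    """Rank-q matrix built from the q leading eigenpairs of sigma."""
    w, V = np.linalg.eigh(sigma)
    top = np.argsort(w)[::-1][:q]
    return (V[:, top] * w[top]) @ V[:, top].T


sigma = np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 3.0]])
T_pca = pca_truncation(sigma, 1)

value_at_pca = wasserstein_objective(sigma, T_pca)
value_at_sigma = wasserstein_objective(sigma, sigma)
print(value_at_pca, value_at_sigma)
```

For this matrix the eigenvalues are $3+\sqrt{3}$, $2$, and $3-\sqrt{3}$, so the value at the truncation is $5-\sqrt{3}$ up to rounding.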

Appendix

Proof of Theorem 1 Under Assumption 2, for fixed $\mathbf{T}$, the solution to (4) with $F_2(\mathbf{T},\mathbf{V})=h(\mathbf{T}+\mathbf{V})+F_1(\mathbf{T})$ is $\mathbf{V}^*(\mathbf{T})=\mathbf{\Sigma}-\mathbf{T}$. Then the optimal $\mathbf{T}^*$ is the solution to (3), and this completes the proof. ∎

Proof of Lemma 1 It suffices to show that $\|\mathbf{S}\|_2\geqslant s_{11}$, which is obtained by noting $\|\mathbf{S}\|_2=\max_{\boldsymbol{x},\ \|\boldsymbol{x}\|_2=1}\boldsymbol{x}'\mathbf{S}\boldsymbol{x}$ and taking $\boldsymbol{x}=(1,0,\ldots,0)'$. ∎

Proof of Theorem 2 Let $\tilde{\mathbf{T}}=\mathbf{R}'\mathbf{T}\mathbf{R}$. By Assumption 4, (3) is equivalent to

$$\begin{aligned}&\min_{\mathbf{T}}\ f(\mathbf{\Lambda}-\tilde{\mathbf{T}}),\\ &\text{s.t.}\ \mathbf{0}\leqslant\tilde{\mathbf{T}}\leqslant\mathbf{\Lambda},\ \mathrm{rank}(\tilde{\mathbf{T}})=q.\end{aligned} \tag{24}$$

By Assumption 5, we only need to consider $\tilde{\mathbf{T}}$ to be a diagonal matrix. By Assumption 6, the solution to (24) is $\tilde{\mathbf{T}}^*=\mathrm{diag}(\lambda_1,\ldots,\lambda_q,0,\ldots,0)$, which completes the proof. ∎
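For the squared Frobenius loss, which is the PCA objective used in Section 7.2, the conclusion of Theorem 2 can be illustrated numerically on the covariance matrix (ii) with $q=2$. The sketch below is not a proof and makes its own assumptions: it builds $\mathbf{T}^*=\mathbf{R}\,\mathrm{diag}(\lambda_1,\ldots,\lambda_q,0,\ldots,0)\,\mathbf{R}'$, checks that $\mathbf{0}\leqslant\mathbf{T}^*\leqslant\mathbf{\Sigma}$ through the smallest eigenvalues, and compares its loss with that of randomly drawn rank-$q$ positive semidefinite matrices $\mathbf{B}\mathbf{B}'$. The random competitors, the seed, and all variable names are hypothetical choices, and the competitors need not satisfy $\mathbf{T}\leqslant\mathbf{\Sigma}$.

```python
import numpy as np

# Covariance matrix (ii) from Section 7.2 and the target rank.
sigma = np.array([
    [3.0, -2.0, -1.0, 1.0, -2.0],
    [-2.0, 5.0, 0.0, -2.0, 2.0],
    [-1.0, 0.0, 4.0, 0.0, 1.0],
    [1.0, -2.0, 0.0, 4.0, -1.0],
    [-2.0, 2.0, 1.0, -1.0, 5.0],
])
q = 2

# Eigendecomposition sigma = R diag(eigvals) R' with eigenvalues sorted descending.
eigvals, R = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, R = eigvals[order], R[:, order]

# Candidate solution: keep the q largest eigenvalues and set the rest to zero.
T_star = (R[:, :q] * eigvals[:q]) @ R[:, :q].T
loss_star = float(np.linalg.norm(sigma - T_star, "fro") ** 2)
tail_energy = float((eigvals[q:] ** 2).sum())     # sum of squared discarded eigenvalues

# Feasibility: both T_star and sigma - T_star should be positive semidefinite.
min_eig_T = float(np.linalg.eigvalsh(T_star).min())
min_eig_gap = float(np.linalg.eigvalsh(sigma - T_star).min())

# Losses of random rank-q positive semidefinite competitors B B'.
rng = np.random.default_rng(0)
random_losses = []
for _ in range(200):
    B = rng.normal(size=(sigma.shape[0], q))
    random_losses.append(float(np.linalg.norm(sigma - B @ B.T, "fro") ** 2))

print(loss_star, tail_energy, min(random_losses))
```

The loss at $\mathbf{T}^*$ equals the sum of the squared discarded eigenvalues, and no random competitor attains a smaller value, as the classical low-rank approximation result for the Frobenius norm guarantees.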

Proof of Theorem 5 Note that $(\mathbf{T}^*,\mathbf{V}^*)$ is the solution to

$$\begin{aligned}&\min_{\mathbf{T},\mathbf{V}}\ l(\mathbf{T},\mathbf{V}),\\ &\text{s.t.}\ \mathbf{T}\geqslant\mathbf{0},\ \mathbf{V}\geqslant\mathbf{0},\ \mathrm{rank}(\mathbf{T})=q,\ \|\mathbf{V}\|_2\leqslant\kappa,\end{aligned}$$

where $\kappa\geqslant\|\mathbf{V}^*\|_2$ is a tuning parameter. This completes the proof. ∎

Proof of Theorem 6 Under model (1), $\tilde{f}_0(\mathbf{\Sigma}-\mathbf{T}^*)=\tilde{f}_0(\mathbf{V})\leqslant p$. By (11), for any matrix $\mathbf{S}$ satisfying the constraints in (3), $\tilde{f}_0(\mathbf{\Sigma}-\mathbf{S})<\tilde{f}_0(\mathbf{\Sigma}-\mathbf{T}^*)$ if and only if $\mathbf{\Sigma}-\mathbf{S}=\mathbf{0}$, which is a contradiction since $\mathrm{rank}(\mathbf{S})=q<\mathrm{rank}(\mathbf{\Sigma})$. ∎

Acknowledgements

This work is partially supported by National Key R&D Program of China (Grant nos. 2021YFA1000300, 2021YFA1000301, and 2021YFA1000303) and the National Natural Science Foundation of China (Grant no. 12171462).

References

Anderson, T. W. (1984), An Introduction to Multivariate Statistical Analysis, Second Edition, John Wiley & Sons.

Anderson, T. and Rubin, H. (1956), Statistical inference in factor analysis. In: Neyman, J., ed., Proc. 3rd Berkeley Symp. Mathematical Statistics and Probability, vol. V, Berkeley, CA: University of California Press, pp. 111-150.

Bhattacharya, A. and Dunson, D. B. (2012), Simplex factor models for multivariate unordered categorical data. Journal of the American Statistical Association, 107, 362-377.

Bair, E., Hastie, T., Paul, D., and Tibshirani, R. (2006), Prediction by supervised principal components. Journal of the American Statistical Association, 101, 119-137.

Bartholomew, D., Knott, M., and Moustaki, I. (2011), Latent Variable Models and Factor Analysis: A Unified Approach, Third Edition, John Wiley & Sons.

Ghojogh, B., Crowley, M., Karray, F., and Ghodsi, A. (2023), Elements of Dimensionality Reduction and Manifold Learning, Springer Nature.

Giudici, P. and Stanghellini, E. (2001), Bayesian inference for graphical factor analysis models. Psychometrika, 66, 577-592.

Grzebyk, M., Wild, P., and Chouanière, D. (2004), On identification of multi-factor models with correlated residuals. Biometrika, 91, 141-151.

Hippert-Ferrer, A., Bouchard, F., Mian, A., Vayer, T., and Breloy, A. (2023), Learning graphical factor models with Riemannian optimization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 349-366). Cham: Springer Nature Switzerland. arXiv:2210.11950v2

Levina, E., and Bickel, P. (2004), Maximum likelihood estimation of intrinsic dimension. Advances in Neural Information Processing Systems, 17.

Johnstone, I. M. and Lu, A. Y. (2009), On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104, 682-693.

Jolliffe, I. T. (2002), Principal Component Analysis, Second Edition, New York: Springer.

Jolliffe, I. T. and Cadima, J. (2016), Principal component analysis: a review and recent developments. Phil. Trans. R. Soc. A 374: 20150202.

Knott, M. and Smith, C. S. (1984), On the optimal mapping of distributions. Journal of Optimization Theory and Applications, 43: 39-49.

Kolda, T. G., Lewis, R. M., and Torczon, V. (2003), Optimization by direct search: New perspectives on some classical and modern methods. SIAM Review, 45: 385-482.

Panaretos, V. M. and Zemel, Y. (2020), An Invitation to Statistics in Wasserstein Space, Springer.

Roweis, S. (1997), EM algorithms for PCA and SPCA. Neural Inf. Proc. Syst., 10: 626-632.

Shapiro, A. (1985), Identifiability of factor analysis: some results and open problems. Linear Algebra and Its Applications, 70: 1-7.

Srebro, N. and Jaakkola, T. (2003), Weighted low-rank approximations. In ICML, Vol. 3, 720-727.

Tipping, M. E. and Bishop, C. M. (1999), Probabilistic principal component analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 61, 611-622.

Tseng, P. (2001), Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109, 475-494.

Zou, H., Hastie, T., and Tibshirani, R. (2006), Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15, 265-286.