Efficient algorithms for regularized Poisson Non-negative Matrix Factorization
Abstract
We consider the regularized Poisson Non-negative Matrix Factorization (NMF) problem, encompassing various regularization terms such as Lipschitz and relatively smooth functions, alongside linear constraints. This problem holds significant relevance in numerous Machine Learning applications, particularly within the domain of physical linear unmixing problems. A notable challenge arises from the main loss term in the Poisson NMF problem being a KL divergence, which is non-Lipschitz, rendering traditional gradient descent-based approaches inefficient. In this contribution, we explore the utilization of Block Successive Upper-bound Minimization (BSUM) to overcome this challenge. We build appropriate majorizing functions for Lipschitz and relatively smooth functions, and show how to introduce linear constraints into the problem. This results in the development of two novel algorithms for regularized Poisson NMF. We conduct numerical simulations to showcase the effectiveness of our approach.
Disclaimer
This document is a technical report and has not undergone peer review. The findings and conclusions presented herein are solely based on the authors’ research and analysis. We apologize for any potential errors or shortcomings in the content.
1 Introduction
The problem of factorizing a matrix as Non Negative components is central in many Machine Learning (ML) applications [47, 10, 4]. The motivation for performing such a factorization is that is often associated with a probability distribution density of the form . Typically, the optimal decomposition is found by minimizing the negative log-likelihood of that distribution:
(1) |
If is assumed to be perturbed with Normal noise, we obtain a Gaussian distribution, i.e. , and we end up with the classic non-negative matrix factorization (NMF) problem [29, 30], where the quadratic function is minimized. For other distribution families, and in particular exponential families, a term often appears. In particular, the Poisson negative log-likelihood model [29] leads to a loss of the form
(2) |
where the inner product over matrices is the Frobenius inner product defined as .
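As a minimal sketch of this loss, the snippet below evaluates the Poisson negative log-likelihood up to the terms that do not depend on the factors. The names `X` (data), `W` and `H` (factors), and the function `poisson_nmf_loss` are assumptions introduced here for illustration only.

```python
import numpy as np

def poisson_nmf_loss(X, W, H):
    """Poisson negative log-likelihood of X for the rate W @ H,
    dropping the log(X!) term, which is constant in W and H."""
    WH = W @ H
    # <1, WH> - <X, log(WH)>, both Frobenius inner products
    return float(np.sum(WH) - np.sum(X * np.log(WH)))

rng = np.random.default_rng(0)
W = rng.uniform(0.5, 1.5, size=(6, 2))   # strictly positive factors
H = rng.uniform(0.5, 1.5, size=(2, 5))
X = rng.poisson(W @ H).astype(float)     # synthetic Poisson counts
print(poisson_nmf_loss(X, W, H))
```

For a fixed data matrix, each entry of this loss is minimized when the rate equals the observation, which is why the factorization is driven towards reproducing the data.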
Regularized Poisson Non Negative Matrix Factorisation
In many problems (see, for example, [51, 50, 21, 52]), additional prior information about the matrices is known. For example, it might be known that the columns of are smooth, or that the rows of are sparse. One might also be interested in normalizing to unity the columns of or the rows of because they might quantify physical quantities for which normalization is necessary. For example, in the analysis of hyperspectral imaging data, the images are assumed to be smooth and the components are assumed to sum to unity [51, 52]. Typically, this information can be encoded via an extra regularization term and/or additional constraints . This leads to the general optimization problem we solve in this contribution:
(3) |
where . The different colors emphasize the changes compared to the traditional problem of [29]. First, in violet, we slightly simplify the problem by imposing strict non-negativity. While this is not strictly necessary (footnote 1: the case could be handled with an approach similar to [33]), this assumption significantly simplifies our analysis. We believe that handling is an unnecessary complication, as the results with close to machine precision will be practically identical. Second, in light blue, we consider the case where the constraints are linear, specifically or . In general, it is only meaningful to use one of the constraints, as it will fix the ratio between and . We note that this includes the simplex constraint when . Third, in brown, we consider regularizations of the form:
(4) |
where is the vector of a row of or a column of , i.e. and
1. The term is assumed to be gradient Lipschitz, i.e., there exists a such that
(5)
Alternatively, this condition could be rewritten as
In this contribution, we will consider in particular , where is the Laplacian operator, favoring smoothness in the columns of . In this case .
2. The term is assumed to be relatively smooth with respect to . Relative smoothness is a generalization of Lipschitz smoothness, and is defined as follows [35, Definition 1.1]:
(6)
where is the Bregman divergence [3] associated with [35, equation 7]:
While Lipschitz functions can be upper bounded with quadratic functions, we observe from (6) that relative smoothness allows us to upper bound a function with the Bregman divergence of a function . This allows us to use a much wider range of functions to regularize our problem, and in particular non-gradient Lipschitz ones. As an example, let us consider , with its Bregman divergence:
One can observe that the objective function (2) is relatively smooth with respect to . This term could also be used to introduce soft constraints such as a log-barrier: for and otherwise.
3. Finally, is a smooth point-wise concave function (i.e. is convex). A typical example of this regularisation could be which favors sparsity in the vector without penalizing large values too heavily. Its slope starts at for and tends to for .
Our approach can handle regularizations of the form However, for simplicity of notation, we restrict ourselves to separable regularizations.
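To make the relative smoothness condition (6) concrete, the following sketch checks it numerically for a single column of the Poisson fidelity term. It assumes that the reference function is the Burg entropy, kappa(x) = -sum_i log(x_i), and uses the constant L = sum(b), which is known to be valid for this pair (Bauschke, Bolte and Teboulle, 2017). The names `A`, `b`, `burg`, `bregman_burg`, and `poisson_term` are hypothetical.

```python
import numpy as np

def burg(x):
    """Assumed reference function: kappa(x) = -sum_i log(x_i)."""
    return -np.sum(np.log(x))

def bregman_burg(x, z):
    """Bregman divergence kappa(x) - kappa(z) - <grad kappa(z), x - z>."""
    r = x / z
    return float(np.sum(r - np.log(r) - 1.0))

def poisson_term(x, A, b):
    """f(x) = <1, A x> - <b, log(A x)> for one column of data b."""
    Ax = A @ x
    return float(np.sum(Ax) - np.sum(b * np.log(Ax)))

def poisson_grad(x, A, b):
    return A.T @ (1.0 - b / (A @ x))

rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(8, 3))
b = rng.uniform(0.5, 5.0, size=8)
L = b.sum()  # assumed relative smoothness constant

# Smallest observed value of (upper bound - f(x)) over random pairs (x, z).
worst_gap = np.inf
for _ in range(1000):
    x = rng.uniform(0.05, 3.0, size=3)
    z = rng.uniform(0.05, 3.0, size=3)
    upper = (poisson_term(z, A, b) + poisson_grad(z, A, b) @ (x - z)
             + L * bregman_burg(x, z))
    worst_gap = min(worst_gap, upper - poisson_term(x, A, b))
print(worst_gap)
```

A non-negative `worst_gap` indicates that the Bregman upper bound held on every sampled pair, even though the gradient of the fidelity term is unbounded near zero.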
The fidelity term is convex in and in , but jointly non-convex. We note that, depending on the regularization and the constraints, multiple equivalent scaled solutions (stationary points) could exist, i.e., for and . However, this is generally not the case with additional constraints or regularizations.
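The scaling ambiguity can be illustrated with a short sketch. The names `W`, `H`, and the scale vector `d` are assumptions chosen for illustration; the point is only that rescaling the factors leaves their product unchanged, while a normalization constraint on one factor removes this freedom.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.uniform(0.5, 1.5, size=(6, 3))
H = rng.uniform(0.5, 1.5, size=(3, 4))
d = np.array([0.5, 2.0, 10.0])           # arbitrary positive scales

W_scaled = W * d                          # scale column j of W by d[j]
H_scaled = H / d[:, None]                 # scale row j of H by 1 / d[j]

# The product, and hence any loss depending only on W @ H, is unchanged.
same_product = np.allclose(W @ H, W_scaled @ H_scaled)
# A column-sum normalization of W, in contrast, is not preserved.
col_sums_before = W.sum(axis=0)
col_sums_after = W_scaled.sum(axis=0)
print(same_product, col_sums_before, col_sums_after)
```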
Why is this problem challenging?
In general, Poisson Non Negative Matrix factorization, i.e. minimizing , is challenging because it is not gradient Lipschitz (footnote 2: the function is, according to the definition, gradient Lipschitz because of the constraint ; however, in practice, the constant is chosen to be so small that its actual Lipschitz constant is too large to be useful) despite being differentiable for . In practice, this implies that there exists no fixed learning rate that ensures convergence of gradient descent and that line search would have to be used. For that reason, solving the more general problem (3) is a difficult task, and to the best of our knowledge, there are no existing algorithms that can be directly applied to it. Although there exist many algorithms to solve the traditional Poisson NMF problem [16, 30, 29, 24, 12], none of them focuses on the regularized case (see the related work Section 2). Our main contribution is to fill this gap by providing multiple algorithms that minimize (3) for a wide range of regularizations described by (4) and some linear constraints.
Our approach
A natural approach to minimize (3) is to optimize one variable at a time. For example, Block Coordinate Descent (BCD) can be expressed as
(7) |
(8) |
This type of iterative scheme ensures that the loss does not increase between iterations and has been successfully used for the L2 case. However, in the Poisson case, there is no closed-form solution for problems (7) and (8), making this approach generally computationally expensive.
Fortunately, in practice, one does not need to find the global minima of (7) and (8) at each iteration. Instead, using a Block Successive Upper-bound Minimization (BSUM) algorithm [41], it is sufficient to minimize approximations of which are locally tight upper bounds of . To use the BSUM efficiently, these approximation functions need to have three properties: (1) to satisfy the hypotheses of the BSUM Theorem [41, Theorem 2], (2) to be as tight as possible, and (3) to be easy to optimize, i.e. to lead to a closed-form solution for each subproblem.
Our contributions can be summarized as follows. We show how regularized Poisson NMF can be efficiently solved using BSUM. We derive tight upper bounds for multiple regularizers and compare our approach with traditional algorithms. We also propose a simple way to introduce linear constraints into the problem and suggest using line search to build even tighter upper bounds. Finally, we propose multiple algorithms for regularized Poisson NMF and conduct numerical simulations to demonstrate the effectiveness of our approach.
Outline of this contribution
In Section 2, we provide a review of the literature. In Section 3, we clarify the notation and provide the necessary definitions for the BSUM Theorem [41, Theorem 2] that will be used for the convergence of our algorithm. In Section 4, we develop convenient approximations of the objective and regularization functions leading to sub-problems with closed-form solutions. In Section 4.3, we explore how to modify the optimization scheme to introduce generalized simplex constraints. In Section 5, we present our algorithms. Section 6 provides numerical applications of our algorithm, and Section 7 concludes this contribution.
2 Related work
Applications of Poisson Distribution likelihood Maximization
The maximization of likelihood in a Poisson distribution finds relevance in various applications, prompting the resolution of the problem outlined in Equation (3).
Many such applications arise in the domain of physically constrained linear unmixing problems [21]. Some noteworthy instances encompass: 1. Scanning transmission electron microscopy (STEM) [52, 45, 5, 19], 2. Hyperspectral Raman and optical imaging [22, 54, 11], 3. Tensor SVD applied to denoise atomic-resolution 4D scanning transmission electron microscopy [59], and 4. Non-local Poisson PCA denoising [44, 57]. It is noteworthy that many of these applications predominantly employ the L2 case, which offers a comparatively simpler solution. As a result, data is often renormalized to convert Poisson distributions into Gaussian distributions [28]. Nevertheless, the efficacy of these applications could be significantly enhanced by the development of algorithms tailored explicitly for the Poisson case.
Furthermore, Problem (3) also surfaces in hyperspectral image denoising, where noise is assumed to follow a Poisson distribution [60, 58]. In the domain of text mining, the Poisson distribution assumption is frequently utilized for modeling word occurrences based on latent variables such as categories, leading to the problem formulation depicted by (3) [17, 38]. Additionally, within the context of recommender systems, several matrix factorization problems can be reformulated into the structure of (3) [46, 13, 27].
Other Optimization Approaches
The literature offers a limited number of optimization methods suitable for addressing the problem presented by Equation (3). This limitation arises because many optimization techniques require a continuously differentiable objective function with a Lipschitz gradient. Examples of such techniques include gradient descent [40], perturbed gradient descent [20], nonlinear conjugate gradient method [9], various proximal point minimization algorithms [26], and second-order methods like the Newton-CG algorithms [42, 43].
One potentially attractive direction is the utilization of Proximal Alternating (Linearized) Minimization (PALM) [2] or Proximal Alternating Minimization (PAM) [1]. These algorithms are designed to solve problems of the form:
PALM employs a Gauss-Seidel iteration scheme, consisting of the following sub-problems:
Unfortunately, PALM requires the objective function to possess a Lipschitz gradient, which is not the case in our scenario. Additionally, the Gauss-Seidel iterations generally lack a closed-form solution, resulting in a slow algorithm with sub-iterations.
To overcome the non-Lipschitz gradient issue, Bregman gradient descent (B-GD) [31, Algorithm 1.1] can be considered. This type of algorithm has been extended to alternating minimization [31, Algorithm 1.3 and 1.4]. Such an approach can be adapted to our case, as the objective function (3) exhibits relative smoothness for most regularization scenarios (see (6)). However, this optimization scheme involves non-tight majorization functions, leading to slow convergence, as discussed in Section 5.2.
Given the presence of two blocks of variables, Block Coordinate Descent (BCD) algorithms [53] naturally emerge as a potential solution. In fact, previous work [24] demonstrates that many existing approaches can be viewed as BCD problems. However, a primary challenge with BCD is that it requires full minimization of the sub-problems, which is difficult in the Poisson case, whereas in the L2 case the subproblems often have closed-form solutions. To address this concern, we explore BSUM algorithms [41] in this study, a generalization of BCD that avoids the requirement for full minimization of the subproblems, instead using upper bounds for the objective function.
Non-Negative Matrix Factorization
Non-Negative Matrix Factorization (NMF) algorithms have been extensively studied, and for a comprehensive review, one can refer to [12]. Initially formulated as Positive Matrix Factorization for the Gaussian (L2-NMF) case by [39], NMF has seen numerous algorithmic developments. In the L2 case, popular approaches include the Alternating Nonnegative Least Squares (ANLS) framework [23, 25, 34] and the Hierarchical Alternating Least Squares (HALS) method [7, 8]. As for the Poisson case, known as KL NMF, the first algorithm using Multiplicative Updates (MU) was proposed by [30], with later demonstrations of its convergence provided in [29] for both Poisson and Gaussian cases. A more rigorous convergence analysis is presented in [33].
Considering the specific problem of Poisson KL-NMF, there have been a few notable contributions. For instance, [48] employed the Alternating Direction Method of Multipliers (ADMM) with the variable change . The Primal Dual algorithm, based on the framework from Chambolle-Pock, was explored by [56]. Moreover, [16] conducted a comparative study of various optimization algorithms for KL NMF, including MU [29], ADMM [48], Primal Dual [56], and Cyclic Coordinate Descent Method [18]. They also introduced three new algorithms for KL-NMF, namely Block Mirror Descent Method, A Scalar Newton-Type Algorithm, and A Hybrid SN-MU Algorithm.
In the introduction, we mentioned that there are few contributions that address the problem of regularized NMF, with many of them focusing on the L2 case. [24] demonstrated how many existing works can be cast as Block Coordinate Descent (BCD) problems, allowing the derivation of MU update rules for different regularizers, such as L1 for sparsity. However, their work is limited to L2 NMF. Xu et al. [55] proposed a general optimization scheme for block multiconvex optimization using block coordinate descent, which can accommodate regularization on each block. Although applied to L2-NMF, such an approach may lead to algorithms with sub-iterations. In the context of L2-NMF with sparsity constraints, [50] presented an approach to address this scenario. Additionally, [49] introduced a framework for handling L2-NMF with Lipschitz regularizers, akin to our term . Other forms of regularization have also been explored, such as graph-based [6] or simplex constraint [17].
Regarding the specific problem of regularized KL loss, [15] provided a notable contribution. However, their work focused solely on the subproblems of the NMF problem, rather than addressing the NMF problem itself. Notably, one could potentially employ a similar approach to solve the subproblems of (3) using BCD. Nonetheless, this would result in a less efficient algorithm with sub-iterations.
3 Preliminaries
3.1 Notation
We reserve capital letters for matrices and vectors, e.g., . We use to refer to the -th numbered vector, and to denote the -th element of vector . The -th element of vector or the element at the -th row and -th column of matrix is denoted as .
and indicate that all entries of matrix or vector are greater than or equal to , i.e., for all . represent the transpose of and , respectively. We use to denote at step . denotes the transpose of . and denote elementwise multiplication (also known as the Hadamard product) and division for matrices, respectively. For example, and .
As mentioned in the introduction, the matrix to be factorized is generally denoted as , where and are its factors. We will use as the vectorized versions of , while will denote the -th column of . are general variables that can replace either or . We use as the general loss function, and is broadly used to denote a multivariate scalar function.
Finally, we commonly employ calligraphic notation for variable domains. Let serve as a generic domain for the variable . In practical terms, it is frequently defined by the constraint, specifically as . Additionally, we utilize and to represent the domains of and , respectively.
3.2 Definitions
In this contribution, we consider the loss function with two blocks of variables: and , where and are both non-empty convex sets. Here, and correspond to the vectorized versions of the two matrices and , respectively. Therefore, and often correspond to the sets and with . Let us use to denote all the variables. We have , where the total dimension of the problem is .
Definition 1 (Directional derivative).
Let be a scalar function, where is a convex set. The directional derivative of at point in the direction is defined by
Note that when is differentiable, since and are row vectors. In this contribution, almost all functions are differentiable on the domain of interest.
Definition 2 (Coordinatewise Minimum).
The point is a coordinatewise minimum of a function if
A coordinatewise minimum is a natural termination point for an alternating minimization algorithm. However, it is important to note that a coordinatewise minimum is not equivalent to a local minimum, as it does not guarantee minimality in all directions. Figure 1 (a) provides a counterexample illustrating this.
Another significant concept is the notion of a stationary point, where the directional derivative is non-negative in all feasible directions.
Definition 3 (Stationary Points of a function).
Let be a scalar function, where is a convex set. A point is a stationary point of if
We emphasize that a stationary point is not equivalent to a strict local minimum as there might be directions where the directional derivative equals 0. For example, in the simple function , the point has a zero derivative in the direction , which corresponds to rescaling the solution as . Even worse, a stationary point is not necessarily a local minimum, even if it is a coordinatewise minimum, as shown in Figure 1 (a). Here, in the diagonal directions, the directional derivative equals 0 but the function is concave in this direction.
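The distinction between a coordinatewise minimum and a local minimum can be reproduced with a small sketch. The function below is a hypothetical example chosen for illustration and is not necessarily the one shown in Figure 1: it is convex along each coordinate axis through the origin, but concave along the diagonal.

```python
def f(x, y):
    # Hypothetical example: convex along each axis, concave along the diagonal.
    return x**2 + y**2 - 3.0 * x * y

steps = [t / 10.0 for t in range(-10, 11)]

# Along each coordinate axis through the origin, f is minimized at 0 ...
coordinatewise_min = all(
    f(t, 0.0) >= f(0.0, 0.0) and f(0.0, t) >= f(0.0, 0.0) for t in steps
)
# ... but along the diagonal x = y we get f(t, t) = -t**2,
# which decreases when moving away from the origin.
diagonal_descends = all(f(t, t) < f(0.0, 0.0) for t in steps if t != 0.0)
print(coordinatewise_min, diagonal_descends)
```

The origin is therefore both a coordinatewise minimum and a stationary point (the gradient vanishes there), yet it is a saddle point rather than a local minimum.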
When it comes to the Poisson loss function, there is, at least, a continuous set of local minima corresponding to rescaling the solution. This is illustrated in Figure 1 (b). Note that the introduction of regularization or constraints can lead to strict local minima, as shown in Figure 1 (c).
In this contribution, we prove convergence to a coordinatewise minimum that is also a stationary point. To accomplish this, we will consider a class of functions that are regular at their coordinatewise minima.
Definition 4 (Regularity of a function at a point).
The function is said to be regular at the point if for all such that and .
Lemma 1.
Continuously differentiable functions are regular at their coordinatewise minima.
3.3 Approximation functions
In order to facilitate optimization algorithms, it is beneficial to work with approximation functions that majorize or approximate the objective function at a given point. One commonly used class of approximation functions is known as first-order majorization functions. These functions provide a convenient framework for constructing surrogates and facilitating optimization. We adopt the definition of first-order majorization functions from [41].
Definition 5.
[41, Assumption 1] A function is said to be a first-order majorization of at the point if it satisfies the following properties:
It is worth noting that for continuously differentiable functions, the third statement can be equivalently expressed as . Although the definition of first-order majorization functions resembles the concept of surrogate functions introduced in [37, Definition 2.2], a surrogate function additionally requires that is L-gradient Lipschitz as defined in (5). Importantly, all majorization functions defined in Section 4.1 satisfy this condition and can thus also serve as surrogate functions.
Conveniently, majorization functions can be built term by term, leveraging their additivity property. This property allows us to combine multiple majorization functions to obtain a new majorization function.
Lemma 2.
First-order majorization functions are additive. If and majorize and at , respectively, then majorizes at .
Proof.
Summation preserves each property of Definition 5. ∎
Lemma 2 provides a valuable tool for constructing majorization functions by combining simpler majorization functions. Additionally, when proving that a function is majorizing, it is often unnecessary to explicitly demonstrate the equality of partial derivatives or gradients at . Instead, in the case of differentiable functions, it is typically sufficient to establish the first two properties (A.1 and A.2). According to [41, Proposition 1], properties A.3 and A.4 follow as a consequence. Intuitively, one can observe that the continuity of the gradient ensures that the majorization function shares the tangent spaces with at the point .
3.4 Two Blocks Successive Minimization (TBSUM)
The TBSUM algorithm is designed to solve the following problem:
(9) |
It relies on two first-order majorizing functions: and , which majorize at for all and . The construction of these functions will be presented in Section 4. The TBSUM algorithm, outlined in Algorithm 1, alternates between minimizing and . It is assumed that the subproblem solutions are unique. Theorem 1 establishes the convergence of the TBSUM algorithm, which is a variant of the algorithm presented in [41, Theorem 2a] adapted for solving the specific problem at hand.
Theorem 1 (Convergence of TBSUM Algorithm 1).
Let and be two quasi-convex first-order majorizing functions of at , and assume that the two subproblems in the TBSUM Algorithm 1 have unique solutions for any points , . Then, every limit point of the iterates generated by the TBSUM Algorithm 1 is a coordinatewise minimum of (9). In addition, if is regular at any point , then is a stationary point of (9).
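The alternating structure of the scheme can be sketched as follows. This is not Algorithm 1 verbatim: as a stand-in for the closed-form minimizers of the two majorizing functions, the sketch uses the classical unregularized multiplicative updates of [29, 30], clipped to a small lower bound. All names (`X`, `W`, `H`, `eps`, `two_block_mm`) are assumptions made for illustration.

```python
import numpy as np

def kl_loss(X, W, H):
    WH = W @ H
    return float(np.sum(WH) - np.sum(X * np.log(WH)))

def mu_step_W(X, W, H, eps):
    # Minimizer of the separable majorizer in W, restricted to W >= eps.
    W_new = W * ((X / (W @ H)) @ H.T) / H.sum(axis=1)
    return np.maximum(W_new, eps)

def mu_step_H(X, W, H, eps):
    # Same construction for the second block, using the updated W.
    H_new = H * (W.T @ (X / (W @ H))) / W.sum(axis=0)[:, None]
    return np.maximum(H_new, eps)

def two_block_mm(X, W, H, n_iter=50, eps=1e-8):
    """Alternate the two majorize-minimize steps and record the loss."""
    losses = [kl_loss(X, W, H)]
    for _ in range(n_iter):
        W = mu_step_W(X, W, H, eps)
        H = mu_step_H(X, W, H, eps)
        losses.append(kl_loss(X, W, H))
    return W, H, losses

rng = np.random.default_rng(5)
rate = rng.uniform(0.5, 3.0, size=(8, 2)) @ rng.uniform(0.5, 3.0, size=(2, 10))
X = rng.poisson(rate).astype(float)
W0 = rng.uniform(0.5, 1.5, size=(8, 2))
H0 = rng.uniform(0.5, 1.5, size=(2, 10))
W, H, losses = two_block_mm(X, W0, H0)
print(losses[0], losses[-1])
```

Because each step minimizes an upper bound that is tight at the current iterate, the recorded loss should not increase from one iteration to the next.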
4 Subproblem minimization
In this section, we focus on constructing the appropriate majorization functions and for our problem (3). Since we consider the same type of regularization for and , both subfunctions have the same form.
Practically, the loss function can be rewritten as
where and are the row and column of and . The functions and have the form:
(10) |
Therefore, in this section, our objective is to find majorization functions for (10). Once this is done, we will provide closed-form solutions for steps 1 and 2 of Algorithm 1. It is worth noting that each term of (10) can be handled separately using the additivity property of majorizing functions (Lemma 2).
4.1 Majorizing functions
The following four lemmas provide majorizing functions for the different terms of our objective function. Proofs are provided in Appendix A.2.
In order to develop an efficient algorithm, our objective is to identify majorizing functions that result in sub-problems with closed-form tractable solutions. Often, this can be accomplished under two conditions: 1. all the majorizing functions are of the same form, and, 2. the majorization function is separable with respect to the variables , i.e., . Within the scope of this contribution, we consider two forms of majorizing functions: quadratic and logarithmic .
First, we propose a majorization scheme for the logarithmic term in the objective function (10). We utilize a widely used majorization technique based on the concavity of the logarithm function. This technique has been employed in the original work by Lee and Seung [29] as well as in many EM (Expectation-Maximization) schemes.
Lemma 3 (Log majorization).
Assuming , for , let us define for , then is a first order majorizing function of .
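The concavity argument behind this kind of bound can be checked numerically. The sketch below uses the standard Jensen construction for the negative logarithm of a positive linear form; the exact expression of the lemma is not reproduced here, and the names `a`, `x_ref`, `neg_log_lin`, and `log_majorizer` are hypothetical.

```python
import numpy as np

def neg_log_lin(x, a):
    """f(x) = -log(<a, x>) for a, x > 0."""
    return -np.log(a @ x)

def log_majorizer(x, x_ref, a):
    """Jensen upper bound of f, tight at x = x_ref and separable in x."""
    lam = a * x_ref / (a @ x_ref)           # convex weights, they sum to 1
    return -np.sum(lam * np.log(a * x / lam))

rng = np.random.default_rng(3)
a = rng.uniform(0.1, 1.0, size=4)
x_ref = rng.uniform(0.1, 2.0, size=4)

# Gap between the bound and the function at random positive points.
gaps = [log_majorizer(x, x_ref, a) - neg_log_lin(x, a)
        for x in rng.uniform(0.1, 2.0, size=(500, 4))]
tight = np.isclose(log_majorizer(x_ref, x_ref, a), neg_log_lin(x_ref, a))
print(min(gaps), tight)
```

The bound is a sum of one-dimensional terms, which is what makes a closed-form coordinatewise minimizer possible.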
We now proceed to majorize the different terms of the regularisation function , , and . We can majorize any Lipschitz function using the following lemma.
Lemma 4 (Lipschitz-majorization).
Given a gradient Lipschitz function with constant over the domain . The functions
(11) |
and
(12) |
are first-order majorizing functions at
We note that (11) (quadratic majorisation) is tighter than (12) (logarithmic majorisation). However, the looser majorisation function is needed to obtain a closed-form solution for the MU (see Section 4).
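A quadratic majorization of a gradient Lipschitz regularizer can be illustrated as follows. The sketch assumes a regularizer of the form 0.5 * x^T Lap x with a one-dimensional path-graph Laplacian as a stand-in for the Laplacian operator, and takes the largest eigenvalue of `Lap` as the Lipschitz constant; these choices and all names are assumptions.

```python
import numpy as np

n = 6
# 1-D path-graph Laplacian (symmetric positive semi-definite), used as a stand-in.
Lap = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Lap[0, 0] = Lap[-1, -1] = 1.0

def reg(x):
    return 0.5 * x @ Lap @ x

def reg_grad(x):
    return Lap @ x

lip = np.linalg.eigvalsh(Lap).max()        # Lipschitz constant of the gradient

def quad_majorizer(x, x_ref):
    """First-order expansion at x_ref plus an isotropic quadratic term."""
    d = x - x_ref
    return reg(x_ref) + reg_grad(x_ref) @ d + 0.5 * lip * d @ d

rng = np.random.default_rng(4)
x_ref = rng.uniform(0.1, 1.0, size=n)
gaps = [quad_majorizer(x, x_ref) - reg(x)
        for x in rng.uniform(0.1, 1.0, size=(500, n))]
print(min(gaps))
```

The quadratic term is separable in the coordinates of `x`, even though the regularizer itself couples neighbouring entries.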
Next, the term that is relatively smooth can be majorized using the following lemma.
Lemma 5 (Relative smoothness majorization).
Assuming a relatively smooth function with respect to for Then the function
(13) |
is a first order majorizing function of for
Lemma 6 (Concave majorisation).
Given a concave function defined on , its linear approximation at the point
(14) |
is a first order majorization function for
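The tangent bound for a concave term can be checked with a short sketch. The penalty log(1 + x) is a hypothetical choice made for illustration and is not necessarily the sparsity regularizer used in this contribution.

```python
import math

def g(x):
    # Hypothetical concave sparsity-promoting penalty on x >= 0.
    return math.log(1.0 + x)

def g_prime(x):
    return 1.0 / (1.0 + x)

def tangent_majorizer(x, x_ref):
    """Linearization of g at x_ref; an upper bound because g is concave."""
    return g(x_ref) + g_prime(x_ref) * (x - x_ref)

x_ref = 0.7
grid = [0.01 * k for k in range(0, 1001)]   # x in [0, 10]
min_gap = min(tangent_majorizer(x, x_ref) - g(x) for x in grid)
print(min_gap)
```

Since the bound is linear in `x`, it only shifts the linear coefficient of the subproblem and does not affect the form of the closed-form update.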
4.2 Subproblem updates
Now that we have defined majorizing functions for each term of (10), we can apply the additivity property of Lemma 2 to obtain a general majorizing function for :
(15) | ||||
where We use the colors green, blue, orange, and purple to denote and keep track of the dependencies of the different terms in (10). Finding the local optimum of the majorizing function will provide us with an update for Algorithm 1.
Proposition 1 (Generalized MU for (10)).
Assuming , the first-order majorizing function defined in (15) is strictly convex, and its global minimum is given by
(16) |
where
(17) | ||||
(18) |
The proof is provided in Appendix A.3.
Generalization of the traditional MU Rule
Connection with (Block) Mirror Descent [16, Algorithm 1]
Another interesting observation is that the majorization of the relatively smooth term is done similarly to a Bregman proximal method algorithm [14]. Since the objective function is relatively smooth, one could drop all terms except for and optimize using Block Bregman Proximal Gradient (BBPG) [50]. This would result in an algorithm very similar to Block Mirror Descent (BMD), which has recently been proposed for solving Poisson NMF [16]. Nevertheless, we advise against this solution, as discussed further in Section 5.2.
Alternative majorizing function and Quadratic Update (QU)
As shown experimentally in Section 6 and illustrated in Figure 2, having a majorization function as tight as possible leads to faster convergence of the algorithm. In Proposition 1, we deliberately choose to use a looser majorizing function for the term in order to recover an algorithm with multiplicative update that generalizes the original approach from [29]. However, instead of using (12), one can also use (11) when constructing the majorizing function:
(19) | ||||
which is also a strictly convex function.
Proposition 2 (QU for (10)).
Assuming , the first-order majorizing function defined in Equation (19) is strictly convex, and its global minimum is given by
(20) |
where
(21) |
The proof is provided in Appendix A.3. Both of these propositions lead to the update rules for our MU and QU algorithms detailed in Section 5. We also note that, with the appropriate assumptions, the update rules (16) and (20) preserve the positivity of the variable . However, since we also want to handle extra constraints, we develop a more rigorous approach in the next section.
4.3 Generalized simplex constraint
We need to handle two constraints: 1. the linear constraint , and, 2. the scale constraint , where . While the first one is used to keep the variable non-negative, typically with a strictly positive small , the second one can set the scale of one of the variables ( or ) in the factorization problem. Furthermore, in the case , the simplex constraint is recovered. It turns out that the update rules of (16) and (20) can simply be updated to handle this constraint. The actual optimization problem we want to solve becomes:
where is given in (10).
To solve this problem, we use the KKT approach, i.e., we find points that satisfy the KKT (Karush-Kuhn-Tucker) conditions:
1. Stationarity
2. Primal feasibility
3. Dual feasibility
4. Complementary slackness
where the Lagrangian is defined as:
We follow the same method as developed in Section 4.2, except that we majorize the Lagrangian . The resulting first-order majorizing function is given by:
where is given in (15) or (19). We repeat the development of Section 4.2 (and the proofs of Appendix A.3). We end up with an update that is very similar to (16) or (20). In the MU case, we end up with:
where the only final difference consists of two terms in cyan and violet ( and remain identical). For the QU, we stick to the same update rule (20), where only is modified:
This update rule ensures the first of the KKT conditions (stationarity). We now find such that the second KKT condition holds (primal feasibility). It turns out that does not need to be computed explicitly. In the MU case, is selected to be large enough such that
(22) |
In the QU case, we obtain
(23) | ||||
Note that dual feasibility and complementary slackness could be verified, but we leave them out for simplicity. We then need to find the value of such that , which is equivalent to searching for
(24) |
Similarly, for the quadratic update of (20), we search for that satisfies
(25) |
There is no closed-form solution for ; however, the value can be found using a simple dichotomy search. Bounds for starting the dichotomy are computed in Appendix B.
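A dichotomy (bisection) search of this kind can be sketched as follows. The clipped update `x_i(nu) = max(eps, a_i / (b_i + nu))` is a hypothetical stand-in for the actual update rule, and the bracketing interval is chosen by hand rather than from the bounds of Appendix B; the only property used is that the constraint residual is continuous and non-increasing in `nu`.

```python
def bisect_root(g, lo, hi, tol=1e-10, max_iter=200):
    """Find nu in [lo, hi] with g(nu) close to 0, assuming g is continuous,
    non-increasing, and g(lo) >= 0 >= g(hi)."""
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid          # the root lies to the right of mid
        else:
            hi = mid          # the root lies to the left of mid (or at mid)
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Hypothetical clipped update x_i(nu) = max(eps, a_i / (b_i + nu)).
a = [0.8, 1.5, 0.3, 2.0]
b = [1.0, 0.5, 2.0, 1.5]
eps, target = 1e-6, 1.0

def update(nu):
    return [max(eps, ai / (bi + nu)) for ai, bi in zip(a, b)]

def constraint_gap(nu):
    # Residual of the sum constraint; decreasing in nu on the bracket.
    return sum(update(nu)) - target

nu = bisect_root(constraint_gap, lo=-0.4, hi=100.0)
x = update(nu)
print(nu, sum(x))
```

Each evaluation of the residual costs one pass over the vector, so the total cost of the search is proportional to the number of dichotomy iterations.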
Case
Most of our reasoning relies on the fact that and, therefore, on the domain constraint . We have found that setting to a small non-zero value works well in practice. However, our approach can likely be generalized to the case where , following the approach of [33, Section 4], which studies the unregularized Poisson NMF case.
5 Algorithms for Poisson matrix factorisation
Equipped with the update rules developed in Sections 4 and 4.3, we are ready to tackle the general problem of this contribution (footnote 3: here we show the problem with the linear constraint on ; by symmetry, a similar algorithm can be developed with the constraint on ), which consists of minimizing (2):
(26) | ||||
such that |
We first observe that with respect to each variable , the problem is separable by column/row. For example, given , finding the optimal can be done for each column:
We therefore apply the TBSUM Algorithm 1, where all rows of and all columns of are updated independently, and obtain the two Algorithms 2 and 3. We note here that the function ensures that the solutions are (non-negativity).
Convergence
The two update rules in steps 5 and 10 correspond to minimizing first-order strictly convex majorization functions. As a result, we can apply Theorem 1 to guarantee convergence towards a coordinate-wise minimum. It is important to note that this coordinate-wise minimum is also a stationary point, given that the objective function remains regular for any point .
5.1 Algorithm complexity
Let’s examine the complexity of both Algorithms 2 and 3 when considering and . In each iteration, the following complexities are observed:
(a) Step 4 has a complexity of .
(b) Step 5 has a complexity of .
(c) Step 8 has a complexity of .
(d) Step 9 has a complexity of , where denotes the number of iterations performed by the dichotomy.
(e) Step 10 has a complexity of .
Thus, the overall complexity per iteration can be expressed as . This indicates that the computational complexity per iteration is linear with respect to the problem size, i.e., , multiplied by the number of components, i.e., .
Impact of the dichotomy
When is small, the computational cost of the dichotomy in step 9 becomes dominant. For larger values of , however, the impact of the dichotomy becomes negligible.
5.2 Tight Majorizing Functions
While we do not make any theoretical contributions concerning the speed of convergence of the algorithm, we want to emphasize that tighter majorizing functions naturally lead to faster convergence. Therefore, when evaluating an algorithm, we believe that the analysis of the underlying majorizing function is as insightful as the experimental evaluation. As an example, we could have used Block Mirror Descent to solve (3), as was done in [16, Algorithm 1]. This algorithm uses a Bregman divergence to create a majorization function for the subproblem. However, this would result in a much looser majorization function, which partly explains the slow convergence of this algorithm observed in [16]. This difference between majorization functions is exemplified in Figure 2.
Linesearch
By tightening the bounds we used to construct the surrogate function, we can develop a more efficient algorithm. Here, we apply a classic "linesearch" method to the functions . However, the same technique can be trivially applied to as well. First, in (12) or (11), replace the constant with a parameter and initialize it with . Second, at each iteration, update the parameter according to the following rule:
Here, and are two update rates that determine how fast is updated. Choosing values that are too small for these parameters leads to an inefficient linesearch, while selecting values that are too large can result in strong oscillation patterns. Typical values for and range from 1.05 to 1.5. However, it is important to note that when using linesearch, we are not guaranteed to converge, as we might invalidate the assumptions of Theorem 1.
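The linesearch idea can be illustrated on a one-dimensional toy problem. The sketch below is not the update rule of this report: it assumes a simple rule in which the surrogate constant (here `gamma`, a hypothetical name) is divided by `rate_down` whenever the quadratic surrogate still majorizes the loss at the new point, and multiplied by `rate_up` otherwise. The toy loss, the function names, and the stopping tolerance are all assumptions made for illustration.

```python
import math

def f(x):
    # Smooth convex toy loss; its second derivative is at most 0.25.
    return math.log1p(math.exp(x)) - 0.3 * x

def grad_f(x):
    # Derivative of f: sigmoid(x) - 0.3.
    return 1.0 / (1.0 + math.exp(-x)) - 0.3

def minimize_with_linesearch(x, gamma0=0.25, rate_up=1.2, rate_down=1.2,
                             n_iter=200, tol=1e-8):
    """Minimize f with a quadratic surrogate whose constant gamma is adapted.

    Assumed rule (hypothetical): shrink gamma when the surrogate majorizes
    the loss at the new point, grow it otherwise. The step is always
    accepted, so the loss is not guaranteed to decrease monotonically.
    """
    gamma = gamma0
    history = [f(x)]
    for _ in range(n_iter):
        g = grad_f(x)
        if abs(g) < tol:
            break
        x_new = x - g / gamma          # minimizer of the quadratic surrogate
        d = x_new - x
        surrogate = f(x) + g * d + 0.5 * gamma * d * d
        if f(x_new) <= surrogate:
            gamma /= rate_down         # bound was valid: try a tighter one
        else:
            gamma *= rate_up           # bound was violated: loosen it
        x = x_new
        history.append(f(x))
    return x, gamma, history

x_star, gamma_final, history = minimize_with_linesearch(3.0)
print(x_star, gamma_final, len(history))
```

In this toy setting the constant settles near the local curvature of the loss rather than the global bound 0.25, which is what makes the adapted surrogate tighter.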
6 Numerical Simulation
Problem
In this section, we analyze the speed of convergence of Algorithms 2 and 3 through numerical simulations. As a regularizer for , we consider the Laplacian regularization , where represents the two-dimensional Laplacian for the th line of reshaped as images of size . Since a straightforward approach to minimize is to reduce the amplitude of , we add the simplex constraint . This leads to the following optimization problem:
This particular problem can be applied in various domains, such as Non-Negative Matrix Factorization for hyperspectral images [36] and remote sensing [32] (see Related Work Section 2 for more references and applications). Our algorithms and regularizations were specifically developed for the espm Python package [51]. All algorithms and experiments can be found in the espm package.
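To make the role of the Laplacian regularizer and of the simplex constraint concrete, the following sketch evaluates a discrete smoothness penalty on each row of a matrix reshaped as an image. It assumes the quadratic-form reading of the regularizer (sum of squared differences between neighbouring pixels, i.e. the graph Laplacian of the pixel grid); the function name `laplacian_penalty`, the variable `H`, and the image shape are hypothetical, and this is not the espm implementation.

```python
import numpy as np

def laplacian_penalty(H, shape):
    """Sum, over the rows of H, of the squared differences between
    horizontally and vertically adjacent pixels of the row reshaped
    as an image of the given shape (grid graph Laplacian quadratic form)."""
    imgs = H.reshape(H.shape[0], *shape)
    dx = np.diff(imgs, axis=1)   # differences between vertical neighbours
    dy = np.diff(imgs, axis=2)   # differences between horizontal neighbours
    return float((dx ** 2).sum() + (dy ** 2).sum())

rng = np.random.default_rng(0)
shape = (8, 8)                              # hypothetical image size
H = rng.uniform(size=(3, shape[0] * shape[1]))

penalty = laplacian_penalty(H, shape)
penalty_scaled = laplacian_penalty(0.5 * H, shape)
penalty_flat = laplacian_penalty(np.ones_like(H), shape)
print(penalty, penalty_scaled, penalty_flat)
```

Halving the amplitude divides the penalty by four without making the images any smoother, which is the degenerate behaviour the simplex constraint is meant to rule out.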
Dataset
We construct two datasets consisting of 50 randomly drawn samples. In the first dataset, both matrices and are randomly generated from a uniform distribution. In the second dataset, each column of corresponds to the sum of Gaussian functions that are randomly centered and scaled. The matrix represents random smooth images. This second dataset is created using the espm package [51], where the toy model is used for and is generated using the "laplacian" weight type. This dataset is chosen because it can benefit from the Laplacian regularization on .
Once and are generated, the noiseless matrix is obtained as . We introduce noise by independently sampling each element , where can be regarded as the noise control parameter. For all samples, we set and with , , and selected from the set . Thus, the images in the dataset have dimensions of .
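The first (random uniform) dataset can be sketched as follows. The exact sampling formula is not reproduced here, so the snippet assumes one common parameterization: each entry is drawn from a Poisson distribution whose mean is the noiseless value multiplied by a count level `n_counts`, then divided by `n_counts`. The names `W`, `H`, `Y`, `n_counts` and the sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 20 channels, 30 pixels, 3 components.
m, n, k = 20, 30, 3
W = rng.uniform(size=(m, k))
H = rng.uniform(size=(k, n))
Y = W @ H                      # noiseless matrix

def sample_poisson(Y, n_counts, rng):
    """Assumed noise model: Poisson counts with mean n_counts * Y,
    rescaled so that the expectation of the result equals Y."""
    return rng.poisson(n_counts * Y) / n_counts

X_low = sample_poisson(Y, 10, rng)       # few counts: strong noise
X_high = sample_poisson(Y, 10_000, rng)  # many counts: weak noise

err_low = np.abs(X_low - Y).mean()
err_high = np.abs(X_high - Y).mean()
print(err_low, err_high)
```

Under this assumed model, a larger count level yields data closer to the noiseless matrix, which is the sense in which the parameter controls the noise.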
Results
We compare the performance of Algorithm 2 (MU), Algorithm 3 (QU), Block Mirror Descent (similar to [16, Algorithm 1]) and the projected gradient algorithm applied to (3). Figure 3 displays the convergence curves for 1000 iterations and . Although the overall complexity of all algorithms is the same, the time per iteration differs due to the different operations performed within each iteration and the time spent on dichotomy to compute the dual variable . Therefore, we provide the time in seconds for each algorithm in Figure 4 for various values of . All results are averaged over 50 repetitions.
Figure panels (images not recovered): columns are "Smooth images samples" and "Random uniform , samples"; rows are "Noiseless" and "Noisy".
Discussion
Let’s discuss the results in more detail:
- Number of iterations: Figure 3 illustrates the convergence behavior of the QU and MU algorithms. QU converges faster per iteration than MU, which aligns with our expectations given the tighter majorizing function used in QU. However, the linesearch technique, while accelerating convergence, can lead to instability, as indicated by occasional increases in the loss function. This observation supports our earlier discussion in Section 5.2.
The challenge with Projected Gradient Descent is to find an initial learning rate that is neither too large, in which case the algorithm can diverge, nor too small, in which case the algorithm is slow. Since the learning rate cannot be selected optimally, the algorithm is overall slower than QU and MU.
The Block Mirror Descent algorithm is also slower than QU and MU, which is consistent with the results of [16] and can be explained by the fact that the majorizing function used in the algorithm is looser.
- Time per iteration: Figure 4 presents the total time taken by the algorithms to complete 100 iterations. The results demonstrate that for small values of , the computation of the dual variable during the dichotomy process dominates the overall execution time. However, as increases, the time spent on dichotomy becomes negligible in comparison. These findings align with the complexity per iteration discussed in Section 5.1.
7 Conclusion
This contribution is the first to address the Poisson NMF problem with general regularization terms, such as Lipschitz functions, relatively smooth functions, or those expressed as linear constraints. We introduce two new algorithms and demonstrate their convergence to a coordinate-wise minimum, which is also a stationary point. Emphasizing the impact of the majorizing function choice on convergence speed, we validate our findings through numerical simulations. In essence, we believe that this work serves as a helpful guide for developing efficient algorithms suited for regularized Poisson NMF problems.
Appendix A Proofs
In this Appendix, we provide the different proofs used in the paper.
A.1 Proof of Lemma 1
We note that this lemma and its proof likely exist in the literature, but we were unable to find a reference. See Lemma 1.
Proof.
If is continuously differentiable, the directional derivative can be written as . At a coordinatewise minimum , we have by definition:
and
Therefore,
∎
A.2 Proof of majorizing functions
See 3
Proof.
First let us observe that
Therefore, we have . The inequality follows from the convexity of the function:
where we set . Finally, by continuity, we obtain the and properties of majorizing functions. ∎
We now proceed to majorize , , and . See 4
Proof.
First it can be trivially observed that
which satisfies the first property. We then take order Taylor expansion of around and find
where since the function is gradient Lipschitz with constant . Therefore, we have
(27) | ||||
where (27) will be shown later in this proof. By continuity, we obtain the and property of majorizing functions. We now need to prove (27) and reformulate it as
(28) |
For simplicity, let us define the function
with the gradient and the Hessian
Note that is a strictly convex function for . We expand the generalized KL divergence:
(29) |
where is selected such that the last equality holds. Since is a strictly convex function, we know that for some given . Now we bound the Hessian as
and introducing this inequality in (29), we obtain
which is equivalent to (28) and completes the proof. ∎
See 5
Proof.
The first property can be trivially verified. Then, by the definition of a relatively smooth function,
where
Given that , we compute
Finally, by continuity, we obtain the and property of majorizing functions. ∎
See 6
Proof.
One can simply observe . Then by concavity, we have
Finally, by continuity, we obtain the and property of majorizing functions. ∎
A.3 Proof of subproblem updates
In this subsection, we present the proof of the subproblem updates used in the MU and QU algorithms. Let us start with the MU updates. See 1
Proof.
Assuming , (10) is strictly convex, as the green term is strictly convex, and the remaining terms are convex. Consequently, (10) possesses a global minimum. To identify this minimum, we seek the stationary point . Due to our meticulous selection of majorizing functions, this subproblem becomes separable. When computing the gradient with respect to the variable , we obtain:
(30) |
where we assume that since . Transforming the above expression, we find a multiplicative update rule (16) for . ∎
The proof of the QU updates is similar to that of the MU updates, except that we need to solve a quadratic equation to obtain a closed-form solution. See 2
Proof.
Assuming , Equation (19) is strictly convex. This convexity arises from the strict convexity of the green term, coupled with the convexity of the other terms. Consequently, (19) possesses a global minimum. To identify this minimum, we seek the stationary point by computing :
(31) |
We observe that it is a separable quadratic function, hence the update rule named Quadratic Update (QU). Solving (31) for can be rewritten as
where , , and are given in (21). Assuming , we have and . Therefore, the previous quadratic equation has two real solutions. Since , they are of opposite sign. Due to the constraint , we select the positive one, leading to the update rule of (20) for . ∎
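The root selection at the end of this proof can be sketched numerically. The snippet assumes a generic quadratic a x^2 + b x + c = 0 with a > 0 and c < 0, so that the two real roots have opposite signs; it does not use the actual coefficients of (21), and the function name `positive_root` is hypothetical. The branch on the sign of b avoids subtracting two nearly equal numbers.

```python
import math

def positive_root(a, b, c):
    """Positive root of a*x**2 + b*x + c = 0, assuming a > 0 and c < 0.

    Under these assumptions the discriminant exceeds b**2, so both roots
    are real and their product c / a is negative: exactly one is positive.
    """
    if not (a > 0 and c < 0):
        raise ValueError("expected a > 0 and c < 0")
    s = math.sqrt(b * b - 4.0 * a * c)
    if b > 0:
        # Equivalent to (-b + s) / (2a), rewritten to avoid cancellation.
        return -2.0 * c / (b + s)
    return (-b + s) / (2.0 * a)

r1 = positive_root(1.0, -1.0, -2.0)   # roots are 2 and -1
r2 = positive_root(1.0, 1.0, -2.0)    # roots are 1 and -2
r3 = positive_root(1.0, 1e8, -1.0)    # tiny positive root, large b
print(r1, r2, r3)
```

Applied entrywise, a formula of this kind gives a closed-form, strictly positive update without any inner iteration.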
Appendix B Computation of Lower and Upper Bounds for Dichotomy
In Section 4.3, we introduced modifications to the MU and QU algorithms to incorporate the positivity constraint and the linear constraint . However, solving for the dual parameter in equations (24) for MU or (25) for QU is intractable. To address this, we propose using the dichotomy method to solve for . Therefore, this appendix provides the computation of lower bound and upper bound such that and . These bounds will serve as convenient initializations for the dichotomy algorithm.
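A minimal sketch of such a dichotomy (bisection) is given below. It assumes a continuous, decreasing function of the dual parameter with a positive value at the lower bound and a negative value at the upper bound. The toy function used here is only a stand-in with the same qualitative shape; it is not equation (24) or (25), and all names are hypothetical.

```python
def dichotomy(f, lower, upper, tol=1e-10, max_iter=200):
    """Find the root of a continuous decreasing function f on [lower, upper],
    assuming f(lower) > 0 > f(upper)."""
    if not (f(lower) > 0 > f(upper)):
        raise ValueError("the bounds do not bracket the root")
    for _ in range(max_iter):
        mid = 0.5 * (lower + upper)
        if f(mid) > 0:
            lower = mid      # root lies to the right of mid
        else:
            upper = mid      # root lies at mid or to its left
        if upper - lower < tol:
            break
    return 0.5 * (lower + upper)

def toy(nu):
    # Decreasing for nu > -1; equals zero at nu = 1.
    return 1.0 / (1.0 + nu) + 2.0 / (3.0 + nu) - 1.0

nu_star = dichotomy(toy, lower=0.0, upper=10.0)
print(nu_star)
```

Each iteration halves the bracket, so the number of iterations grows only logarithmically with the ratio between the initial bracket width and the tolerance, which is why tight initial bounds are useful but not critical.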
B.1 Case 1: MU
For MU, we aim to solve equation (24) for :
Terms where can be ignored since they do not contribute to the sum. Assuming , the function is well-defined for . Since , we have for , and is monotonically decreasing for . Assuming (which ensures the feasibility of the constraints and ), we have and . Thus, there exists exactly one root for the function .
Negative bound
First, let’s find such that for . We can bound as follows:
Therefore one possible bound is
Positive bound
Similarly, let’s find such that for . Note that is not a good bound when is small. We can bound as follows:
As a result, we have
To improve numerical stability, one could use and .
B.2 Case 2: QU
For QU, we aim to solve equation (25) for :
Let’s analyze the function . We observe that the function is strictly decreasing over for , since its derivative is strictly negative for . Therefore, each term of the sum is decreasing, and is a decreasing function. Note that once at least for one , we have
and therefore the function becomes strictly decreasing. In the limit, we have and . Assuming (which ensures the feasibility of the constraints and ), we know that the function has exactly one root.
Negative bound
Let’s find such that for . We start by bounding the term
where . We move to the left and square the inequality to remove the square root:
Eventually, we can extract a bound for to ensure that the inequality is satisfied for a chosen :
Let’s set , where is the number of elements in the sum, and take the maximum over to obtain the bound:
We observe the validity of this bound provided that for all . Then, it can be verified that
which proves that is a valid negative bound.
Positive bound
Similarly, let’s find such that for . This time, we will bound from below and obtain
We can, therefore, define
References
- [1] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality. Mathematics of operations research, 35(2):438–457, 2010.
- [2] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
- [3] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
- [4] Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12):4164–4169, 2004.
- [5] Stefania Cacovich, Fabio Matteocci, Mojtaba Abdi-Jalebi, Samuel D Stranks, Aldo Di Carlo, Caterina Ducati, and Giorgio Divitini. Unveiling the chemical composition of halide perovskite films using multivariate statistical analyses. ACS Applied Energy Materials, 1(12):7174–7181, 2018.
- [6] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE transactions on pattern analysis and machine intelligence, 33(8):1548–1560, 2010.
- [7] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences, 92(3):708–721, 2009.
- [8] Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Hierarchical als algorithms for nonnegative matrix and 3d tensor factorization. In International Conference on Independent Component Analysis and Signal Separation, pages 169–176. Springer, 2007.
- [9] Yu-Hong Dai and Yaxiang Yuan. A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on optimization, 10(1):177–182, 1999.
- [10] Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu. Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis. Neural computation, 21(3):793–830, 2009.
- [11] Dan Fu, Gary Holtom, Christian Freudiger, Xu Zhang, and Xiaoliang Sunney Xie. Hyperspectral imaging with stimulated raman scattering by chirped femtosecond lasers. The Journal of Physical Chemistry B, 117(16):4634–4640, 2013.
- [12] Nicolas Gillis. The why and how of nonnegative matrix factorization. In Regularization, Optimization, Kernels, and Support Vector Machines, pages 275–310. Chapman and Hall/CRC, 2014.
- [13] Prem Gopalan, Jake M Hofman, and David M Blei. Scalable recommendation with poisson factorization. arXiv preprint arXiv:1311.1704, 2013.
- [14] Filip Hanzely, Peter Richtarik, and Lin Xiao. Accelerated bregman proximal gradient methods for relatively smooth convex optimization. Computational Optimization and Applications, 79:405–440, 2021.
- [15] Niao He, Zaid Harchaoui, Yichen Wang, and Le Song. Fast and simple optimization for poisson likelihood models. arXiv preprint arXiv:1608.01264, 2016.
- [16] Le Thi Khanh Hien and Nicolas Gillis. Algorithms for nonnegative matrix factorization with the kullback–leibler divergence. Journal of Scientific Computing, 87(3):1–32, 2021.
- [17] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57, 1999.
- [18] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1064–1072, 2011.
- [19] BR Jany, Arkadiusz Janas, and Franciszek Krok. Retrieving the quantitative chemical information at nanoscale from scanning electron microscope energy dispersive x-ray measurements by machine learning. Nano letters, 17(11):6520–6525, 2017.
- [20] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
- [21] Ramakrishnan Kannan, AV Ievlev, Nouamane Laanait, Maxim A Ziatdinov, Rama K Vasudevan, Stephen Jesse, and Sergei V Kalinin. Deep data analysis via physically constrained linear unmixing: universal framework, domain examples, and a community-wide platform. Advanced Structural and Chemical Imaging, 4(1):1–20, 2018.
- [22] Hideaki Kano, Hiroki Segawa, Masanari Okuno, Philippe Leproux, and Vincent Couderc. Hyperspectral coherent raman imaging–principle, theory, instrumentation, and applications to life sciences. Journal of Raman Spectroscopy, 47(1):116–123, 2016.
- [23] Hyunsoo Kim and Haesun Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM journal on matrix analysis and applications, 30(2):713–730, 2008.
- [24] Jingu Kim, Yunlong He, and Haesun Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014.
- [25] Jingu Kim and Haesun Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011.
- [26] Nikos Komodakis and Jean-Christophe Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine, 32(6):31–54, 2015.
- [27] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
- [28] Paul G Kotula, Michael R Keenan, and Joseph R Michael. Automated analysis of sem x-ray spectral images: A powerful new microanalysis tool. Microscopy and Microanalysis, 9(1):1–17, 2003.
- [29] Daniel Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 556–562. MIT Press, 2001.
- [30] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
- [31] Qiuwei Li, Zhihui Zhu, Gongguo Tang, and Michael B Wakin. Provable bregman-divergence based methods for nonconvex and non-lipschitz problems. arXiv preprint arXiv:1904.09712, 2019.
- [32] Xinghua Li, Liyuan Wang, Qing Cheng, Penghai Wu, Wenxia Gan, and Lina Fang. Cloud removal in remote sensing images using nonnegative matrix factorization and error correction. ISPRS journal of photogrammetry and remote sensing, 148:103–113, 2019.
- [33] Chih-Jen Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007.
- [34] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10):2756–2779, 2007.
- [35] Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
- [36] Xiaoqiang Lu, Hao Wu, Yuan Yuan, Pingkun Yan, and Xuelong Li. Manifold regularized sparse nmf for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 51(5):2815–2826, 2012.
- [37] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
- [38] Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 198–207, 2005.
- [39] Pentti Paatero. Least squares formulation of robust non-negative factor analysis. Chemometrics and intelligent laboratory systems, 37(1):23–35, 1997.
- [40] Ioannis Panageas, Georgios Piliouras, and Xiao Wang. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. Advances in Neural Information Processing Systems, 32, 2019.
- [41] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
- [42] Clément W Royer, Michael O’Neill, and Stephen J Wright. A newton-cg algorithm with complexity guarantees for smooth unconstrained optimization. Mathematical Programming, 180(1):451–488, 2020.
- [43] Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018.
- [44] Joseph Salmon, Zachary Harmany, Charles-Alban Deledalle, and Rebecca Willett. Poisson noise reduction with non-local pca. Journal of mathematical imaging and vision, 48(2):279–294, 2014.
- [45] Motoki Shiga, Kazuyoshi Tatsumi, Shunsuke Muto, Koji Tsuda, Yuta Yamamoto, Toshiyuki Mori, and Takayoshi Tanji. Sparse modeling of eels and edx spectral imaging data by nonnegative matrix factorization. Ultramicroscopy, 170:43–59, 2016.
- [46] Ajit P Singh and Geoffrey J Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 650–658, 2008.
- [47] Paris Smaragdis and Judith C Brown. Non-negative matrix factorization for polyphonic music transcription. In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No. 03TH8684), pages 177–180. IEEE, 2003.
- [48] Dennis L Sun and Cedric Fevotte. Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 6201–6205. IEEE, 2014.
- [49] Leo Taslaman and Björn Nilsson. A framework for regularized non-negative matrix factorization, with application to the analysis of gene expression data. PloS one, 7(11):e46331, 2012.
- [50] Marc Teboulle and Yakov Vaisbourd. Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM Journal on Imaging Sciences, 13(1):381–421, 2020.
- [51] Adrien Teurtrie, Nathanaël Perraudin, Thomas Holvoet, Hui Chen, Duncan TL Alexander, Guillaume Obozinski, and Cécile Hébert. espm: A python library for the simulation of stem-edxs datasets. Ultramicroscopy, page 113719, 2023.
- [52] Adrien Teurtrie, Nathanaël Perraudin, Thomas Holvoet, Hui Chen, Duncan TL Alexander, Guillaume Obozinski, and Cécile Hébert. From stem-edxs data to phase separation and quantification using physics-guided nmf. To appear, 2024.
- [53] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109:475–494, 2001.
- [54] Musundi B Wabuyele, Fei Yan, Guy D Griffin, and Tuan Vo-Dinh. Hyperspectral surface-enhanced raman imaging of labeled silver nanoparticles in single cells. Review of scientific instruments, 76(6):063710, 2005.
- [55] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.
- [56] Felipe Yanez and Francis Bach. Primal-dual algorithms for non-negative matrix factorization with the kullback-leibler divergence. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2257–2261. IEEE, 2017.
- [57] Andrew B Yankovich, Chenyu Zhang, Albert Oh, Thomas JA Slater, Feridoon Azough, Robert Freer, Sarah J Haigh, Rebecca Willett, and Paul M Voyles. Non-rigid registration and non-local principle component analysis to improve electron microscopy spectrum images. Nanotechnology, 27(36):364001, 2016.
- [58] Minchao Ye, Yuntao Qian, and Jun Zhou. Multitask sparse nonnegative matrix factorization for joint spectral–spatial hyperspectral imagery denoising. IEEE Transactions on Geoscience and Remote Sensing, 53(5):2621–2639, 2014.
- [59] Chenyu Zhang, Rungang Han, Anru R Zhang, and Paul M Voyles. Denoising atomic resolution 4d scanning transmission electron microscopy data with tensor singular value decomposition. Ultramicroscopy, 219:113123, 2020.
- [60] Changzhong Zou and Youshen Xia. Restoration of hyperspectral image contaminated by poisson noise using spectral unmixing. Neurocomputing, 275:430–437, 2018.