Efficient algorithms for regularized Poisson Non-negative Matrix Factorization

Nathanaël Perraudin, Swiss Data Science Center, EPFL and ETH Zürich, Andreasstrasse 5, Zürich, 8050, Zürich, Switzerland

Adrien Teutrie, Unité Matériaux et Transformations, UMR-CNRS 8207, Université de Lille, Cité scientifique, Bâtiment C6, Villeneuve d’Ascq, 59655, Nord, France

Cécile Hébert, Electron Spectrometry and Microscopy Laboratory, Institute of Physics, EPFL, Bâtiment PH, Station 3, Lausanne, 1015, Vaud, Switzerland

Guillaume Obozinski, Swiss Data Science Center, EPFL and ETH Zürich, Andreasstrasse 5, Zürich, 8050, Zürich, Switzerland

Abstract

We consider the regularized Poisson Non-negative Matrix Factorization (NMF) problem, encompassing various regularization terms such as gradient Lipschitz and relatively smooth functions, alongside linear constraints. This problem is relevant to numerous Machine Learning applications, particularly physical linear unmixing problems. A notable challenge arises from the main loss term in the Poisson NMF problem being a KL divergence, which is not gradient Lipschitz, rendering traditional gradient descent-based approaches inefficient. In this contribution, we explore the use of Block Successive Upper-bound Minimization (BSUM) to overcome this challenge. We build appropriate majorizing functions for gradient Lipschitz and relatively smooth functions, and show how to introduce linear constraints into the problem. This results in two novel algorithms for regularized Poisson NMF. We conduct numerical simulations to showcase the effectiveness of our approach.

Disclaimer

This document is a technical report and has not undergone peer review. The findings and conclusions presented herein are solely based on the authors’ research and analysis. We apologize for any potential errors or shortcomings in the content.

1 Introduction

The problem of factorizing a matrix $\bm{Y}\approx\bm{W}\bm{H}$ into non-negative components $\bm{W}\geq 0,\bm{H}\geq 0$ is central in many Machine Learning (ML) applications [47, 10, 4]. The motivation for performing such a factorization is that $\bm{Y}$ is often associated with a probability distribution of the form $P_{\bm{Y}}(\bm{W},\bm{H})=\tilde{P}_{\bm{Y}}(\bm{W}\bm{H})$ . Typically, the optimal decomposition is found by minimizing the negative log-likelihood of that distribution:

\operatorname*{minimize}_{\bm{W}\geq 0,\bm{H}\geq 0}~~-\sum_{i,j}\log\left(P_{\bm{Y}}\left(\bm{W},\bm{H}\right)\right)

(1)

If $\bm{Y}$ is assumed to be perturbed with Normal noise, we obtain a Gaussian distribution, i.e. $P_{\bm{Y}}\left(\bm{W},\bm{H}\right)\propto e^{-\|\bm{W}\bm{H}-\bm{Y}\|^{2}}$ , and we end up with the classic non-negative matrix factorization (NMF) problem [29, 30], where the quadratic function $\|\bm{W}\bm{H}-\bm{Y}\|^{2}$ is minimized. For other distribution families, and in particular exponential families, a $\log$ term often appears. In particular, the Poisson negative log-likelihood model [29] leads to a loss of the form

\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right):=-\left\langle\bm{Y},\log\left(\bm{W}\bm{H}\right)\right\rangle+\left\langle\mathbf{1},\bm{W}\bm{H}\right\rangle\propto-\sum_{i,j}\log\left(P_{\bm{Y}}\left(\bm{W},\bm{H}\right)\right),

(2)

where the inner product over matrices is the Frobenius inner product defined as $\left\langle\bm{A},\bm{B}\right\rangle=\sum_{ij}a_{ij}b_{ij}$ .
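As a quick numerical illustration of (2), the following NumPy sketch evaluates the loss and checks that it differs from the exact Poisson negative log-likelihood only by a term that does not depend on $\bm{W},\bm{H}$ (the sum of $\log(y_{ij}!)$ ). The function names, matrix sizes and random test data are assumptions made for this illustration.

```python
import math
import numpy as np

def poisson_nmf_loss(Y, W, H):
    """Loss (2): -<Y, log(WH)> + <1, WH>, with Frobenius inner products."""
    WH = W @ H
    return -np.sum(Y * np.log(WH)) + np.sum(WH)

def poisson_nll(Y, W, H):
    """Exact Poisson negative log-likelihood of Y with rate matrix WH."""
    WH = W @ H
    log_factorial = np.vectorize(math.lgamma)(Y + 1.0)  # log(y!) entrywise
    log_pmf = Y * np.log(WH) - WH - log_factorial
    return -np.sum(log_pmf)

rng = np.random.default_rng(0)
Y = rng.poisson(3.0, size=(4, 5)).astype(float)
W1, H1 = rng.uniform(0.5, 2.0, (4, 2)), rng.uniform(0.5, 2.0, (2, 5))
W2, H2 = rng.uniform(0.5, 2.0, (4, 2)), rng.uniform(0.5, 2.0, (2, 5))

# The gap between the two quantities is the same for any factors (W, H).
gap1 = poisson_nll(Y, W1, H1) - poisson_nmf_loss(Y, W1, H1)
gap2 = poisson_nll(Y, W2, H2) - poisson_nmf_loss(Y, W2, H2)
print(gap1, gap2)
```

Minimizing (2) is therefore equivalent to maximizing the Poisson likelihood, since the gap does not depend on the factors.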

Regularized Poisson Non Negative Matrix Factorisation

In many problems (see, for example, [51, 50, 21, 52]), additional prior information about the matrices $\bm{W},\bm{H}$ is known. For example, it might be known that the columns of $\bm{H}$ are smooth, or that the rows of $\bm{W}$ are sparse. One might also be interested in normalizing to unity the columns of $\bm{H}$ or the rows of $\bm{W}$ because they might quantify physical quantities for which normalization is necessary. For example, in the analysis of hyperspectral imaging data, the images $\bm{H}$ are assumed to be smooth and the components $\bm{W}$ are assumed to sum to unity [51, 52]. Typically, this information can be encoded via an extra regularization term $R\left(\bm{W},\bm{H}\right)$ and/or additional constraints $\bm{W}\in\mathcal{C}_{1},\bm{H}\in\mathcal{C}_{2}$ . This leads to the general optimization problem we solve in this contribution:

\begin{split}\operatorname*{minimize}_{\bm{W},\bm{H}}~~&\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right){\color[rgb]{.75,.5,.25}+R_{W}\left(\bm{W}\right)+R_{H}\left(\bm{H}\right),}\\ \text{subject to}~~&{\color[rgb]{.5,0,.5}\bm{W}\geq\epsilon,\bm{H}\geq\epsilon},~~{\color[rgb]{0,1,1}\bm{e}_{H}^{\top}\bm{H}=\bm{1}\text{ or }\bm{W}\bm{e}_{W}=\bm{1}^{\top}}\end{split}

(3)

where $\epsilon>0$ . The different colors emphasize the changes compared to the traditional problem of [29]. First, in violet, we slightly simplify the problem by imposing strict positivity ( $\bm{W},\bm{H}\geq\epsilon>0$ ). While this is not strictly necessary (the case $\epsilon=0$ could be handled with an approach similar to [33]), this assumption significantly simplifies our analysis. We believe that handling $\epsilon=0$ is an unnecessary complication, as the results with $\epsilon$ close to machine precision will be practically identical. Second, in light blue, we consider the case where the constraints are linear, specifically $\bm{e}_{H}^{\top}\bm{H}=\bm{1}$ or $\bm{W}\bm{e}_{W}=\bm{1}^{\top}$ . In general, it is only meaningful to use one of the constraints, as it will fix the ratio between $\bm{W}$ and $\bm{H}$ . We note that this includes the simplex constraint when $\bm{e}=\bm{1}$ . Third, in brown, we consider regularizations $R_{W}\left(\bm{W}\right)+R_{H}\left(\bm{H}\right)$ of the form:

r\left(\bm{x}\right)=s_{L}\left(\bm{x}\right)+s_{R}\left(\bm{x}\right)+\sum_{j=1}^{n}s_{C}\left(x_{j}\right),

(4)

where $\bm{x}$ is a row of $\bm{W}$ or a column of $\bm{H}$ , i.e., $R_{W}\left(\bm{W}\right)=\sum_{i}r_{W}\left(\bm{w}_{i}\right)$ and $R_{H}\left(\bm{H}\right)=\sum_{j}r_{H}\left(\bm{h}_{j}\right).$

The term $s_{L}$ is assumed to be $\sigma_{L}$ gradient Lipschitz, i.e., there exists a $\sigma_{L}$ such that

\|\nabla s_{L}\left(\bm{x}\right)-\nabla s_{L}\left(\bm{y}\right)\|_{2}\leq\sigma_{L}\|\bm{x}-\bm{y}\|_{2}\quad\text{for all }\bm{x},\bm{y}\in\mathcal{C}.

(5)

In particular, this condition implies the following quadratic upper bound:

s_{L}\left(\bm{y}\right)\leq s_{L}\left(\bm{x}\right)+\left\langle\nabla s_{L}\left(\bm{x}\right),\bm{y}-\bm{x}\right\rangle+\frac{\sigma_{L}}{2}\|\bm{x}-\bm{y}\|_{2}^{2}\quad\text{for all }\bm{x},\bm{y}\in\mathcal{C}.

In this contribution, we will consider in particular $s_{L}\left(\bm{x}\right)=\bm{x}^{\top}\Delta\bm{x}$ , where $\Delta$ is the Laplacian operator, favoring smoothness in the columns of $\bm{H}$ . In this case $\sigma_{L}=2\lambda_{\text{max}}(\Delta)$ .
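To make the Laplacian example concrete, the sketch below builds a small Laplacian, computes $\sigma_{L}=2\lambda_{\text{max}}(\Delta)$ , and checks the quadratic upper bound on a random pair of points. The choice of a path-graph Laplacian, the dimension and the helper names are assumptions made for this illustration; the text does not specify which Laplacian is used.

```python
import numpy as np

def path_laplacian(n):
    """Graph Laplacian D - A of a path with n nodes (assumed example)."""
    A = np.zeros((n, n))
    i = np.arange(n - 1)
    A[i, i + 1] = 1.0
    A[i + 1, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def s_L(x, lap):
    """Smoothness regularizer x^T Laplacian x."""
    return x @ lap @ x

def grad_s_L(x, lap):
    """Gradient 2 * Laplacian * x (valid because the Laplacian is symmetric)."""
    return 2.0 * lap @ x

n = 8
lap = path_laplacian(n)
sigma_L = 2.0 * np.linalg.eigvalsh(lap)[-1]  # eigvalsh returns ascending eigenvalues

rng = np.random.default_rng(0)
x, y = rng.uniform(0.1, 1.0, n), rng.uniform(0.1, 1.0, n)

# Quadratic upper bound of s_L around x, evaluated at y.
upper = (s_L(x, lap) + grad_s_L(x, lap) @ (y - x)
         + 0.5 * sigma_L * np.sum((x - y) ** 2))
print(s_L(y, lap), upper)
```

For this quadratic regularizer the gap between the bound and the function is exactly $(\bm{y}-\bm{x})^{\top}(\lambda_{\text{max}}\bm{I}-\Delta)(\bm{y}-\bm{x})\geq 0$ , which is why $2\lambda_{\text{max}}(\Delta)$ is the smallest valid constant.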

The term $s_{R}\left(\bm{x}\right)$ is assumed to be $\sigma_{R}$ relatively smooth with respect to $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$ . Relative smoothness is a generalization of Lipschitz smoothness, and is defined as follows [35, Definition 1.1]:

f\left(\bm{y}\right)\leq f\left(\bm{x}\right)+\left\langle\nabla f\left(\bm{x}\right),\bm{y}-\bm{x}\right\rangle+\sigma_{R}\mathcal{B}_{\kappa}\left(\bm{y},\bm{x}\right)

(6)

where $\mathcal{B}_{\kappa}$ is the Bregman divergence [3] associated with $\kappa$ [35, equation 7]:

\mathcal{B}_{\kappa}\left(\bm{y},\bm{x}\right):=\kappa\left(\bm{y}\right)-\kappa\left(\bm{x}\right)-\left\langle\nabla\kappa\left(\bm{x}\right),\bm{y}-\bm{x}\right\rangle\quad\text{for all }\bm{x},\bm{y}\in\mathcal{C}.

While gradient Lipschitz functions can be upper bounded with quadratic functions, we observe from (6) that relative smoothness allows us to upper bound a function with the Bregman divergence of a function $\kappa$ . This allows us to use a much wider range of functions to regularize our problem, and in particular functions that are not gradient Lipschitz. As an example, let us consider $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$ , with its Bregman divergence:

\mathcal{B}_{\kappa}\left(\bm{y},\bm{x}\right)=\sum_{i}\left(\frac{y_{i}}{x_{i}}-\log\left(\frac{y_{i}}{x_{i}}\right)-1\right)

One can observe that the objective function (2) is relatively smooth with respect to $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$ . This term could also be used to introduce soft constraints such as a log-barrier: $s_{R}\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}-\epsilon\bm{1}\right)$ for $\bm{x}>\epsilon$ and $+\infty$ otherwise.
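The following sketch checks the closed form of $\mathcal{B}_{\kappa}$ against its definition and then tests inequality (6) numerically for one column of the Poisson loss, $f(\bm{x})=-\sum_{i}b_{i}\log(\bm{a}_{i}^{\top}\bm{x})+\sum_{i}\bm{a}_{i}^{\top}\bm{x}$ . The constant $\sigma_{R}=\sum_{i}b_{i}$ used here is an assumption of this sketch and is not stated in the text; it is a sufficient constant that follows from the convexity of $t\mapsto t-1-\log t$ . All names and random data are hypothetical.

```python
import numpy as np

def kappa(x):
    return -np.sum(np.log(x))

def bregman_kappa(y, x):
    """Closed form of the Bregman divergence of kappa(x) = -1^T log(x)."""
    r = y / x
    return np.sum(r - np.log(r) - 1.0)

def bregman_from_definition(y, x):
    grad_kappa_x = -1.0 / x
    return kappa(y) - kappa(x) - grad_kappa_x @ (y - x)

def poisson_term(x, A, b):
    """f(x) = -sum_i b_i log(a_i^T x) + sum_i a_i^T x."""
    Ax = A @ x
    return -b @ np.log(Ax) + np.sum(Ax)

def grad_poisson_term(x, A, b):
    return A.T @ (1.0 - b / (A @ x))

rng = np.random.default_rng(0)
m, n = 6, 3
A = rng.uniform(0.1, 1.0, (m, n))          # plays the role of a fixed factor
b = rng.poisson(4.0, m).astype(float)      # plays the role of one column of Y
x, y = rng.uniform(0.1, 2.0, n), rng.uniform(0.1, 2.0, n)

sigma_R = b.sum()  # assumed sufficient relative-smoothness constant
lhs = poisson_term(y, A, b)
rhs = (poisson_term(x, A, b) + grad_poisson_term(x, A, b) @ (y - x)
       + sigma_R * bregman_kappa(y, x))
print(bregman_kappa(y, x), bregman_from_definition(y, x), lhs <= rhs)
```

Unlike a quadratic bound, the right-hand side of (6) grows to infinity as any entry of $\bm{y}$ approaches zero, which is what lets it dominate the logarithmic data term.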

Finally, $s_{C}\left(x\right)$ is a smooth point-wise concave function (i.e. $-s_{C}$ is convex). A typical example of this regularization is $s_{C}\left(x\right)=\log\left(x+\alpha^{-1}\right)$ , which favors sparsity in the vector $\bm{x}$ without penalizing large values too heavily. Its slope starts at $\alpha$ for $x=0$ and tends to $0$ for $x\rightarrow\infty$ .
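The behaviour of this sparsity penalty can be inspected with a few lines of NumPy. The value of $\alpha$ and the evaluation points below are arbitrary assumptions chosen for illustration.

```python
import numpy as np

alpha = 5.0  # assumed value, for illustration only

def s_C(x):
    """Concave sparsity penalty log(x + 1/alpha)."""
    return np.log(x + 1.0 / alpha)

def ds_C(x):
    """Derivative of s_C: equals alpha at x = 0 and decays like 1/x."""
    return 1.0 / (x + 1.0 / alpha)

x = np.array([0.0, 0.1, 1.0, 10.0, 1000.0])
print(ds_C(x))

# Increase of the penalty between x = 10 and x = 1000, compared with a
# linear (l1-type) penalty having the same slope alpha at the origin.
print(s_C(1000.0) - s_C(10.0), alpha * (1000.0 - 10.0))
```

Small entries are pushed towards zero with a slope close to $\alpha$ , while large entries are penalized only logarithmically.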

Our approach can handle regularizations of the form $R\left(\bm{W},\bm{H}\right).$ However, for simplicity of notation, we restrict ourselves to separable regularizations.

The fidelity term $\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right)$ is convex in $\bm{H}$ and in $\bm{W}$ , but jointly non convex. We note that depending on the regularization and the constraints, multiple equivalent scaled solutions (stationary points) could exist, i.e., $\bm{W}\bm{H}=\bm{W}^{\prime}\bm{H}^{\prime}$ for $\bm{W}^{\prime}=\alpha^{-1}\bm{W}$ and $\bm{H}^{\prime}=\alpha\bm{H}$ . However, this is generally not the case with additional constraints or regularizations.

Why is this problem challenging?

In general, Poisson Non-negative Matrix Factorization, i.e. minimizing $\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right)$ , is challenging because the loss is not gradient Lipschitz despite being differentiable for $\bm{W},\bm{H}>0$ . (Strictly speaking, $\mathcal{L}_{\bm{Y}}$ is gradient Lipschitz according to the definition because of the constraint $\bm{W},\bm{H}\geq\epsilon$ . However, in practice, the constant $\epsilon$ is chosen to be so small that the actual Lipschitz constant is too large to be useful.) In practice, this implies that there exists no usable fixed learning rate that ensures convergence of gradient descent and that line search would have to be used. For that reason, solving the more general problem (3) is a difficult task, and to the best of our knowledge, there are no existing algorithms that can be directly applied to it. Although there exist many algorithms to solve the traditional Poisson NMF problem [16, 30, 29, 24, 12], none of them focuses on the regularized case (see the related work in Section 2). Our main contribution is to fill this gap by providing multiple algorithms that minimize (3) for a wide range of regularizations $R\left(\bm{W},\bm{H}\right)$ described by (4) and some linear constraints.
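A scalar analogue makes the size of the problem tangible. For $f(x)=-b\log(x)+x$ on $x\geq\epsilon$ , the second derivative $b/x^{2}$ is largest at $x=\epsilon$ , so the gradient Lipschitz constant is $b/\epsilon^{2}$ . The sketch below tabulates this constant and the corresponding classical fixed step size $1/L$ ; the scalar model and the values of $b$ and $\epsilon$ are assumptions made for this illustration.

```python
def lipschitz_constant(b, eps):
    """Supremum of f''(x) = b / x**2 over x >= eps for f(x) = -b*log(x) + x."""
    return b / eps**2

def grad_f(x, b):
    return -b / x + 1.0

b = 1.0
for eps in (1e-2, 1e-6, 1e-12):
    L = lipschitz_constant(b, eps)
    print(f"eps={eps:g}  L={L:g}  fixed step 1/L={1.0 / L:g}")
```

With $\epsilon$ near machine precision the admissible fixed step is vanishingly small, which is why a bound that adapts to the current iterate is needed.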

Our approach

A natural approach to minimize (3) is to optimize over one of the variables $\bm{W},\bm{H}$ at a time. For example, Block Coordinate Descent (BCD) can be expressed as

\bm{W}^{t+1}\leftarrow\operatorname*{minimize}_{\bm{W}\in\mathcal{C}_{\bm{W}}}~~\mathcal{L}\left(\bm{W},\bm{H}^{t}\right),

(7)

\bm{H}^{t+1}\leftarrow\operatorname*{minimize}_{\bm{H}\in\mathcal{C}_{\bm{H}}}~~\mathcal{L}\left(\bm{W}^{t+1},\bm{H}\right).

(8)

This type of iterative scheme ensures that the loss does not increase between iterations and has been successfully used for the L2 case. However, in the Poisson case, there is no closed-form solution for problems (7) and (8), making this approach generally computationally expensive.

Fortunately, in practice, one does not need to find the global minima of (7) and (8) at each iteration. Instead, using a Block Successive Upper-bound Minimization (BSUM) algorithm [41], it is sufficient to minimize approximations of $\mathcal{L}$ which are locally tight upper bounds of $\mathcal{L}$ . To use BSUM efficiently, these approximation functions need to have three properties: (1) they satisfy the hypotheses of the BSUM Theorem [41, Theorem 2], (2) they are as tight as possible, and (3) they are easy to optimize, i.e. they lead to a closed-form solution for each subproblem.

Our contributions can be summarized as follows. We show how regularized Poisson NMF can be efficiently solved using BSUM. We derive tight upper bounds for multiple regularizers and compare our approach with traditional algorithms. We also propose a simple way to introduce linear constraints into the problem and suggest using line search to build even tighter upper bounds. Finally, we propose multiple algorithms for regularized Poisson NMF and conduct numerical simulations to demonstrate the effectiveness of our approach.

Outline of this contribution

In Section 2, we provide a review of the literature. In Section 3, we clarify the notation and provide the necessary definitions for the BSUM Theorem [41, Theorem 2] that will be used for the convergence of our algorithm. In Section 4, we develop convenient approximations of the objective and regularization functions leading to sub-problems with closed-form solutions. In Section 4.3, we explore how to modify the optimization scheme to introduce generalized simplex constraints. In Section 5, we present our algorithms. Section 6 provides numerical applications of our algorithm, and Section 7 concludes this contribution.

2 Related work

Applications of Poisson Distribution likelihood Maximization

The maximization of likelihood in a Poisson distribution finds relevance in various applications, prompting the resolution of the problem outlined in Equation (3).

Many such applications arise in the domain of physical constrained linear unmixing problems [21]. Some noteworthy instances encompass: 1. Scanning transmission electron microscopy (STEM) [52, 45, 5, 19], 2. Hyperspectral Raman and optical imaging [22, 54, 11], 3. Tensor SVD applied to denoise atomic-resolution 4D scanning transmission electron microscopy [59], and 4. Non-local Poisson PCA denoising [44, 57]. It is noteworthy that many of these applications predominantly employ the L2 case, which offers a comparatively simpler solution. As a result, data is often renormalized to convert Poisson distributions into Gaussian distributions [28]. Nevertheless, the efficacy of these applications could be significantly enhanced by the development of algorithms tailored explicitly for the Poisson case.

Furthermore, Problem (3) also surfaces in hyperspectral image denoising, where noise is assumed to follow a Poisson distribution [60, 58]. In the domain of text mining, the Poisson distribution assumption is frequently utilized for modeling word occurrences based on latent variables such as categories, leading to the problem formulation depicted by (3) [17, 38]. Additionally, within the context of recommender systems, several matrix factorization problems can be reformulated into the structure of (3) [46, 13, 27].

Other Optimization Approaches

The literature offers a limited number of optimization methods suitable for addressing the problem presented by Equation (3). This scarcity arises because many optimization techniques require a continuously differentiable, gradient Lipschitz objective function. Examples of such techniques include gradient descent [40], perturbed gradient descent [20], nonlinear conjugate gradient method [9], various proximal point minimization algorithms [26], and second-order methods like the Newton-CG algorithms [42, 43].

One potentially attractive direction is the utilization of Proximal Alternating (Linearized) Minimization (PALM) [2] or Proximal Alternating Minimization (PAM) [1]. These algorithms are designed to solve problems of the form:

\operatorname*{minimize}_{\bm{W},\bm{H}}~~R_{W}\left(\bm{W}\right)+R_{H}\left(\bm{H}\right)+\mathcal{L}\left(\bm{W},\bm{H}\right)

PALM employs a Gauss-Seidel iteration scheme, consisting of the following sub-problems:

\begin{split}\bm{W}^{k+1}&=\operatorname*{arg\,min}_{\bm{W}}R_{W}\left(\bm{W}\right)+\mathcal{L}\left(\bm{W},\bm{H}^{k}\right)+c_{W}\left\|\bm{W}-\bm{W}^{k}\right\|_{2}^{2}\\ \bm{H}^{k+1}&=\operatorname*{arg\,min}_{\bm{H}}R_{H}\left(\bm{H}\right)+\mathcal{L}\left(\bm{W}^{k+1},\bm{H}\right)+c_{H}\left\|\bm{H}-\bm{H}^{k}\right\|_{2}^{2}\end{split}

Unfortunately, PALM requires the objective function $\mathcal{L}$ to possess a Lipschitz gradient, which is not the case in our scenario. Additionally, the Gauss-Seidel iterations generally lack a closed-form solution, resulting in a slow algorithm with sub-iterations.

To overcome the non-Lipschitz gradient issue, Bregman gradient descent (B-GD) [31, Algorithm 1.1] can be considered. This type of algorithm has been extended to alternating minimization [31, Algorithm 1.3 and 1.4]. Such an approach can be adapted to our case, as the objective function (3) exhibits relative smoothness for most regularization scenarios (see (6)). However, this optimization scheme involves non-tight majorization functions, leading to slow convergence, as discussed in Section 5.2.

Given the presence of two blocks of variables, Block Coordinate Descent (BCD) algorithms [53] naturally emerge as a potential solution. In fact, previous work [24] demonstrates that many existing approaches can be viewed as Block Coordinate Descent (BCD) problems. However, a primary challenge with BCD lies in its propensity to necessitate full minimization of the sub-problems, which proves to be challenging in the Poisson case. In the L2 case, the subproblems often have closed-form solutions. To address this concern, we explore BSUM algorithms [41] in this study, a generalization of BCD that avoids the requirement for full minimization of the subproblems, instead using upper bounds for the objective function.

Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) algorithms have been extensively studied, and for a comprehensive review, one can refer to [12]. Initially formulated as Positive Matrix Factorization for the Gaussian (L2-NMF) case by [39], NMF has seen numerous algorithmic developments. In the L2 case, popular approaches include the Alternating Nonnegative Least Squares (ANLS) framework [23, 25, 34] and the Hierarchical Alternating Least Squares (HALS) method [7, 8]. As for the Poisson case, known as KL NMF, the first algorithm using Multiplicative Updates (MU) was proposed by [30], with later demonstrations of its convergence provided in [29] for both Poisson and Gaussian cases. A more rigorous convergence analysis is presented in [33].

Considering the specific problem of Poisson KL-NMF, there have been a few notable contributions. For instance, [48] employed the Alternating Direction Method of Multipliers (ADMM) with the variable change $X=WH$ . The Primal Dual algorithm, based on the framework from Chambolle-Pock, was explored by [56]. Moreover, [16] conducted a comparative study of various optimization algorithms for KL NMF, including MU [29], ADMM [48], Primal Dual [56], and Cyclic Coordinate Descent Method [18]. They also introduced three new algorithms for KL-NMF, namely Block Mirror Descent Method, A Scalar Newton-Type Algorithm, and A Hybrid SN-MU Algorithm.

In the introduction, we mentioned that there are few contributions that address the problem of regularized NMF, with many of them focusing on the L2 case. [24] demonstrated how many existing works can be cast as Block Coordinate Descent (BCD) problems, allowing the derivation of MU update rules for different regularizers, such as L1 for sparsity. However, their work is limited to L2 NMF. Xu et al. [55] proposed a general optimization scheme for block multiconvex optimization using block coordinate descent, which can accommodate regularization on each block. Although applied to L2-NMF, such an approach may lead to algorithms with sub-iterations. In the context of L2-NMF with sparsity constraints, [50] presented an approach to address this scenario. Additionally, [49] introduced a framework for handling L2-NMF with Lipschitz regularizers, akin to our term $s_{L}$ . Other forms of regularization have also been explored, such as graph-based [6] or simplex constraint [17].

Regarding the specific problem of regularized KL loss, [15] provided a notable contribution. However, their work focused solely on the subproblems of the NMF problem, rather than addressing the NMF problem itself. Notably, one could potentially employ a similar approach to solve the subproblems of (3) using BCD. Nonetheless, this would result in a less efficient algorithm with sub-iterations.

3 Preliminaries

3.1 Notation

We use bold capital letters for matrices and bold lowercase letters for vectors, e.g., $\bm{A},\bm{a}$ . We use $\bm{a}_{i}$ to refer to the $i$ -th numbered vector, and $a_{i}$ to denote the $i$ -th element of vector $\bm{a}$ . The $j$ -th element of vector $\bm{a}_{i}$ or the element at the $i$ -th row and $j$ -th column of matrix $\bm{A}$ is denoted as $a_{ij}$ .

$\bm{A}\geq 0$ and $\bm{a}\geq 0$ indicate that all entries of matrix $\bm{A}$ or vector $\bm{a}$ are greater than or equal to $0$ , i.e., $a_{ij}\geq 0$ for all $i,j$ . $\bm{A}^{\top},\bm{a}^{\top}$ represent the transpose of $\bm{A}$ and $\bm{a}$ , respectively. We use $\bm{A}^{t}$ to denote $\bm{A}$ at step $t$ . $\bm{A}^{t}{}^{\top}$ denotes the transpose of $\bm{A}^{t}$ . $\circ$ and $\oslash$ denote elementwise multiplication (also known as the Hadamard product) and division for matrices, respectively. For example, $[\bm{A}\circ\bm{B}]_{ij}=a_{ij}b_{ij}$ and $[\bm{A}\oslash\bm{B}]_{ij}=\frac{a_{ij}}{b_{ij}}$ .

As mentioned in the introduction, the matrix to be factorized is generally denoted as $\bm{Y}\approx\bm{W}\bm{H}$ , where $\bm{W}$ and $\bm{H}$ are its factors. We will use $\bm{w},\bm{h}$ as the vectorized versions of $\bm{W},\bm{H}$ , while $\bm{w}_{i},\bm{h}_{i}$ will denote the $i$ -th row of $\bm{W}$ and the $i$ -th column of $\bm{H}$ , respectively. $x,\bm{x}$ are general variables that can replace either $\bm{w}$ or $\bm{h}$ . We use $\mathcal{L}\left(\bm{W},\bm{H}\right)$ as the general loss function, and $f\left(\bm{x}\right)$ is broadly used to denote a multivariate scalar function.

Finally, we commonly employ calligraphic notation for variable domains. Let $\mathcal{X}$ serve as a generic domain for the variable $\bm{x}$ . In practical terms, it is frequently defined by the $\geq\epsilon$ constraint, specifically as $\mathcal{X}=\left\{\bm{x}\in\mathbb{R}^{m}|\bm{x}\geq\epsilon\right\}$ . Additionally, we use $\mathcal{C}_{w}$ and $\mathcal{C}_{h}$ to represent the domains of $\bm{w}$ and $\bm{h}$ , respectively.

3.2 Definitions

In this contribution, we consider the loss function $\mathcal{L}\left(\bm{w},\bm{h}\right)$ with two blocks of variables: $\bm{w}\in\mathcal{C}_{w}$ and $\bm{h}\in\mathcal{C}_{h}$ , where $\mathcal{C}_{w}\subset\mathbb{R}^{m_{w}}$ and $\mathcal{C}_{h}\subset\mathbb{R}^{m_{h}}$ are both non-empty convex sets. Here, $\bm{w}$ and $\bm{h}$ correspond to the vectorized versions of the two matrices $\bm{W}$ and $\bm{H}$ , respectively. Therefore, $\mathcal{C}_{w}$ and $\mathcal{C}_{h}$ often correspond to the sets $\bm{w}\geq\epsilon$ and $\bm{h}\geq\epsilon$ with $\epsilon>0$ . Let us use $\bm{z}=\left[\bm{w},\bm{h}\right]$ to denote all the variables. We have $\bm{z}\in\mathcal{C}=\mathcal{C}_{w}\times\mathcal{C}_{h}\subset\mathbb{R}^{m}=\mathbb{R}^{m_{w}}\times\mathbb{R}^{m_{h}}$ , where the total dimension of the problem is $m=m_{w}+m_{h}$ .

Definition 1 (Directional derivative).

Let $\mathcal{L}:\mathcal{C}\rightarrow\mathbb{R}$ be a scalar function, where $\mathcal{C}\subset\mathbb{R}^{m}$ is a convex set. The directional derivative of $\mathcal{L}$ at point $\bm{z}$ in the direction $\bm{d}$ is defined by

\mathcal{L}^{\prime}\left(\bm{z};\bm{d}\right)\coloneqq\lim_{\lambda\downarrow 0}\frac{\mathcal{L}\left(\bm{z}+\lambda\bm{d}\right)-\mathcal{L}\left(\bm{z}\right)}{\lambda}

Note that when $\mathcal{L}$ is differentiable, $\mathcal{L}^{\prime}(\bm{z};\bm{d})=\nabla\mathcal{L}(\bm{z})\bm{d}^{\top}$ since $\bm{d}$ and $\bm{z}$ are row vectors. In this contribution, almost all functions are differentiable on the domain of interest.
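The relation between the directional derivative and the gradient can be checked numerically on the Poisson loss (2). In the sketch below, the one-sided difference quotient of Definition 1 (with a small but finite $\lambda$ ) is compared with the inner product between the gradient with respect to $\bm{H}$ , namely $\bm{W}^{\top}(\bm{1}-\bm{Y}\oslash(\bm{W}\bm{H}))$ , and a direction $\bm{D}$ . The sizes, random data and function names are assumptions made for this illustration.

```python
import numpy as np

def loss(W, H, Y):
    WH = W @ H
    return -np.sum(Y * np.log(WH)) + np.sum(WH)

def grad_H(W, H, Y):
    """Gradient of the Poisson loss (2) with respect to H."""
    return W.T @ (1.0 - Y / (W @ H))

def directional_derivative_H(W, H, Y, D, lam=1e-6):
    """One-sided difference quotient from Definition 1, moving H along D."""
    return (loss(W, H + lam * D, Y) - loss(W, H, Y)) / lam

rng = np.random.default_rng(0)
Y = rng.poisson(5.0, (4, 6)).astype(float)
W = rng.uniform(0.5, 1.5, (4, 2))
H = rng.uniform(0.5, 1.5, (2, 6))
D = rng.standard_normal(H.shape)

numeric = directional_derivative_H(W, H, Y, D)
analytic = np.sum(grad_H(W, H, Y) * D)  # Frobenius inner product <grad, D>
print(numeric, analytic)
```

The two values agree up to the discretization error of the finite difference.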

Definition 2 (Coordinatewise Minimum).

The point $\bm{z}=[\bm{w},\bm{h}]\in\mathcal{C}$ is a coordinatewise minimum of a function $\mathcal{L}$ if

\mathcal{L}\left(\bm{w}+\bm{d}_{w},\bm{h}\right)\geq\mathcal{L}\left(\bm{w},\bm{h}\right)\quad\forall\bm{d}_{w}\in\mathbb{R}^{m_{w}}\quad\text{with}\quad\bm{w}+\bm{d}_{w}\in\mathcal{C}_{w}

\mathcal{L}\left(\bm{w},\bm{h}+\bm{d}_{h}\right)\geq\mathcal{L}\left(\bm{w},\bm{h}\right)\quad\forall\bm{d}_{h}\in\mathbb{R}^{m_{h}}\quad\text{with}\quad\bm{h}+\bm{d}_{h}\in\mathcal{C}_{h}

A coordinatewise minimum is a natural termination point for an alternating minimization algorithm. However, it is important to note that a coordinatewise minimum is not equivalent to a local minimum, as it does not guarantee minimality in all directions. Figure 1 (left) provides a counterexample illustrating this.

Another significant concept is the notion of a stationary point, where the directional derivative is non-negative in all feasible directions.

Definition 3 (Stationary Points of a function).

Let $\mathcal{L}:\mathcal{C}\rightarrow\mathbb{R}$ be a scalar function, where $\mathcal{C}\subset\mathbb{R}^{m}$ is a convex set. A point $\bm{z}\in\mathcal{C}$ is a stationary point of $\mathcal{L}$ if

\mathcal{L}^{\prime}\left(\bm{z};\bm{d}\right)\geq 0\quad\forall\bm{d}\text{ such that }\bm{z}+\bm{d}\in\mathcal{C}

We emphasize that a stationary point is not equivalent to a strict local minimum as there might be directions where the directional derivative equals 0. For example, in the simple function $f([\bm{w},\bm{h}])=(wh-2)^{2}$ , the point $[\bm{w},\bm{h}]=[\sqrt{2},\sqrt{2}]$ has a zero derivative in the direction $[1,-1]$ , which corresponds to rescaling the solution as $[\alpha,1/\alpha]$ . Even worse, a stationary point is not necessarily a local minimum, even if it is a coordinatewise minimum, as shown in Figure 1 (a). Here, in the diagonal directions, the directional derivative equals 0 but the function is concave in this direction.
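The toy function $f([w,h])=(wh-2)^{2}$ can be examined numerically. The sketch below evaluates $f$ along the rescaling curve $[\sqrt{2}\alpha,\sqrt{2}/\alpha]$ and computes the eigenvalues of the Hessian at $[\sqrt{2},\sqrt{2}]$ ; the helper names and the sampled values of $\alpha$ are assumptions made for this illustration.

```python
import numpy as np

def f(w, h):
    return (w * h - 2.0) ** 2

def grad_f(w, h):
    r = 2.0 * (w * h - 2.0)
    return np.array([r * h, r * w])

def hessian_f(w, h):
    r = w * h - 2.0
    off_diag = w * h + r
    return 2.0 * np.array([[h * h, off_diag], [off_diag, w * w]])

s = np.sqrt(2.0)

# f stays at its minimum value along the rescaling curve [s * a, s / a].
values = [f(s * a, s / a) for a in (0.5, 1.0, 2.0, 4.0)]
print(values)

# At [s, s] the Hessian has one flat direction, tangent to that curve.
eigvals, eigvecs = np.linalg.eigh(hessian_f(s, s))
print(eigvals)
print(eigvecs[:, 0])  # eigenvector of the smallest eigenvalue
```

The eigenvector associated with the vanishing eigenvalue is proportional to $[1,-1]$ , the tangent of the rescaling curve at $[\sqrt{2},\sqrt{2}]$ , so the minimum is not strict.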

When it comes to the Poisson loss function, there is at least a continuous set of local minima corresponding to rescaling the solution. This is illustrated in Figure 1 (b). Note that the introduction of regularization or constraints can lead to strict local minima, as shown in Figure 1 (c).

In this contribution, we prove convergence to a coordinatewise minimum that is also a stationary point. To accomplish this, we will consider a class of functions that are regular at their coordinatewise minima.

Definition 4 (Regularity of a function at a point).

The function $\mathcal{L}:\mathcal{C}\rightarrow\mathbb{R}$ is said to be regular at the point $\bm{z}\in\mathcal{C}$ if $\mathcal{L}^{\prime}(\bm{z};\bm{d})\geq 0$ for all $\bm{d}=[\bm{d}_{w},\bm{d}_{h}]\in\mathbb{R}^{m}$ such that $\mathcal{L}^{\prime}(\bm{z};[\bm{d}_{w},\bm{0}])\geq 0$ and $\mathcal{L}^{\prime}(\bm{z};[\bm{0},\bm{d}_{h}])\geq 0$ .

Lemma 1.

Continuously differentiable functions are regular at their coordinatewise minima.

The proof is provided in Appendix A.1. Lemma 1 plays a crucial role, as it ensures that the coordinatewise minimum we converge to in Theorem 1 is also a stationary point. In this work, we assume the regularizer to be continuously differentiable on the domain.

3.3 Approximation functions

In order to facilitate optimization algorithms, it is beneficial to work with approximation functions that majorize or approximate the objective function at a given point. One commonly used class of approximation functions is known as first-order majorization functions. These functions provide a convenient framework for constructing surrogates and facilitating optimization. We adopt the definition of first-order majorization functions from [41].

Definition 5.

[41, Assumption 1] A function $g(\bm{x},\bm{x}^{t})$ is said to be a first-order majorization of $f$ at the point $\bm{x}^{t}$ if it satisfies the following properties:

\begin{split}\text{A.1}\quad&g(\bm{x},\bm{x}^{t})\geq f(\bm{x})\quad\forall\bm{x},\bm{x}^{t}\in\mathcal{X},\\ \text{A.2}\quad&g(\bm{x}^{t},\bm{x}^{t})=f(\bm{x}^{t})\quad\forall\bm{x}^{t}\in\mathcal{X},\\ \text{A.3}\quad&g^{\prime}(\bm{x},\bm{x}^{t};\bm{d})\big|_{\bm{x}=\bm{x}^{t}}=f^{\prime}(\bm{x}^{t};\bm{d})\quad\forall\bm{d}\text{ such that }\bm{x}^{t}+\bm{d}\in\mathcal{X},\\ \text{A.4}\quad&g(\bm{x},\bm{x}^{t})\text{ is continuous in }(\bm{x},\bm{x}^{t}).\end{split}

It is worth noting that for continuously differentiable functions, the third statement can be equivalently expressed as $\nabla_{\bm{x}}g(\bm{x}^{t},\bm{x}^{t})=\nabla f(\bm{x}^{t})$ . The definition of first-order majorization functions resembles the concept of surrogate functions introduced in [37, Definition 2.2]; the additional requirement for a surrogate function is that $g(\bm{x},\bm{x}^{t})-f(\bm{x})$ is $L$ -gradient Lipschitz as defined in (5). Importantly, all majorization functions defined in Section 4.1 satisfy this condition and can thus also serve as surrogate functions.

Conveniently, majorization functions can be built term by term, leveraging their additivity property. This property allows us to combine multiple majorization functions to obtain a new majorization function.

Lemma 2.

First-order majorization functions are additive. If $g_{1}(\bm{x},\bm{x}^{t})$ and $g_{2}(\bm{x},\bm{x}^{t})$ majorize $f_{1}(\bm{x})$ and $f_{2}(\bm{x})$ at $\bm{x}^{t}$ , respectively, then $g_{1}(\bm{x},\bm{x}^{t})+g_{2}(\bm{x},\bm{x}^{t})$ majorizes $f(\bm{x})=f_{1}(\bm{x})+f_{2}(\bm{x})$ at $\bm{x}^{t}$ .

Proof.

Addition preserves each property of Definition 5. ∎

Lemma 2 provides a valuable tool for constructing majorization functions by combining simpler majorization functions. Additionally, when proving that a function is majorizing, it is often unnecessary to explicitly demonstrate the equality of partial derivatives or gradients at $\bm{x}^{t}$ . Instead, in the case of differentiable functions, it is typically sufficient to establish the first two properties (A.1 and A.2). According to [41, Proposition 1], properties A.3 and A.4 follow as a consequence. Intuitively, one can observe that the continuity of the gradient ensures that the majorization function $g$ shares the tangent spaces with $f$ at the point $\bm{x}^{t}$ .
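Lemma 2 and properties A.1 to A.3 can be checked numerically on a small example. The sketch below combines two standard majorizers that are assumed here for illustration and are not taken from the text: a Jensen-type bound for the Poisson data term $f_{1}(\bm{x})=-\sum_{i}b_{i}\log(\bm{a}_{i}^{\top}\bm{x})+\sum_{i}\bm{a}_{i}^{\top}\bm{x}$ , and the tangent of the concave term $f_{2}(\bm{x})=\sum_{j}\log(x_{j}+\alpha^{-1})$ . All names, sizes and parameter values are hypothetical.

```python
import numpy as np

alpha = 2.0  # assumed sparsity parameter

def f1(x, A, b):
    Ax = A @ x
    return -b @ np.log(Ax) + np.sum(Ax)

def g1(x, xt, A, b):
    """Jensen majorizer of f1 at xt (standard bound, assumed for illustration)."""
    P = A * xt / (A @ xt)[:, None]   # p_ij = a_ij xt_j / (a_i^T xt); rows sum to 1
    inner = np.sum(P * np.log(A * x / P), axis=1)
    return -b @ inner + np.sum(A @ x)

def f2(x):
    return np.sum(np.log(x + 1.0 / alpha))

def g2(x, xt):
    """Tangent of the concave f2 at xt, an upper bound by concavity."""
    return f2(xt) + np.sum((x - xt) / (xt + 1.0 / alpha))

def f(x, A, b):
    return f1(x, A, b) + f2(x)

def g(x, xt, A, b):
    return g1(x, xt, A, b) + g2(x, xt)

def num_grad(fun, x, step=1e-6):
    """Central finite-difference gradient."""
    grad = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = step
        grad[j] = (fun(x + e) - fun(x - e)) / (2.0 * step)
    return grad

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, (6, 3))
b = rng.poisson(4.0, 6).astype(float)
xt = rng.uniform(0.2, 2.0, 3)
x = rng.uniform(0.2, 2.0, 3)

print("A.1:", g(x, xt, A, b) >= f(x, A, b))
print("A.2:", np.isclose(g(xt, xt, A, b), f(xt, A, b)))
print("A.3:", np.allclose(num_grad(lambda v: g(v, xt, A, b), xt),
                          num_grad(lambda v: f(v, A, b), xt), atol=1e-5))
```

The sum of the two majorizers is separable in the entries of $\bm{x}$ , which is the property that typically makes closed-form subproblem solutions possible.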

3.4 Two-Block Successive Upper-bound Minimization (TBSUM)

The TBSUM algorithm is designed to solve the following problem:

\begin{split}\operatorname*{minimize}_{\bm{h},\bm{w}}~~&\mathcal{L}\left(\bm{w},\bm{h}\right)\\ \text{such that}~~&\bm{h}\in\mathcal{C}_{h},\bm{w}\in\mathcal{C}_{w}\end{split}

(9)

It relies on two first-order majorizing functions: $g_{w}(\bm{w},\bm{w}^{t},\bm{h}^{t})$ and $g_{h}(\bm{h},\bm{h}^{t},\bm{w}^{t})$ , which majorize $\mathcal{L}(\bm{w},\bm{h})$ at $\left(\bm{w}^{t},\bm{h}^{t}\right)$ for all $\bm{w}^{t}\in\mathcal{C}_{w}$ and $\bm{h}^{t}\in\mathcal{C}_{h}$ . The construction of these functions will be presented in Section 4. The TBSUM algorithm, outlined in Algorithm 1, alternates between minimizing $g_{w}$ and $g_{h}$ . It is assumed that the subproblem solutions are unique. Theorem 1 establishes the convergence of the TBSUM algorithm, which is a variant of the algorithm presented in [41, Theorem 2a] adapted for solving the specific problem at hand.

Algorithm 1 TBSUM: Two-Block Successive Upper-bound Minimization Algorithm

1: Initialize the variables to a feasible point $\bm{w}^{0}\in\mathcal{C}_{w}$ , $\bm{h}^{0}\in\mathcal{C}_{h}$ , and set $t=0$
2: repeat
3:     $\bm{w}^{t+1}\leftarrow\operatorname*{arg\,min}_{\bm{w}\in\mathcal{C}_{w}}g_{w}\left(\bm{w},\bm{w}^{t},\bm{h}^{t}\right)$
4:     $\bm{h}^{t+1}\leftarrow\operatorname*{arg\,min}_{\bm{h}\in\mathcal{C}_{h}}g_{h}\left(\bm{h},\bm{h}^{t},\bm{w}^{t+1}\right)$
5:     $t\leftarrow t+1$
6: until some convergence criterion is met

Theorem 1 (Convergence of TBSUM Algorithm 1).

Let $g_{w}\left(\bm{w},\bm{w}^{t},\bm{h}^{t}\right)$ and $g_{h}\left(\bm{h},\bm{h}^{t},\bm{w}^{t+1}\right)$ be two quasi-convex first-order majorizing functions of $\mathcal{L}\left(\bm{w},\bm{h}\right)$ at $\left(\bm{w}^{t},\bm{h}^{t}\right)$ for all $\left(\bm{w}^{t},\bm{h}^{t}\right)\in\mathcal{C}_{w}\times\mathcal{C}_{h}$ . Furthermore, assume that the two subproblems in the TBSUM Algorithm 1 have unique solutions for any points $\bm{w}^{t}\in\mathcal{C}_{w}$ , $\bm{h}^{t}\in\mathcal{C}_{h}$ . Then, every limit point $\bm{z}=\left[\bm{w},\bm{h}\right]$ of the iterates generated by the TBSUM Algorithm 1 is a coordinatewise minimum of (9). In addition, if $\mathcal{L}$ is regular at any point $\bm{z}\in\mathcal{C}$ , then $\bm{z}$ is a stationary point of (9).
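To make the alternation of Algorithm 1 concrete, the following is a minimal Python sketch, not the authors' implementation. The function names, the stopping rule, and the scalar toy problem are assumptions introduced for illustration only. In the toy problem, each surrogate is taken to be the loss itself restricted to one block, which is trivially a first-order majorizer with a unique minimizer.

```python
def tbsum(argmin_gw, argmin_gh, loss, w0, h0, max_iter=200, tol=1e-12):
    """Alternate the two surrogate minimizations of Algorithm 1.

    argmin_gw(w_t, h_t) returns argmin_w g_w(w, w_t, h_t) over C_w;
    argmin_gh(h_t, w_next) returns argmin_h g_h(h, h_t, w_next) over C_h.
    """
    w, h = w0, h0
    history = [loss(w, h)]
    for _ in range(max_iter):
        w = argmin_gw(w, h)   # step 3: update w with h fixed
        h = argmin_gh(h, w)   # step 4: update h with the new w
        history.append(loss(w, h))
        if history[-2] - history[-1] < tol:  # hypothetical stopping rule
            break
    return w, h, history


# Hypothetical toy problem:
# L(w, h) = (w*h - Y)^2 + (LAM/2) * (w^2 + h^2), with w >= EPS and h >= EPS.
Y, LAM, EPS = 2.0, 0.1, 1e-6


def toy_loss(w, h):
    return (w * h - Y) ** 2 + 0.5 * LAM * (w * w + h * h)


def toy_argmin_w(w_t, h_t):
    # Surrogate = L(., h_t) itself; its 1-D convex minimizer is clipped to C_w.
    return max(2.0 * h_t * Y / (2.0 * h_t ** 2 + LAM), EPS)


def toy_argmin_h(h_t, w_next):
    return max(2.0 * w_next * Y / (2.0 * w_next ** 2 + LAM), EPS)


w_hat, h_hat, history = tbsum(toy_argmin_w, toy_argmin_h, toy_loss, w0=1.0, h0=0.5)
```

Because each step minimizes a majorizer that touches the loss at the current iterate, the recorded loss values should never increase, which is the property the convergence argument builds on.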

4 Subproblem minimization

In this section, we focus on constructing the appropriate majorization functions $g_{w}\left(\bm{w},\bm{w}^{t},\bm{h}^{t}\right)$ and $g_{h}\left(\bm{h},\bm{h}^{t},\bm{w}^{t+1}\right)$ for our problem (3). Since we consider the same type of regularization for $\bm{w}$ and $\bm{h}$ , both subfunctions have the same form.

Practically, the loss function can be rewritten as

\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right){\color[rgb]{.75,.5,.25}+R\left(% \bm{W},\bm{H}\right)}=\sum_{j}f_{w}\left(\bm{w}_{j}\right)=\sum_{i}f_{h}\left(% \bm{h}_{i}\right)

where $\bm{w}_{j}$ is the $j^{th}$ row of $\bm{W}$ and $\bm{h}_{i}$ is the $i^{th}$ column of $\bm{H}$ . The functions $f_{w}$ and $f_{h}$ have the form:

f\left(\bm{x}\right)={\color[rgb]{0.0,0.5,0.0}-\sum_{i=1}^{m}\left(b_{i}\log\left(\bm{a}_{i}^{\top}\bm{x}\right)-\bm{a}_{i}^{\top}\bm{x}\right)}{\color[rgb]{0,0,1}+s_{L}\left(\bm{x}\right)}{\color[rgb]{1,.5,0}+s_{R}\left(\bm{x}\right)}{\color[rgb]{.75,0,.25}+\sum_{j=1}^{n}s_{C}\left(x_{j}\right)}.

(10)

Therefore, in this section, our objective is to find majorization functions for (10). Once this is done, we will provide closed-form solutions for the two minimization steps of Algorithm 1. It is worth noting that each term of (10) can be handled separately using the additivity property of majorizing functions (Lemma 2).

4.1 Majorizing functions

The following four lemmas provide majorizing functions for the different terms of our objective function. Proofs are provided in Appendix A.2.

In order to develop an efficient algorithm, our objective is to identify majorizing functions that result in sub-problems with tractable closed-form solutions. Often, this can be accomplished under two conditions: 1. all the majorizing functions are of the same form, and, 2. the majorization function is separable with respect to the variables $\bm{x}$ , i.e., $g(\bm{x})=\sum_{i}g_{i}(x_{i})$ . Within the scope of this contribution, we consider two forms of majorizing functions: quadratic $g(x)=a+bx+cx^{2}$ and logarithmic $g(x)=a+bx-c\log(x)$ .

First, we propose a majorization scheme for the logarithmic term $\log\left(\bm{a}_{i}^{\top}\bm{x}\right)$ in the objective function (10). We utilize a widely used majorization technique based on the concavity of the logarithm function. This technique has been employed in the original work by Lee and Seung [29] as well as in many EM (Expectation-Maximization) schemes.

Lemma 3 (Log majorization).

Assuming $\bm{a}\circ\bm{x}>0$ , for $\bm{x}\in\mathcal{C}$ , let us define $q_{j}=\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}$ for $\bm{x}^{t}\in\mathcal{C}$ , then $g\left(\bm{x},\bm{x}^{t}\right)=-\sum_{j}q_{j}\log\left(\frac{a_{j}x_{j}}{q_{j% }}\right)$ is a first order majorizing function of $f\left(\bm{x}\right)=-\log\left(\bm{a}^{\top}\bm{x}\right)=-\log\left(\sum_{j}% a_{j}x_{j}\right)$ .
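The two defining properties of Lemma 3 (equality at $\bm{x}^{t}$ and upper bound elsewhere) can be checked numerically. The sketch below is only an illustration: the vectors `a`, `x_t` and the test points are arbitrary positive values chosen here, not data from the paper.

```python
import math


def f_log(a, x):
    """f(x) = -log(a^T x)."""
    return -math.log(sum(aj * xj for aj, xj in zip(a, x)))


def g_log(a, x, x_t):
    """Separable surrogate of Lemma 3 built at the point x_t."""
    total = sum(aj * xj for aj, xj in zip(a, x_t))
    q = [aj * xj / total for aj, xj in zip(a, x_t)]  # weights q_j, summing to 1
    return -sum(qj * math.log(aj * xj / qj) for qj, aj, xj in zip(q, a, x))


a = [0.5, 2.0, 1.0]      # arbitrary positive coefficients (assumption)
x_t = [1.0, 0.3, 2.0]    # arbitrary positive expansion point (assumption)
points = [[0.1, 0.1, 0.1], [3.0, 0.2, 1.0], [1.0, 1.0, 1.0], x_t]

# Gap g - f at each test point: non-negative everywhere, zero at x_t.
gaps = [g_log(a, x, x_t) - f_log(a, x) for x in points]
```

The surrogate is a sum of one-dimensional terms in each $x_{j}$, which is what later allows coordinate-wise closed-form updates.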

We now proceed to majorize the different terms of the regularisation function $s_{L}\left(\bm{x}\right)$ , $s_{R}\left(\bm{x}\right)$ , and $s_{C}\left(x_{j}\right)$ . We can majorize any gradient-Lipschitz function using the following lemma.

Lemma 4 (Lipschitz-majorization).

Let $s_{L}\left(\bm{x}\right)$ be a gradient-Lipschitz function with constant $\sigma_{L}$ over the domain $\bm{x}\in\mathcal{C}$ . The functions

g_{1}\left(\bm{x},\bm{x}^{t}\right)=s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\sigma_{L}\|\bm{x}-\bm{x}^{t}\|_{2}^{2}

(11)

and

g_{2}\left(\bm{x},\bm{x}^{t}\right)=s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-% \bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+2\sigma_{L}\left(% \max_{j}x_{j}^{t}\right)\left(\sum_{j}x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j% }}\right)-x_{j}^{t}+x_{j}\right)

(12)

are first-order majorizing functions at $\bm{x}^{t}\in\mathcal{C}.$

We note that (11) (quadratic majorisation) is tighter than (12) (logarithmic majorisation). However, the looser majorisation function is needed to obtain a closed-form solution for the MU (see Section 4.2).

Next, the term that is relatively smooth can be majorized using the following lemma.

Lemma 5 (Relative smoothness majorization).

Let $s_{R}\left(\bm{x}\right)$ be a $\sigma_{R}$ relatively smooth function with respect to $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$ for $\bm{x}\in\mathcal{C}\subset\mathbb{R}_{+}^{n}.$ Then the function

g\left(\bm{x},\bm{x}^{t}\right)=s_{R}\left(\bm{x}^{t}\right)+\left\langle% \nabla s_{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle+\sigma_{R}% \sum_{i}^{n}\left(\frac{x_{i}}{x_{i}^{t}}-\log\left(\frac{x_{i}}{x_{i}^{t}}% \right)-1\right)

(13)

is a first order majorizing function of $s_{R}\left(\bm{x}\right)$ for $\bm{x}^{t}\in\mathcal{C}.$

Lemma 6 (Concave majorisation).

Given $s\left(x\right)$ a concave function defined on $x\in\mathcal{C}\subset\mathbb{R}$ , its linear approximation at the point $x^{t}$

g\left(x,x^{t}\right)=s\left(x^{t}\right)+\frac{\partial s\left(x^{t}\right)}{% \partial x}\left(x-x^{t}\right)

(14)

is a first order majorization function for $x^{t}\in\mathcal{C}.$
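As a quick illustration of Lemma 6, the sketch below compares a concave function with its tangent line on a grid. The choice $s(x)=\log(1+x)$, the expansion point and the grid are assumptions made here for the example; the paper does not prescribe a specific concave term.

```python
import math


def s(x):
    """Hypothetical concave penalty s(x) = log(1 + x), defined for x >= 0."""
    return math.log1p(x)


def ds(x):
    """Derivative of s."""
    return 1.0 / (1.0 + x)


def g(x, x_t):
    """Linear surrogate (14): tangent of s at x_t."""
    return s(x_t) + ds(x_t) * (x - x_t)


x_t = 0.8
grid = [0.01 * i for i in range(501)]          # x in [0, 5]
gaps = [g(x, x_t) - s(x) for x in grid]        # should be non-negative
```

Since the surrogate is linear in $x$, it only contributes a constant slope to the update rules, which is why the concave term appears solely through its derivative in (18) and (21).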

4.2 Subproblem updates

Now that we have defined majorizing functions for each term of (10), we can apply the additivity property of Lemma 2 to obtain a general majorizing function for $f(\bm{x})$ :

\begin{split}g\left(\bm{x},\bm{x}^{t}\right)=~&{\color[rgb]{0.0,0.5,0.0}-\sum_{i=1}^{m}\left(b_{i}\sum_{j}q_{ij}\log\left(\frac{a_{ij}x_{j}}{q_{ij}}\right)-\bm{a}_{i}^{\top}\bm{x}\right)}\\ &{\color[rgb]{0,0,1}+s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+2\sigma_{L}\left(\max_{j}x_{j}^{t}\right)\sum_{j}\left(x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j}}\right)-x_{j}^{t}+x_{j}\right)}\\ &{\color[rgb]{1,.5,0}+s_{R}\left(\bm{x}^{t}\right)+\left\langle\nabla s_{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle+\sigma_{R}\sum_{j}\left(\frac{x_{j}}{x_{j}^{t}}-\log\left(\frac{x_{j}}{x_{j}^{t}}\right)-1\right)}\\ &{\color[rgb]{.75,0,.25}+\sum_{j=1}^{n}\left(s_{C}\left(x_{j}^{t}\right)+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x_{j}}\left(x_{j}-x_{j}^{t}\right)\right)}\end{split}

(15)

where $q_{ij}=\frac{a_{ij}x_{j}^{t}}{\sum_{k}a_{ik}x_{k}^{t}}.$ We use the colors green, blue, orange, and purple to denote and keep track of the dependencies of the different terms in (10). Finding the minimum of the majorizing function will provide us with an update for Algorithm 1.

Proposition 1 (Generalized MU for (10)).

Assuming $\bm{x}^{t},\bm{x},\bm{b},\bm{A}>0$ , the first-order majorizing function defined in (15) is strictly convex, and its global minimum $\bm{x}^{t+1}$ is given by

x_{j}^{t+1}=x_{j}^{t}\frac{\alpha_{j}^{t}}{\beta_{j}^{t}}

(16)

where

\alpha_{j}^{t}={\color[rgb]{0.0,0.5,0.0}\sum_{i}b_{i}\frac{a_{ij}}{\sum_{k}a_{ik}x_{k}^{t}}}{\color[rgb]{0,0,1}+2\left(\max_{i}x_{i}^{t}\right)\sigma_{L}}{\color[rgb]{1,.5,0}+\frac{\sigma_{R}}{x_{j}^{t}}}

(17)

and

\beta_{j}^{t}={\color[rgb]{0.0,0.5,0.0}\sum_{i}a_{ij}}{\color[rgb]{0,0,1}+\nabla_{x_{j}}s_{L}\left(\bm{x}^{t}\right)+2\left(\max_{i}x_{i}^{t}\right)\sigma_{L}}{\color[rgb]{1,.5,0}+\nabla_{x_{j}}s_{R}\left(\bm{x}^{t}\right)+\frac{\sigma_{R}}{x_{j}^{t}}}{\color[rgb]{.75,0,.25}+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x}}.

(18)

The proof is provided in Appendix A.3.
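The update (16) with the coefficients (17) and (18) can be sketched in a few lines of plain Python. This is a hedged illustration, not the espm implementation: the function names, the `grad_reg` callable (assumed to return the summed gradients of $s_{L}$, $s_{R}$ and $s_{C}$) and the small test matrix are all assumptions. The data term uses the sign convention of (26), i.e. $\sum_{i}\left(\bm{a}_{i}^{\top}\bm{x}-b_{i}\log\left(\bm{a}_{i}^{\top}\bm{x}\right)\right)$.

```python
import math


def poisson_loss(A, b, x):
    """Data term sum_i (a_i^T x - b_i log(a_i^T x))."""
    total = 0.0
    for row, bi in zip(A, b):
        ax = sum(aij * xj for aij, xj in zip(row, x))
        total += ax - bi * math.log(ax)
    return total


def mu_step(A, b, x, grad_reg=None, sigma_L=0.0, sigma_R=0.0):
    """One generalized multiplicative update x_j <- x_j * alpha_j / beta_j."""
    m, n = len(A), len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    ratio = [b[i] / Ax[i] for i in range(m)]
    grad = grad_reg(x) if grad_reg is not None else [0.0] * n
    x_max = max(x)
    x_next = []
    for j in range(n):
        shared = 2.0 * x_max * sigma_L + sigma_R / x[j]   # common to alpha and beta
        alpha = sum(ratio[i] * A[i][j] for i in range(m)) + shared
        beta = sum(A[i][j] for i in range(m)) + grad[j] + shared
        x_next.append(x[j] * alpha / beta)
    return x_next


# Unregularized run on arbitrary positive data (assumption for illustration).
A = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
b = [2.0, 1.0, 3.0]
x = [1.0, 1.0]
losses = [poisson_loss(A, b, x)]
for _ in range(50):
    x = mu_step(A, b, x)
    losses.append(poisson_loss(A, b, x))
```

With all regularization terms switched off, `mu_step` reduces to the classical multiplicative rule, so the recorded losses should be non-increasing; a noiseless vector $\bm{b}=\bm{A}\bm{x}^{\star}$ makes $\bm{x}^{\star}$ a fixed point because $\alpha_{j}=\beta_{j}$ there.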

Generalization of the traditional MU Rule

We note that (16) serves as a generalization of the original Multiplicative Update (MU) rule presented in [29]. Removing the regularization terms (blue, orange, and purple) results in precisely the MU rule as outlined in [29].

Connection with (Block) Mirror Descent [16, Algorithm 1]

Another interesting observation is that the majorization of the relatively smooth term is done similarly to a Bregman proximal method [14]. Since the objective function $-\sum_{i=1}^{m}\left(b_{i}\log\left(\bm{a}_{i}^{\top}\bm{x}\right)-\bm{a}_{i}^{\top}\bm{x}\right)$ is relatively smooth, one could drop all terms except for $s_{R}$ and optimize using Block Bregman Proximal Gradient (BBPG) [50]. This would result in an algorithm very similar to Block Mirror Descent (BMD), which has recently been proposed for solving Poisson NMF [16]. Nevertheless, we advise against this solution, as discussed further in Section 5.2.

Alternative majorizing function and Quadratic Update (QU)

As shown experimentally in Section 6 and illustrated in Figure 2, having a majorization function as tight as possible leads to faster convergence of the algorithm. In Proposition 1, we deliberately choose to use a looser majorizing function for the term $s_{L}$ in order to recover an algorithm with a multiplicative update that generalizes the original approach from [29]. However, instead of using (12), one can also use (11) when constructing the majorizing function:

\begin{split}g\left(\bm{x},\bm{x}^{t}\right)=~&{\color[rgb]{0.0,0.5,0.0}-\sum_{i=1}^{m}\left(b_{i}\sum_{j}q_{ij}\log\left(\frac{a_{ij}x_{j}}{q_{ij}}\right)-\bm{a}_{i}^{\top}\bm{x}\right)}\\ &{\color[rgb]{0,0,1}+s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\sigma_{L}\left\|\bm{x}-\bm{x}^{t}\right\|_{2}^{2}}\\ &{\color[rgb]{1,.5,0}+s_{R}\left(\bm{x}^{t}\right)+\left\langle\nabla s_{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle+\sigma_{R}\sum_{j}\left(\frac{x_{j}}{x_{j}^{t}}-\log\left(\frac{x_{j}}{x_{j}^{t}}\right)-1\right)}\\ &{\color[rgb]{.75,0,.25}+\sum_{j=1}^{n}\left(s_{C}\left(x_{j}^{t}\right)+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x_{j}}\left(x_{j}-x_{j}^{t}\right)\right)}\end{split}

(19)

which is also a strictly convex function.

Proposition 2 (QU for (10)).

Assuming $\bm{x}^{t},\bm{x},\bm{b},\bm{A}>0$ , the first-order majorizing function defined in Equation (19) is strictly convex, and its global minimum $\bm{x}^{t+1}$ is given by

x_{j}^{t+1}=\frac{-\beta_{j}^{t}+\sqrt{\left(\beta_{j}^{t}\right)^{2}+4\alpha% \zeta_{j}^{t}}}{2\alpha}

(20)

where

\alpha={\color[rgb]{0,0,1}2\sigma_{L}},\qquad\beta_{j}^{t}={\color[rgb]{0.0,0.5,0.0}\sum_{i}a_{ij}}{\color[rgb]{0,0,1}+\nabla_{x_{j}}s_{L}\left(\bm{x}^{t}\right)-2\sigma_{L}x_{j}^{t}}{\color[rgb]{1,.5,0}+\nabla_{x_{j}}s_{R}\left(\bm{x}^{t}\right)+\frac{\sigma_{R}}{x_{j}^{t}}}{\color[rgb]{.75,0,.25}+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x}},\qquad\zeta_{j}^{t}={\color[rgb]{0.0,0.5,0.0}\sum_{i}b_{i}\frac{a_{ij}x_{j}^{t}}{\sum_{k}a_{ik}x_{k}^{t}}}{\color[rgb]{1,.5,0}+\sigma_{R}}.

(21)

The proof is provided in Appendix A.3. Both of these propositions lead to the update rules for our MU and QU algorithms detailed in Section 5. We also note that, with the appropriate assumptions, the update rules (16) and (20) preserve the positivity of the variable $\bm{x}$ . However, since we also wish to handle extra constraints, we develop a more rigorous approach in the next section.
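For comparison, the quadratic update (20) with the coefficients (21) can be sketched as follows. The example assumes a specific regularizer chosen here for illustration, $s_{L}(\bm{x})=\frac{\lambda}{2}\|\bm{x}\|_{2}^{2}$ with $\nabla s_{L}(\bm{x})=\lambda\bm{x}$ and $\sigma_{L}=\lambda>0$, and sets $s_{R}=s_{C}=0$. Function names and data are hypothetical, and the data term follows the sign convention of (26).

```python
import math


def objective(A, b, x, lam):
    """Poisson data term plus the assumed ridge regularizer (lam/2)*||x||^2."""
    data = 0.0
    for row, bi in zip(A, b):
        ax = sum(aij * xj for aij, xj in zip(row, x))
        data += ax - bi * math.log(ax)
    return data + 0.5 * lam * sum(xj * xj for xj in x)


def qu_step(A, b, x, lam):
    """One quadratic update: positive root of alpha*x^2 + beta_j*x - zeta_j = 0."""
    m, n = len(A), len(x)
    Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
    ratio = [b[i] / Ax[i] for i in range(m)]
    alpha = 2.0 * lam  # alpha = 2 * sigma_L, requires lam > 0
    x_next = []
    for j in range(n):
        # beta_j = sum_i a_ij + grad_j s_L(x^t) - 2 * sigma_L * x_j^t
        beta = sum(A[i][j] for i in range(m)) + lam * x[j] - 2.0 * lam * x[j]
        # zeta_j = sum_i b_i * a_ij * x_j^t / (a_i^T x^t)
        zeta = x[j] * sum(ratio[i] * A[i][j] for i in range(m))
        root = (-beta + math.sqrt(beta * beta + 4.0 * alpha * zeta)) / (2.0 * alpha)
        x_next.append(root)
    return x_next


A = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
b = [2.0, 1.0, 3.0]
lam = 0.5
x = [1.0, 1.0]
values = [objective(A, b, x, lam)]
for _ in range(30):
    x = qu_step(A, b, x, lam)
    values.append(objective(A, b, x, lam))
```

Since $\zeta_{j}^{t}>0$ for positive data and iterates, the selected root is strictly positive, which matches the remark that the update preserves positivity.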

4.3 Generalized simplex constraint

We need to handle two constraints: 1. the linear constraint $\bm{x}\geq\epsilon$ , and, 2. the scale constraint $\bm{e}^{\top}\bm{x}=1$ , where $\bm{e}\geq 0$ . While the first one is used to keep the variable non-negative, typically with a small strictly positive $\epsilon$ , the second one can set the scale of one of the variables ( $\bm{W}$ or $\bm{H}$ ) in the factorization problem. Furthermore, in the case $\bm{e}=\bm{1}$ , the simplex constraint is recovered. It turns out that the update rules (16) and (20) can easily be adapted to handle these constraints. The actual optimization problem we want to solve becomes:

\operatorname*{minimize~{}~{}}_{{\color[rgb]{.5,0,.5}\bm{x}\geq\epsilon}}f% \left(\bm{x}\right)\hskip 10.00002pt\text{such that}\hskip 10.00002pt{\color[% rgb]{0,1,1}\bm{e}^{\top}\bm{x}=1}.

where $f$ is given in (10).

To solve this problem, we use the KKT approach, i.e., we find points that satisfy the KKT (Karush-Kuhn-Tucker) conditions:

1. Stationarity: $\nabla_{\bm{x}}L\left(\dot{\bm{x}},\nu,\bm{\mu}\right)=\bm{0},$

2. Primal feasibility: ${\color[rgb]{0,1,1}\bm{e}^{\top}\dot{\bm{x}}-1=0}$ and ${\color[rgb]{.5,0,.5}\dot{\bm{x}}\geq\epsilon},$

3. Dual feasibility: ${\color[rgb]{.5,0,.5}\bm{\mu}\geq\bm{0}},$

4. Complementary slackness: ${\color[rgb]{.5,0,.5}\bm{\mu}^{\top}\left(-\dot{\bm{x}}+\epsilon\bm{1}\right)=0},$

where the Lagrangian is defined as:

L\left(\bm{x},\nu,\bm{\mu}\right)=f\left(\bm{x}\right){\color[rgb]{0,1,1}+\nu% \left(\bm{e}^{\top}\bm{x}-1\right)}{\color[rgb]{.5,0,.5}+\bm{\mu}^{\top}\left(% -\bm{x}+\epsilon\bm{1}\right)}.

We follow the same method as developed in Section 4.2, except that we majorize the Lagrangian $L\left(\bm{x},\nu,\bm{\mu}\right)$ . The resulting first-order majorizing function is given by:

g^{\prime}\left(\bm{x},\bm{x}^{t},\nu,\bm{\mu}\right)=g\left(\bm{x},\bm{x}^{t}% \right){\color[rgb]{0,1,1}+\nu\left(\bm{e}^{\top}\bm{x}-1\right)}{\color[rgb]{% .5,0,.5}+\bm{\mu}^{\top}\left(-\bm{x}+\epsilon\bm{1}\right)}

where $g\left(\bm{x},\bm{x}^{t}\right)$ is given in (15) or (19). We repeat the development of Section 4.2 (and the proofs of Appendix A.3). We end up with an update that is very similar to (16) or (20). In the MU case, we end up with:

x_{j}^{t+1}=x_{j}^{t}\frac{\alpha_{j}}{\beta_{j}{\color[rgb]{0,1,1}+\nu e_{j}}% {\color[rgb]{.5,0,.5}-\mu_{j}}},

where the only final difference consists of two terms in cyan and violet ( $\alpha_{j}$ and $\beta_{j}$ remain identical). For the QU, we stick to the same update rule (20), where only $\beta_{j}^{t}$ is modified:

\beta_{j}^{t\prime}=\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e_{j}}{\color[rgb]{% .5,0,.5}-\mu_{j}}.

This update rule ensures the first of the KKT conditions (stationarity). We now find $\nu,\bm{\mu}$ such that the second KKT condition holds (primal feasibility). It turns out that $\bm{\mu}$ does not need to be computed explicitly. In the MU case, $\mu_{j}$ is selected to be large enough such that

x_{j}^{t+1}=\max\left(x_{j}^{t}\frac{\alpha_{j}}{\beta_{j}{\color[rgb]{0,1,1}+\nu e_{j}}},\epsilon\right)=\frac{x_{j}^{t}\alpha_{j}}{\min\left({\color[rgb]{.5,0,.5}\frac{x_{j}^{t}\alpha_{j}}{\epsilon}},\beta_{j}{\color[rgb]{0,1,1}+\nu e_{j}}\right)}.

(22)

In the QU case, we obtain

\begin{split}x_{j}^{t+1}&=\max\left(\frac{-\left(\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e_{j}}\right)+\sqrt{\left(\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e_{j}}\right)^{2}+4\alpha\zeta_{j}^{t}}}{2\alpha},\epsilon\right)\\ &=\frac{-\min\left({\color[rgb]{.5,0,.5}\frac{\zeta_{j}^{t}}{\epsilon}-\epsilon\alpha},\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e_{j}}\right)+\sqrt{\left(\min\left({\color[rgb]{.5,0,.5}\frac{\zeta_{j}^{t}}{\epsilon}-\epsilon\alpha},\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e_{j}}\right)\right)^{2}+4\alpha\zeta_{j}^{t}}}{2\alpha}\end{split}

(23)

Note that dual feasibility and complementary slackness could be verified, but we leave them out for simplicity. We then need to find the value of $\nu$ such that $\bm{e}^{\top}\bm{x}=1$ , which is equivalent to searching for a root of

h_{1}\left(\nu\right)=\sum_{j}e_{j}\frac{x_{j}^{t}\alpha_{j}}{\min\left({\color[rgb]{.5,0,.5}\frac{x_{j}^{t}\alpha_{j}}{\epsilon}},\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e_{j}}\right)}-1=0

(24)

Similarly, for the quadratic update of (20), we search for $\nu$ that satisfies

h_{2}\left(\nu\right)=\sum_{j}e_{j}\frac{-\min\left({\color[rgb]{.5,0,.5}\frac% {\zeta_{j}^{t}}{\epsilon}-\epsilon\alpha},\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e% _{j}}\right)+\sqrt{\left(\min\left({\color[rgb]{.5,0,.5}\frac{\zeta_{j}^{t}}{% \epsilon}-\epsilon\alpha},\beta_{j}^{t}{\color[rgb]{0,1,1}+\nu e_{j}}\right)% \right)^{2}+4\alpha\zeta_{j}^{t}}}{2\alpha}-1=0.

(25)

There is no closed-form solution for $\nu$ ; however, the value can be found using a simple dichotomy search. Bounds for starting the dichotomy are computed in Appendix B.
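The dichotomy on $\nu$ for the MU case can be sketched as below. This is an illustration under explicit assumptions: $\bm{e}>0$ strictly, $\alpha_{j}>0$, $\bm{x}^{t}>0$ and $\epsilon\sum_{j}e_{j}<1$, so that $h_{1}$ is decreasing and changes sign. The bracketing heuristic used here is a simple expansion chosen for the example; it is not the set of bounds derived in Appendix B, and all names are hypothetical.

```python
def simplex_mu_update(x_t, alpha, beta, e, eps=1e-8, n_bisect=200):
    """Return (x_next, nu) with e^T x_next = 1 and x_next >= eps, via (22) and (24)."""
    assert all(ej > 0.0 for ej in e) and eps * sum(e) < 1.0

    def update(nu):
        # First form of (22); valid while every beta_j + nu * e_j is positive.
        return [max(xj * aj / (bj + nu * ej), eps)
                for xj, aj, bj, ej in zip(x_t, alpha, beta, e)]

    def h1(nu):
        return sum(ej * xj for ej, xj in zip(e, update(nu))) - 1.0

    # For nu > nu_open all denominators are positive and h1 is decreasing in nu.
    nu_open = -min(bj / ej for bj, ej in zip(beta, e))
    step = 1.0
    while h1(nu_open + step) < 0.0:   # shrink towards nu_open until h1 >= 0
        step *= 0.5
    lo = nu_open + step
    step = 1.0
    while h1(nu_open + step) > 0.0:   # expand until h1 <= 0
        step *= 2.0
    hi = nu_open + step
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if h1(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    return update(nu), nu


# Small hand-checkable case: sum_j x_j * alpha_j / (1 + nu) = 1 gives nu = 0.05.
x_next, nu = simplex_mu_update([0.2, 0.3, 0.5], [1.0, 2.0, 0.5],
                               [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
```

Each evaluation of $h_{1}$ costs one pass over the coordinates, which is the origin of the $c_{d}$ factor in the per-iteration complexity.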

Case $\epsilon=0$

Most of our reasoning relies on the fact that $x_{i}>0$ and, therefore, on the domain constraint $\epsilon>0$ . We have found that setting $\epsilon$ to a small non-zero value works well in practice. However, our approach can likely be generalized to the case where $\epsilon=0$ , following the approach of [33, Section 4], which studies the unregularized Poisson NMF case.

5 Algorithms for Poisson matrix factorisation

Equipped with the update rules developed in Sections 4.2 and 4.3, we are ready to tackle the general problem of this contribution, which consists of minimizing (2). Here we show the problem with the linear constraint on $\bm{H}$ ; by symmetry, a similar algorithm can be developed with the constraint on $\bm{W}$ :

\begin{split}\dot{\bm{W}},\dot{\bm{H}}=\operatorname*{arg\,min}_{\bm{W},\bm{H}}~&-\left\langle\bm{Y},\log\left(\bm{W}\bm{H}\right)\right\rangle+\left\langle\mathbf{1},\bm{W}\bm{H}\right\rangle+R_{W}\left(\bm{W}\right)+R_{H}\left(\bm{H}\right)\\ \text{such that}~&\bm{W}\geq\epsilon,\bm{H}\geq\epsilon,\bm{e}^{\top}\bm{H}=\bm{1}\end{split}

(26)

We first observe that with respect to each variable $\bm{W},\bm{H}$ , the problem is separable by column/row. For example, given $\bm{W}$ , finding the optimal $\bm{H}$ can be done for each column:

\dot{\bm{h}_{i}}=\operatorname*{arg\,min}_{\bm{h}_{i}}-\left\langle\bm{y}_{i},% \log\left(\bm{W}\bm{h}_{i}\right)\right\rangle+\left\langle\mathbf{1},\bm{W}% \bm{h}_{i}\right\rangle+r_{H}\left(\bm{h}_{i}\right)\hskip 10.00002pt\text{% such that}\hskip 10.00002pt\bm{h}_{i}\geq\epsilon,\bm{e}^{\top}\bm{h}_{i}=1

We therefore apply the TBSUM Algorithm 1, where all rows of $\bm{W}$ and all columns of $\bm{H}$ are updated independently, and obtain Algorithms 2 and 3. We note that the function $\max(\cdot,\epsilon)$ ensures that the solutions are $\geq\epsilon$ (and hence non-negative).

Algorithm 2 MU Algorithm for Regularized Poisson NMF

1: Initialize the variables $\bm{W}^{0}\geq\epsilon$ and $\bm{H}^{0}\geq\epsilon$ such that $\bm{e}^{\top}\bm{H}^{0}=\bm{1}$ , and set $t=0$

2: while the convergence criterion is not met do

3:   for each row $\bm{w}_{i}^{t\top}$ of $\bm{W}^{t}$ do

4:     Compute $\bm{\alpha}_{i}^{t\top}$ and $\bm{\beta}_{i}^{t\top}$ using (17) and (18).

5:     Update using the MU rule (22): $w_{ij}^{t+1}\leftarrow\max\left(w_{ij}^{t}\frac{\alpha_{ij}^{t}}{\beta_{ij}^{t}},\epsilon\right)$

6:   end for

7:   for each column $\bm{h}_{i}^{t}$ of $\bm{H}^{t}$ do

8:     Compute $\bm{\alpha}_{i}^{t}$ and $\bm{\beta}_{i}^{t}$ using (17) and (18).

9:     Find the dual variable $\nu_{i}$ by dichotomy of the function (24) (set $\nu_{i}=0$ if no constraint is present).

10:     Update using the MU rule (22): $h_{ij}^{t+1}\leftarrow\max\left(h_{ij}^{t}\frac{\alpha_{ij}^{t}}{\beta_{ij}^{t}+\nu_{i}e_{j}},\epsilon\right)$

11:   end for

12:   $t\leftarrow t+1$

13: end while
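A stripped-down version of Algorithm 2 can be written in plain Python to show the structure of the two inner loops. The sketch below makes strong simplifying assumptions that are not part of the paper's general setting: no regularization ($R_{W}=R_{H}=0$), no scale constraint (so $\nu_{i}=0$ and step 9 is skipped), a fixed number of iterations instead of a convergence test, and small dense lists instead of the espm data structures. All names and data are hypothetical.

```python
import math


def matmul(W, H):
    return [[sum(W[i][a] * H[a][j] for a in range(len(H)))
             for j in range(len(H[0]))] for i in range(len(W))]


def kl_loss(Y, W, H):
    """Poisson negative log-likelihood up to a constant: <1, WH> - <Y, log(WH)>."""
    WH = matmul(W, H)
    return sum(WH[i][j] - Y[i][j] * math.log(WH[i][j])
               for i in range(len(Y)) for j in range(len(Y[0])))


def mu_poisson_nmf(Y, W, H, n_iter=100, eps=1e-8):
    n, k, m = len(W), len(H), len(H[0])
    losses = [kl_loss(Y, W, H)]
    for _ in range(n_iter):
        # Rows of W: alpha_ia = sum_j Y_ij H_aj / (WH)_ij, beta_ia = sum_j H_aj.
        WH = matmul(W, H)
        W = [[max(W[i][a]
                  * sum(Y[i][j] * H[a][j] / WH[i][j] for j in range(m))
                  / sum(H[a]), eps)
              for a in range(k)] for i in range(n)]
        # Columns of H, using the freshly updated W:
        # alpha_aj = sum_i W_ia Y_ij / (WH)_ij, beta_aj = sum_i W_ia.
        WH = matmul(W, H)
        col_sums = [sum(W[i][a] for i in range(n)) for a in range(k)]
        H = [[max(H[a][j]
                  * sum(W[i][a] * Y[i][j] / WH[i][j] for i in range(n))
                  / col_sums[a], eps)
              for j in range(m)] for a in range(k)]
        losses.append(kl_loss(Y, W, H))
    return W, H, losses


Y = [[3.0, 1.0, 4.0, 1.0], [5.0, 9.0, 2.0, 6.0], [5.0, 3.0, 5.0, 8.0]]
W0 = [[1.0, 0.5], [0.3, 1.2], [0.8, 0.9]]
H0 = [[1.0, 2.0, 0.5, 1.5], [0.7, 0.4, 1.1, 2.0]]
W_hat, H_hat, losses = mu_poisson_nmf(Y, W0, H0)
```

Adding the regularization terms amounts to extending the numerators and denominators with the blue, orange and purple contributions of (17) and (18), and the scale constraint adds the search for $\nu_{i}$ before the update of each column of $\bm{H}$.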

Algorithm 3 QU Algorithm for Regularized Poisson NMF

1: Initialize the variables $\bm{W}^{0}\geq\epsilon$ and $\bm{H}^{0}\geq\epsilon$ such that $\bm{e}^{\top}\bm{H}^{0}=\bm{1}$ , and set $t=0$

2: while the convergence criterion is not met do

3:   for each row $\bm{w}_{i}^{t\top}$ of $\bm{W}^{t}$ do

4:     Compute $\alpha^{t}$ , $\bm{\beta}_{i}^{t\top}$ , and $\bm{\zeta}_{i}^{t\top}$ using (21).

5:     Update using the QU rule (23): $w_{ij}^{t+1}\leftarrow\max\left(\frac{-\beta_{ij}^{t}+\sqrt{\left(\beta_{ij}^{t}\right)^{2}+4\alpha^{t}\zeta_{ij}^{t}}}{2\alpha^{t}},\epsilon\right)$

6:   end for

7:   for each column $\bm{h}_{i}^{t}$ of $\bm{H}^{t}$ do

8:     Compute $\alpha^{t}$ , $\bm{\beta}_{i}^{t}$ , and $\bm{\zeta}_{i}^{t}$ using (21).

9:     Find the dual variable $\nu_{i}$ by dichotomy of the function (25) (set $\nu_{i}=0$ if no constraint is present).

10:     Update using the QU rule (23): $h_{ij}^{t+1}\leftarrow\max\left(\frac{-\left(\beta_{ij}^{t}+\nu_{i}e_{j}\right)+\sqrt{\left(\beta_{ij}^{t}+\nu_{i}e_{j}\right)^{2}+4\alpha^{t}\zeta_{ij}^{t}}}{2\alpha^{t}},\epsilon\right)$

11:   end for

12:   $t\leftarrow t+1$

13: end while

Convergence

The two update rules in steps 5 and 10 correspond to minimizing first-order strictly convex majorization functions. As a result, we can apply Theorem 1 to guarantee convergence towards a coordinate-wise minimum. It is important to note that this coordinate-wise minimum is also a stationary point, given that the objective function remains regular for any point $\bm{W}\in\mathcal{C}_{W},\bm{H}\in\mathcal{C}_{H}$ .

5.1 Algorithm complexity

Let’s examine the complexity of both Algorithms 2 and 3 when considering $\bm{W}\in\mathbb{R}^{n\times k}$ and $\bm{H}\in\mathbb{R}^{k\times m}$ . In each iteration, the following complexities are observed:
(a) Step 4 has a complexity of $\mathcal{O}\left(nmk\right)$ .
(b) Step 5 has a complexity of $\mathcal{O}\left(nk\right)$ .
(c) Step 8 has a complexity of $\mathcal{O}\left(nmk\right)$ .
(d) Step 9 has a complexity of $\mathcal{O}\left(c_{d}km\right)$ , where $c_{d}$ denotes the number of iterations performed by the dichotomy.
(e) Step 10 has a complexity of $\mathcal{O}\left(mk\right)$ .
Thus, the overall complexity per iteration can be expressed as $\mathcal{O}\left(nmk\right)+\mathcal{O}\left(c_{d}km\right)=\mathcal{O}\left(% \left(n+c_{d}\right)mk\right)$ . This indicates that the computational complexity per iteration is linear with respect to the problem size, i.e., $nm$ , multiplied by the number of components, i.e., $k$ .

Impact of the dichotomy

When $n$ is small, the computational cost of the dichotomy in step 9 becomes dominant. Nevertheless, in general, for larger values of $n$ , the impact of the dichotomy becomes negligible.

5.2 Tight Majorizing Functions

While we do not make any theoretical contributions concerning the speed of convergence of the algorithm, we want to emphasize the natural fact that tighter majorizing functions lead to faster convergence. Therefore, when evaluating an algorithm, we believe that the analysis of the underlying majorizing function is as insightful as the experimental evaluation. As an example, we could have used Block Mirror Descent to solve (3), as was done in [16, Algorithm 1]. This algorithm uses a Bregman Difference to create a majorization function for the subproblem. However, this would result in a much looser majorization function, which partly explains the slow convergence of this algorithm observed in [16]. This difference between majorization functions is exemplified in Figure 2.

Linesearch

By tightening the bounds we used to construct the surrogate function, we can develop a more efficient algorithm. Here, we apply a classic "linesearch" method to the functions $s_{L}$ . However, the same technique can be trivially applied to $s_{R}$ as well. First, in (12) or (11), replace the constant $\sigma_{L}$ with a parameter $\gamma$ and initialize it with $\sigma_{L}$ . Second, at each iteration, update the parameter $\gamma$ according to the following rule:

\gamma^{t+1}=\begin{cases}\upsilon\gamma^{t}&\text{if}\hskip 10.00002pts_{L}% \left(\bm{x}\right)\geq g\left(\bm{x},\bm{x}^{t},\gamma\right)\\ \frac{1}{\tau}\gamma^{t}&\text{otherwise.}\end{cases}

Here, $\upsilon$ and $\tau$ are two update rates that determine how fast $\gamma$ is updated. Choosing values that are too small for these parameters leads to an inefficient linesearch, while selecting values that are too large can result in strong oscillation patterns. Typical values for $\upsilon$ and $\tau$ range from 1.05 to 1.5. However, it is important to note that when using linesearch, we are not guaranteed to converge, as we might invalidate the assumptions of Theorem 1.
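The update rule for $\gamma$ is simple enough to state as a small helper. The sketch below is only an illustration: the function name and the default rates (1.2, inside the typical range quoted above) are choices made here, and the two arguments are assumed to be the regularizer value and the surrogate value at the newly computed point.

```python
def update_gamma(gamma, s_value, g_value, upsilon=1.2, tau=1.2):
    """Linesearch rule for the curvature parameter gamma.

    s_value: s_L evaluated at the new point.
    g_value: surrogate g(., x^t, gamma) evaluated at the same point.
    """
    if s_value >= g_value:
        # The surrogate failed to upper-bound s_L: make it more conservative.
        return upsilon * gamma
    # The bound held: try a tighter surrogate at the next iteration.
    return gamma / tau


gamma = 1.0
gamma_after_violation = update_gamma(gamma, s_value=2.0, g_value=1.5)
gamma_after_success = update_gamma(gamma, s_value=1.0, g_value=1.5)
```

In this form the rule is purely reactive: $\gamma$ is only corrected after a violation has been observed, which is consistent with the caveat that the convergence guarantee of Theorem 1 may no longer apply.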

6 Numerical Simulation

Problem

In this section, we analyze the speed of convergence of Algorithms 2 and 3 through numerical simulations. As a regularizer for $\bm{H}$ , we consider the Laplacian regularization $R(\bm{H},\lambda)=\frac{\lambda}{2}\text{tr}(\bm{H}^{T}\Delta\bm{H})$ , where $\Delta$ represents the two-dimensional Laplacian applied to each of the $k$ rows of $\bm{H}\in\mathbb{R}^{k\times p^{2}}$ , reshaped as $k$ images of size $p\times p$ . Since $R(\bm{H},\lambda)$ can be made arbitrarily small simply by reducing the amplitude of $\bm{H}$ , we add the simplex constraint $\bm{1}^{T}\bm{H}=\bm{1}$ . This leads to the following optimization problem:

\begin{split}\dot{\bm{W}},\dot{\bm{H}}=\arg\min_{\bm{W},\bm{H}}~&-\left\langle\bm{Y},\log(\bm{W}\bm{H})\right\rangle+\left\langle\mathbf{1},\bm{W}\bm{H}\right\rangle+\frac{\lambda}{2}\text{tr}(\bm{H}^{T}\Delta\bm{H})\\ \text{subject to}~&\bm{W}\geq\epsilon,\bm{H}\geq\epsilon,\bm{1}^{T}\bm{H}=\bm{1}\end{split}

This particular problem can be applied in various domains, such as Non-Negative Matrix Factorization for hyperspectral images [36] and remote sensing [32] (See Related Work Section 2 for more references and applications). Our algorithms and regularisations were specifically developed for the espm python package [51]. All algorithms and experiments can be found in the espm package.

Dataset

We construct two datasets consisting of 50 randomly drawn samples. In the first dataset, both matrices $\bm{W}$ and $\bm{H}$ are randomly generated from a uniform distribution. In the second dataset, each column of $\bm{W}$ corresponds to the sum of Gaussian functions that are randomly centered and scaled. The matrix $\bm{H}$ represents random smooth images. This second dataset is created using the espm package [51], where the toy model is used for $\bm{W}$ and $\bm{H}$ is generated using the "laplacian" weight type. This dataset is chosen because it can benefit from the Laplacian regularization on $\bm{H}$ .

Once $\bm{W}$ and $\bm{H}$ are generated, the noiseless matrix $\bm{Y}$ is obtained as $\bm{Y}=\bm{W}\bm{H}$ . We introduce noise by independently sampling each element $\tilde{Y}_{ij}\sim\frac{\text{Poisson}(\lambda Y_{ij})}{\lambda}$ , where $\lambda$ can be regarded as the noise control parameter. For all samples, we set $\bm{W}\in\mathbb{R}^{n\times k}$ and $\bm{H}\in\mathbb{R}^{k\times p^{2}}$ with $k=3$ , $p=64$ , and $n$ selected from the set $\{25,100,500,1000\}$ . Thus, the images in the dataset have dimensions of $64\times 64$ .
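The noise model can be sketched with the standard library alone. The Poisson sampler below uses Knuth's multiplication method, which is a choice made here for self-containedness and is only suitable for small rates; it is not claimed to be how the espm package generates its data. Function names, the seed and the toy matrix are assumptions.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's method (small rates only)."""
    if rate <= 0.0:
        return 0
    threshold = math.exp(-rate)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def add_poisson_noise(Y, lam, rng):
    """Return Y_tilde with entries Poisson(lam * Y_ij) / lam."""
    return [[sample_poisson(lam * y, rng) / lam for y in row] for row in Y]


rng = random.Random(0)            # fixed seed, for reproducibility of the sketch
Y_clean = [[2.0] * 2000]          # hypothetical noiseless matrix
lam = 5.0                         # larger lam means less relative noise
Y_noisy = add_poisson_noise(Y_clean, lam, rng)
empirical_mean = sum(Y_noisy[0]) / len(Y_noisy[0])
```

Dividing by $\lambda$ keeps the expectation of $\tilde{Y}_{ij}$ equal to $Y_{ij}$ while the variance scales as $Y_{ij}/\lambda$, which is why $\lambda$ acts as a noise control parameter.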

Results

We compare the performance of Algorithm 2 (MU), Algorithm 3 (QU), Block Mirror Descent (similar to [16, Algorithm 1]) and the projected gradient algorithm applied to (3). Figure 3 displays the convergence curves for 1000 iterations and $n=25$ . Although the overall complexity of all algorithms is the same, the time per iteration differs due to the different operations performed within each iteration and the time spent on dichotomy to compute the dual variable $\nu$ . Therefore, we provide the time in seconds for each algorithm in Figure 4 for various values of $n$ . All results are averaged over 50 repetitions.

Discussion

Let’s discuss the results in more detail:
- Number of iterations: Figure 3 illustrates the convergence behavior of the QU and MU algorithms. It is evident that QU converges faster per iteration than MU, which aligns with our expectations due to the tighter majorizing function used in QU. However, it is important to note that the introduction of the linesearch technique, while accelerating convergence, can lead to occasional instability, as indicated by occasional increases in the loss function. This observation supports our earlier discussion in Section 5.2. The challenge with Projected Gradient Descent is choosing the initial learning rate: if it is too large, the algorithm can diverge, and if it is too small, the algorithm is slow. Overall, since the learning rate cannot be selected optimally, the algorithm is slower than QU and MU. The Block Mirror Descent algorithm is also slower than QU and MU, which is consistent with the results of [16] and can be explained by the fact that the majorizing function used in the algorithm is looser.
- Time per iteration: Figure 4 presents the total time taken by the algorithms to complete 100 iterations. The results demonstrate that for small values of $n$ , the computation of the $p^{2}$ dual variables during the dichotomy process dominates the overall execution time. However, as $n$ increases, the time spent on dichotomy becomes negligible in comparison. These findings align with the complexity per iteration discussed in Section 5.1.

7 Conclusion

This contribution is the first to address the Poisson NMF problem with general regularization terms, such as Lipschitz functions, relatively smooth functions, or those expressed as linear constraints. We introduce two new algorithms and demonstrate their convergence to a coordinate-wise minimum, which is also a stationary point. Emphasizing the impact of the majorizing function choice on convergence speed, we validate our findings through numerical simulations. In essence, we believe that this work serves as a helpful guide for developing efficient algorithms suited for regularized Poisson NMF problems.

Appendix A Proofs

In this Appendix, we provide the different proofs used in the paper.

A.1 Proof of Lemma 1

We note that this lemma and its proof likely exist in the literature, but we were unable to find a reference. See Lemma 1.

Proof.

If $\mathcal{L}$ is continuously differentiable, the directional derivative can be written as $\mathcal{L}^{\prime}(\bm{z};\bm{d})=\bm{d}^{\top}\nabla\mathcal{L}(\bm{z})$ . At a coordinatewise minimum $\bm{z}$ , we have by definition:

\mathcal{L}^{\prime}\left(\bm{z};\left[\bm{d}_{w},\bm{0}\right]\right)=\nabla% \mathcal{L}\left(\bm{z}\right)\left[\bm{d}_{w},\bm{0}\right]^{\top}\geq 0

and

\mathcal{L}^{\prime}\left(\bm{z};\left[\bm{0},\bm{d}_{h}\right]\right)=\nabla% \mathcal{L}\left(\bm{z}\right)\left[\bm{0},\bm{d}_{h}\right]^{\top}\geq 0.

Therefore,

	$\displaystyle\mathcal{L}^{\prime}(\bm{z};\bm{d})$	$\displaystyle=\nabla\mathcal{L}(\bm{z})[\bm{d}_{w},\bm{d}_{h}]^{\top}$
		$\displaystyle=\nabla\mathcal{L}(\bm{z})([\bm{d}_{w},\bm{0}]^{\top}+[\bm{0},\bm% {d}_{h}]^{\top})$
		$\displaystyle=\nabla\mathcal{L}(\bm{z})[\bm{d}_{w},\bm{0}]^{\top}+\nabla% \mathcal{L}(\bm{z})[\bm{0},\bm{d}_{h}]^{\top}$
		$\displaystyle\geq 0.$

∎

A.2 Proof of majorizing functions

See Lemma 3.

Proof.

First let us observe that

	$\displaystyle g\left(\bm{x}^{t},\bm{x}^{t}\right)$	$\displaystyle=-\sum_{j}\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}\log\left(% \frac{a_{j}x_{j}^{t}}{\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}}\right)$
		$\displaystyle=-\sum_{j}\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}\log\left(% \sum_{k}a_{k}x_{k}^{t}\right)$
		$\displaystyle=-\log\left(\sum_{k}a_{k}x_{k}^{t}\right)=f\left(\bm{x}^{t}\right).$

This establishes the first property. For the second property, $g\left(\bm{x},\bm{x}^{t}\right)\geq f\left(\bm{x}\right)$ , we use the convexity of the $-\log$ function (Jensen's inequality):

-\log\left(\sum_{j}a_{j}x_{j}\right)=-\log\left(\sum_{j}q_{j}\frac{a_{j}x_{j}}{q_{j}}\right)\leq-\sum_{j}q_{j}\log\left(\frac{a_{j}x_{j}}{q_{j}}\right)=g\left(\bm{x},\bm{x}^{t}\right),

where we set $q_{j}=\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}$ . Finally, by continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ properties of majorizing functions. ∎
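The two properties shown in this proof can be checked numerically. The following is a minimal sketch, assuming random positive vectors $\bm{a}$ , $\bm{x}$ and $\bm{x}^{t}$ ; the helper names `f` and `g` and the sampling ranges are illustrative choices, not part of the paper.

```python
import math
import random


def f(a, x):
    """f(x) = -log(sum_j a_j x_j)."""
    return -math.log(sum(aj * xj for aj, xj in zip(a, x)))


def g(a, x, xt):
    """Majorizer g(x, x^t) = -sum_j q_j log(a_j x_j / q_j)."""
    total = sum(aj * xtj for aj, xtj in zip(a, xt))
    value = 0.0
    for aj, xj, xtj in zip(a, x, xt):
        qj = aj * xtj / total  # weights q_j: positive and summing to one
        value -= qj * math.log(aj * xj / qj)
    return value


rng = random.Random(0)
a = [rng.uniform(0.1, 2.0) for _ in range(5)]
xt = [rng.uniform(0.1, 2.0) for _ in range(5)]

# First property: the majorizer is tight at the expansion point x^t.
gap_at_xt = abs(g(a, xt, xt) - f(a, xt))

# Second property: g(x, x^t) - f(x) >= 0 for random positive x.
samples = ([rng.uniform(0.1, 2.0) for _ in range(5)] for _ in range(1000))
min_gap = min(g(a, x, xt) - f(a, x) for x in samples)

print(gap_at_xt, min_gap)
```

Such a check does not replace the proof; it only guards against sign or indexing mistakes when implementing the majorizer.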

We now proceed to majorize $s_{L}\left(\bm{x}\right)$ , $s_{R}\left(\bm{x}\right)$ , and $s_{C}\left(x_{j}\right)$ . See Lemma 4.

Proof.

First it can be trivially observed that

s_{L}\left(\bm{x}^{t}\right)=g_{1}\left(\bm{x}^{t},\bm{x}^{t}\right)=g_{2}% \left(\bm{x}^{t},\bm{x}^{t}\right),

which satisfies the first property. We then take the $1^{\text{st}}$ order Taylor expansion of $s_{L}$ around $\bm{x}^{t}$ and find

s_{L}\left(\bm{x}\right)=s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\mathcal{R}\left(\bm{x},\bm{x}^{t}\right)

where $\mathcal{R}\left(\bm{x},\bm{x}^{t}\right)\leq\sigma_{L}\|\bm{x}-\bm{x}^{t}\|_{2}^{2}$ since the function $s_{L}$ is gradient Lipschitz with constant $\sigma_{L}$ . Therefore, we have

$\displaystyle s_{L}\left(\bm{x}\right)$	$\displaystyle\leq s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\sigma_{L}\|\bm{x}-\bm{x}^{t}\|_{2}^{2}=g_{1}\left(\bm{x},\bm{x}^{t}\right)$
	$\displaystyle\leq s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+2\sigma_{L}\left(\max_{j}x_{j}^{t}\right)\left(\sum_{j}x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j}}\right)-x_{j}^{t}+x_{j}\right)$	(27)
	$\displaystyle=g_{2}\left(\bm{x},\bm{x}^{t}\right),$

where (27) will be shown later in this proof. By continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ property of majorizing functions. We now need to prove (27) and reformulate it as

\|\bm{x}-\bm{x}^{t}\|_{2}^{2}\leq 2\left(\max_{j}x_{j}^{t}\right)\left(\sum_{j}x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j}}\right)-x_{j}^{t}+x_{j}\right)=2\left(\max_{j}x_{j}^{t}\right)D_{GKL}\left(\bm{x}\|\bm{x}^{t}\right).	(28)

For simplicity, let us define the function

q\left(\bm{x}\right)=\sum_{j}x_{j}\log\left(x_{j}\right),

with the gradient $\nabla_{x_{j}}q\left(\bm{x}\right)=\log\left(x_{j}\right)+1$ and the Hessian

\bm{H}_{ij}^{q}\left(\bm{x}\right)=\frac{\partial^{2}q}{\partial x_{i}\partial x_{j}}\left(\bm{x}\right)=\begin{cases}\frac{1}{x_{j}}&\text{if }i=j\\ 0&\text{otherwise.}\end{cases}

Note that $q$ is a strictly convex function for $\bm{x}>0$ . We expand the generalized KL divergence:

$\displaystyle D_{GKL}\left(\bm{x}\|\bm{x}^{t}\right)$	$\displaystyle=\sum_{j}x_{j}^{t}\log\left(x_{j}^{t}\right)-\sum_{j}x_{j}^{t}\log\left(x_{j}\right)-\sum_{j}x_{j}^{t}+\sum_{j}x_{j}$
	$\displaystyle=\sum_{j}x_{j}^{t}\log\left(x_{j}^{t}\right)-\sum_{j}x_{j}\log% \left(x_{j}\right)-\sum_{j}\left(\log\left(x_{j}\right)+1\right)\left(x_{j}^{t% }-x_{j}\right)$
	$\displaystyle=q\left(\bm{x}^{t}\right)-\left(q\left(\bm{x}\right)+\nabla q% \left(\bm{x}\right)^{\top}\left(\bm{x}^{t}-\bm{x}\right)\right)$
	$\displaystyle=\frac{1}{2}\left(\bm{x}^{t}-\bm{x}\right)^{T}\bm{H}^{q}\left(% \tilde{\bm{x}}\right)\left(\bm{x}^{t}-\bm{x}\right),$	(29)

where $\tilde{\bm{x}}$ is selected such that the last equality holds. By Taylor's theorem with the Lagrange form of the remainder, such a point exists on the segment between $\bm{x}$ and $\bm{x}^{t}$ , i.e., $\tilde{\bm{x}}=\rho\bm{x}+\left(1-\rho\right)\bm{x}^{t}$ for some $\rho\in[0,1].$ Now we bound the Hessian as

\bm{H}^{q}\left(\bm{x}\right)\geq\frac{1}{\max_{j}x_{j}}\bm{I}

and, introducing this inequality in (29), we obtain

D_{GKL}\left(\bm{x}\|\bm{x}^{t}\right)\geq\frac{1}{2\max_{j}x_{j}}\|\bm{x}-\bm% {x}^{t}\|_{2}^{2},

which is equivalent to (28) and completes the proof. ∎
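The expansion leading to (29) says that $D_{GKL}$ , as written in (28), is the Bregman divergence generated by the negative entropy $q$ . The sketch below checks only that identity numerically; the function names (`d_gkl`, `q`, `bregman_q`) and the random test vectors are assumptions made for illustration.

```python
import math
import random


def d_gkl(x, xt):
    """D_GKL as written in (28): sum_j x_j^t log(x_j^t / x_j) - x_j^t + x_j."""
    return sum(t * math.log(t / v) - t + v for v, t in zip(x, xt))


def q(x):
    """Negative entropy q(x) = sum_j x_j log(x_j)."""
    return sum(v * math.log(v) for v in x)


def bregman_q(x, xt):
    """q(x^t) - q(x) - <grad q(x), x^t - x>, with grad_j q(x) = log(x_j) + 1."""
    inner = sum((math.log(v) + 1.0) * (t - v) for v, t in zip(x, xt))
    return q(xt) - q(x) - inner


rng = random.Random(1)
max_err = 0.0
min_div = float("inf")
for _ in range(1000):
    x = [rng.uniform(0.05, 3.0) for _ in range(4)]
    xt = [rng.uniform(0.05, 3.0) for _ in range(4)]
    max_err = max(max_err, abs(d_gkl(x, xt) - bregman_q(x, xt)))
    min_div = min(min_div, d_gkl(x, xt))

print(max_err, min_div)
```

Because $q$ is convex, the divergence is non-negative, which the second printed value reflects.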

See Lemma 5.

Proof.

The first property $g\left(\bm{x}^{t},\bm{x}^{t}\right)=s_{R}\left(\bm{x}^{t}\right)$ can be trivially verified. Then, by the definition of a relatively smooth function, we have

s_{R}\left(\bm{x}\right)\leq s_{R}\left(\bm{x}^{t}\right)+\left\langle\nabla s% _{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle+\sigma_{R}\mathcal{% B}_{\kappa}\left(\bm{x},\bm{x}^{t}\right)

where

\mathcal{B}_{\kappa}\left(\bm{x},\bm{x}^{t}\right):=\kappa\left(\bm{x}\right)-% \kappa\left(\bm{x}^{t}\right)-\left\langle\nabla\kappa\left(\bm{x}^{t}\right),% \bm{x}-\bm{x}^{t}\right\rangle.

Given that $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$ , we compute

\mathcal{B}_{\kappa}\left(\bm{x},\bm{x}^{t}\right)=\sum_{i}^{n}\left(\frac{x_{% i}}{x_{i}^{t}}-\log\left(\frac{x_{i}}{x_{i}^{t}}\right)-1\right)

Finally, by continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ property of majorizing functions. ∎
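As a sanity check of the closed form used in this proof, the sketch below compares the generic definition of $\mathcal{B}_{\kappa}$ with the expression obtained for $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$ . The function names and test vectors are illustrative assumptions.

```python
import math
import random


def kappa(x):
    """Reference function kappa(x) = -sum_i log(x_i)."""
    return -sum(math.log(v) for v in x)


def bregman_kappa(x, xt):
    """Generic definition: kappa(x) - kappa(x^t) - <grad kappa(x^t), x - x^t>."""
    grad = [-1.0 / t for t in xt]  # gradient of kappa at x^t
    inner = sum(gi * (v - t) for gi, v, t in zip(grad, x, xt))
    return kappa(x) - kappa(xt) - inner


def closed_form(x, xt):
    """Closed form: sum_i (x_i / x_i^t - log(x_i / x_i^t) - 1)."""
    return sum(v / t - math.log(v / t) - 1.0 for v, t in zip(x, xt))


rng = random.Random(2)
max_err = 0.0
for _ in range(1000):
    x = [rng.uniform(0.05, 3.0) for _ in range(4)]
    xt = [rng.uniform(0.05, 3.0) for _ in range(4)]
    max_err = max(max_err, abs(bregman_kappa(x, xt) - closed_form(x, xt)))

print(max_err)
```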

See Lemma 6.

Proof.

One can simply observe $s\left(x^{t}\right)=g\left(x^{t},x^{t}\right)$ . Then by concavity, we have

s\left(x_{i}\right)\leq s\left(x_{i}^{t}\right)+\frac{\partial s\left(x_{i}^{t% }\right)}{\partial x_{i}}\left(x_{i}-x_{i}^{t}\right)

Finally, by continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ property of majorizing functions. ∎

A.3 Proof of subproblem updates

In this subsection, we present the proof of the subproblem updates used in the MU and QU algorithms. Let us start with the MU updates. See Proposition 1.

Proof.

Assuming $a_{ij},b_{i}>0$ , (10) is strictly convex, as the green term is strictly convex, and the remaining terms are convex. Consequently, (10) possesses a global minimum. To identify this minimum, we seek the stationary point $\nabla_{\bm{x}}g=\bm{0}$ . Due to our meticulous selection of majorizing functions, this subproblem becomes separable. When computing the gradient with respect to the variable $x_{j}$ , we obtain:

$\displaystyle\nabla_{x_{j}}g\left(\bm{x},\bm{x}^{t}\right)$	$\displaystyle={\color[rgb]{0.0,0.5,0.0}-\frac{1}{x_{j}}\sum_{i}b_{i}\frac{a_{% ij}x_{j}^{t}}{\sum_{k}a_{ik}x_{k}^{t}}+\sum_{i}a_{ij}}$
	$\displaystyle{\color[rgb]{0,0,1}+\nabla_{x_{j}}s_{L}\left(\bm{x}^{t}\right)+2% \left(\max_{i}x_{i}^{t}\right)\sigma_{L}\left(1-\frac{x_{j}^{t}}{x_{j}}\right)}$
	$\displaystyle{\color[rgb]{1,.5,0}+\nabla_{x_{j}}s_{R}\left(\bm{x}^{t}\right)+% \sigma_{R}\left(\frac{1}{x_{j}^{t}}-\frac{1}{x_{j}}\right)}$
	$\displaystyle{\color[rgb]{.75,0,.25}+\frac{\partial s_{C}\left(x_{j}^{t}\right% )}{\partial x}}=0$	(30)

where we assume that $x_{j},x_{j}^{t}>0$ since $0\notin\mathcal{C}$ . Transforming the above expression, we find a multiplicative update rule (16) for $x_{j}$ . ∎
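To make the last step concrete, the sketch below solves (30) for each $x_{j}$ by collecting the terms in $1/x_{j}$ . It is derived directly from (30) and may be arranged differently from (16); the function name `mu_step`, its argument names, and the small demo problem are assumptions. The constraint $x_{j}\geq\epsilon$ is not handled, and the denominator is assumed to be positive.

```python
import math


def mu_step(A, b, xt, grad_sl=None, grad_sr=None, grad_sc=None,
            sigma_l=0.0, sigma_r=0.0):
    """One multiplicative update: the root of (30) for every coordinate x_j."""
    n = len(xt)
    zeros = [0.0] * n
    grad_sl = grad_sl or zeros  # gradient of s_L at x^t
    grad_sr = grad_sr or zeros  # gradient of s_R at x^t
    grad_sc = grad_sc or zeros  # derivative of s_C at x_j^t
    Ax = [sum(aij * xj for aij, xj in zip(row, xt)) for row in A]
    m = max(xt)
    x_new = []
    for j in range(n):
        ratio = sum(b[i] * A[i][j] / Ax[i] for i in range(len(A)))
        # Terms of (30) multiplying 1/x_j (moved to the other side).
        num = xt[j] * (ratio + 2.0 * m * sigma_l) + sigma_r
        # Terms of (30) that do not depend on x_j.
        den = (sum(row[j] for row in A) + grad_sl[j] + 2.0 * m * sigma_l
               + grad_sr[j] + sigma_r / xt[j] + grad_sc[j])
        x_new.append(num / den)
    return x_new


def kl_loss(A, b, x):
    """Unregularized Poisson loss sum_i (Ax)_i - b_i log((Ax)_i)."""
    Ax = [sum(aij * xj for aij, xj in zip(row, x)) for row in A]
    return sum(v - bi * math.log(v) for v, bi in zip(Ax, b))


A_demo = [[1.0, 0.5], [0.2, 1.5], [0.7, 0.3]]
b_demo = [2.0, 1.0, 3.0]
x = [1.0, 2.0]
losses = [kl_loss(A_demo, b_demo, x)]
for _ in range(50):
    x = mu_step(A_demo, b_demo, x)  # no regularization: classical MU rule
    losses.append(kl_loss(A_demo, b_demo, x))
print(losses[0], losses[-1])
```

Without regularization the update reduces to the classical multiplicative rule, and the recorded loss should not increase from one iteration to the next.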

The proof of the QU updates is similar to the MU updates, except that we need to solve a quadratic equation to obtain a closed-form solution. See Proposition 2.

Proof.

Assuming $a_{ij},b_{i}>0$ , Equation (19) is strictly convex. This convexity arises from the strict convexity of the green term, coupled with the convexity of the other terms. Consequently, (19) possesses a global minimum. To identify this minimum, we seek the stationary point by computing $\nabla_{\bm{x}}g=\bm{0}$ :

$\displaystyle\nabla_{x_{j}}g\left(\bm{x},\bm{x}^{t}\right)$	$\displaystyle={\color[rgb]{0.0,0.5,0.0}-\frac{1}{x_{j}}\sum_{i}b_{i}\frac{a_{% ij}x_{j}^{t}}{\sum_{k}a_{ik}x_{k}^{t}}+\sum_{i}a_{ij}}$
	$\displaystyle{\color[rgb]{0,0,1}+\nabla_{x_{j}}s_{L}\left(\bm{x}^{t}\right)+2% \sigma_{L}\left(x_{j}-x_{j}^{t}\right)}$
	$\displaystyle{\color[rgb]{1,.5,0}+\nabla_{x_{j}}s_{R}\left(\bm{x}^{t}\right)+% \sigma_{R}\left(\frac{1}{x_{j}^{t}}-\frac{1}{x_{j}}\right)}$
	$\displaystyle{\color[rgb]{.75,0,.25}+\frac{\partial s_{C}\left(x_{j}^{t}\right% )}{\partial x}=0}$	(31)

We observe that the problem is separable and that, after multiplying by $x_{j}>0$ , each equation in (31) becomes a quadratic equation in the single variable $x_{j}$ , hence the name Quadratic Update (QU). It can be rewritten as

\alpha x_{j}^{2}+\beta_{j}^{t}x_{j}-\zeta_{j}^{t}=0,

where $\alpha$ , $\beta_{j}^{t}$ , and $\zeta_{j}^{t}$ are given in (21). Assuming $\zeta_{j}^{t}\neq 0$ , we have $\zeta_{j}^{t}>0$ and $4\alpha\zeta_{j}^{t}>0$ . Therefore, the discriminant $\left(\beta_{j}^{t}\right)^{2}+4\alpha\zeta_{j}^{t}$ is positive and the previous quadratic equation has two real solutions. Since $\sqrt{\left(\beta_{j}^{t}\right)^{2}+4\alpha\zeta_{j}^{t}}>\left|\beta_{j}^{t}\right|$ , they are of opposite sign. Due to the constraint $x_{j}\geq\epsilon>0$ , we select the positive one, leading to the update rule of (20) for $x_{j}$ . ∎
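The selection of the positive root can be sketched as follows. The code assumes $\alpha>0$ and $\zeta_{j}^{t}>0$ and takes the coefficients as plain numbers, since their exact expressions are those of (21); the function name `qu_positive_root` and the rationalized formula used for $\beta_{j}^{t}>0$ are implementation choices, not taken from the paper.

```python
import math


def qu_positive_root(alpha, beta, zeta):
    """Positive root of alpha * x**2 + beta * x - zeta = 0 (alpha > 0, zeta > 0)."""
    disc = math.sqrt(beta * beta + 4.0 * alpha * zeta)
    if beta > 0:
        # Algebraically equal to (-beta + disc) / (2 * alpha), but it avoids
        # the cancellation between -beta and disc when beta is large.
        return 2.0 * zeta / (beta + disc)
    return (-beta + disc) / (2.0 * alpha)


roots = {
    beta: qu_positive_root(1.0, beta, 0.5)
    for beta in (-3.0, -0.1, 0.0, 0.1, 3.0, 1e8)
}
for beta, root in roots.items():
    residual = 1.0 * root * root + beta * root - 0.5
    print(beta, root, residual)
```

The clipping at $\epsilon$ required by the constraint $x_{j}\geq\epsilon$ is deliberately left out of this sketch.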

Appendix B Computation of Lower and Upper Bounds for Dichotomy

In Section 4.3, we introduced modifications to the MU and QU algorithms to incorporate the positivity constraint $\bm{x}\geq\epsilon$ and the linear constraint $\bm{e}^{\top}\bm{x}=1$ . However, solving for the dual parameter $\nu$ in equations (24) for MU or (25) for QU is intractable. To address this, we propose using the dichotomy method to solve for $h(\nu)=0$ . Therefore, this appendix provides the computation of lower bound $\nu_{\text{low}}$ and upper bound $\nu_{\text{up}}$ such that $h(\nu_{\text{low}})<0$ and $h(\nu_{\text{up}})>0$ . These bounds will serve as convenient initializations for the dichotomy algorithm.

B.1 Case 1: MU

For MU, we aim to solve equation (24) for $\nu$ :

h_{1}\left(\nu\right)=\sum_{j}e_{j}\frac{x_{j}^{t}\alpha_{j}}{\min\left(\frac{% x_{j}^{t}\alpha_{j}}{\epsilon},\beta_{j}^{t}+\nu e_{j}\right)}-1=0

Terms where $e_{j}=0$ can be ignored since they do not contribute to the sum. Assuming $\epsilon>0$ , the function $h_{1}$ is well-defined for $\nu\in\mathbb{R}$ . Since $x_{j}^{t}\alpha_{j}>0$ , we have $h_{1}\left(\nu\right)=\frac{\|\bm{e}\|_{1}}{\epsilon}-1$ for $\nu\leq\nu_{\lim}=\min_{j}\left(\frac{\frac{x_{j}^{t}\alpha_{j}}{\epsilon}-% \beta_{j}^{t}}{e_{j}}\right)$ , and $h_{1}$ is monotonically decreasing for $\nu\in\left[\nu_{\lim},\infty\right[$ . Assuming $\frac{\|\bm{e}\|_{1}}{\epsilon}-1\geq 0$ (which ensures the feasibility of the constraints $\bm{x}\geq\epsilon$ and $\bm{e}^{\top}\bm{x}=1$ ), we have $h_{1}(\nu_{\text{lim}})\geq 0$ and $\lim_{\nu\to\infty}h_{1}(\nu)=-1$ . Thus, there exists exactly one root for the function $h_{1}$ .

Negative bound

First, let’s find $\nu_{\text{low}}$ such that $h_{1}(\nu)<0$ for $\nu_{\text{low}}\leq\nu<\infty$ . We can bound $h_{1}$ as follows:

	$\displaystyle h_{1}\left(\nu\right)$	$\displaystyle=\sum_{j}\frac{x_{j}^{t}\alpha_{j}}{\min\left(\frac{x_{j}^{t}% \alpha_{j}}{\epsilon e_{j}},\frac{\beta_{j}^{t}}{e_{j}}+\nu\right)}-1$
		$\displaystyle\leq n\frac{\max_{j}x_{j}^{t}\alpha_{j}}{\min_{j}\min\left(\frac{% x_{j}^{t}\alpha_{j}}{\epsilon e_{j}},\frac{\beta_{j}^{t}}{e_{j}}+\nu\right)}-1<0$

Therefore one possible bound is

\nu_{\text{low}}=n\max_{j}x_{j}^{t}\alpha_{j}-\min_{j}\frac{\beta_{j}^{t}}{e_{% j}}.

Positive bound

Similarly, let’s find $\nu_{\text{up}}$ such that $h_{1}(\nu)\geq 0$ for $\nu_{\text{up}}\geq\nu\geq\nu_{\text{lim}}$ . Note that $\nu_{\text{lim}}$ is not a good bound when $\epsilon$ is small. We can bound $h_{1}$ as follows:

	$\displaystyle h_{1}\left(\nu\right)$	$\displaystyle=\sum_{j=1}^{n}\frac{x_{j}^{t}\alpha_{j}}{\min\left(\frac{x_{j}^{% t}\alpha_{j}}{\epsilon e_{j}},\frac{\beta_{j}^{t}}{e_{j}}+\nu\right)}-1$
		$\displaystyle\geq\max_{j}\frac{x_{j}^{t}\alpha_{j}}{\min\left(\frac{x_{j}^{t}% \alpha_{j}}{\epsilon e_{j}},\frac{\beta_{j}^{t}}{e_{j}}+\nu\right)}-1$
		$\displaystyle\geq\max_{j}\frac{x_{j}^{t}\alpha_{j}}{\frac{\beta_{j}^{t}}{e_{j}% }+\nu}-1>0,$

As a result, we have

\nu_{\text{up}}=\max_{j}\left(x_{j}^{t}\alpha_{j}-\frac{\beta_{j}^{t}}{e_{j}}\right)

To improve numerical stability, one could use $\nu_{\text{low}}^{\prime}=2n\max_{j}x_{j}^{t}\alpha_{j}-\min_{j}\frac{\beta_{j}^{t}}{e_{j}}$ and $\nu_{\text{up}}^{\prime}=\max_{j}\left(\frac{x_{j}^{t}\alpha_{j}}{2}-\frac{\beta_{j}^{t}}{e_{j}}\right)$ .

B.2 Case 2: QU

For QU, we aim to solve equation (25) for $\nu$ :

h_{2}\left(\nu\right)=\sum_{j}e_{j}\max\left(\frac{-\beta_{j}^{t}-\nu e_{j}+% \sqrt{\left(\beta_{j}^{t}+\nu e_{j}\right)^{2}+4\alpha\zeta_{j}^{t}}}{2\alpha}% ,\epsilon\right)-1=0.

Let’s analyze the function $h_{2}$ . We observe that the function $-x+\sqrt{x^{2}+\delta}$ is strictly decreasing over $\mathbb{R}$ for $\delta>0$ , since its derivative $-1+\frac{x}{\sqrt{x^{2}+\delta}}$ is strictly negative for $\delta>0$ . Therefore, each term of the sum is decreasing, and $h_{2}$ is a decreasing function. Note that as soon as, for at least one $j$ , we have

-\beta_{j}^{t}-\nu e_{j}+\sqrt{(\beta_{j}^{t}+\nu e_{j})^{2}+4\alpha\zeta_{j}^{t}}\geq 2\alpha\epsilon,

the corresponding term is no longer clipped at $\epsilon$ and the function $h_{2}$ becomes strictly decreasing. In the limit, we have $\lim_{\nu\to-\infty}h_{2}(\nu)=\infty$ and $\lim_{\nu\to\infty}h_{2}(\nu)=\epsilon\|\bm{e}\|_{1}-1$ . Assuming $\epsilon\|\bm{e}\|_{1}<1$ (which ensures the feasibility of the constraints $\bm{x}\geq\epsilon$ and $\bm{e}^{\top}\bm{x}=1$ and makes the limit at $+\infty$ negative), we know that the function $h_{2}$ has exactly one root.

Negative bound

Let’s find $\nu_{\text{low}}$ such that $h_{2}(\nu)\leq 0$ for $\nu\geq\nu_{\text{low}}$ . We start by bounding the term

-\beta_{j}^{t}-\nu e_{j}+\sqrt{\left(\beta_{j}^{t}+\nu e_{j}\right)^{2}+4% \alpha\zeta_{j}^{t}}\leq\delta_{j},

where $\delta_{j}>\epsilon>0$ . We move $-\beta_{j}^{t}-\nu e_{j}$ to the right-hand side and square the inequality to remove the square root:

\left(\beta_{j}^{t}+\nu e_{j}\right)^{2}+4\alpha\zeta_{j}^{t}\leq\left(\beta_{j}^{t}+\nu e_{j}+\delta_{j}\right)^{2}

Eventually, we can extract a bound for $\nu$ to ensure that the inequality is satisfied for a chosen $\delta_{j}$ :

\nu\geq\frac{4\alpha\zeta_{j}^{t}-\delta_{j}^{2}}{2\delta_{j}e_{j}}-\frac{\beta_{j}^{t}}{e_{j}}.

Let’s set $\delta_{j}=\frac{2\alpha}{me_{j}}$ , where $m$ is the number of elements in the sum, and take the maximum over $j$ to obtain the bound:

\nu_{\text{low}}\geq\max_{j}\left(m\zeta_{j}^{t}-\frac{\alpha}{me_{j}^{2}}-% \frac{\beta_{j}^{t}}{e_{j}}\right)

We observe the validity of this bound provided that $\delta_{j}=\frac{2\alpha}{me_{j}}\geq\epsilon$ for all $j$ . Then, it can be verified that

	$\displaystyle h_{2}\left(\nu_{\text{low}}\right)$	$\displaystyle=\sum_{j}e_{j}\max\left(\frac{-\beta_{j}^{t}-\nu_{\text{low}}e_{j% }+\sqrt{\left(\beta_{j}^{t}+\nu_{\text{low}}e_{j}\right)^{2}+4\alpha\zeta_{j}^% {t}}}{2\alpha},\epsilon\right)-1$
		$\displaystyle=\sum_{j}e_{j}\frac{-\beta_{j}^{t}-\nu_{\text{low}}e_{j}+\sqrt{% \left(\beta_{j}^{t}+\nu_{\text{low}}e_{j}\right)^{2}+4\alpha\zeta_{j}^{t}}}{2% \alpha}-1$
		$\displaystyle\leq\sum_{j}e_{j}\frac{\delta_{j}}{2\alpha}-1$
		$\displaystyle=\sum_{j}^{m}e_{j}\frac{\frac{2\alpha}{me_{j}}}{2\alpha}-1=0,$

which proves that $\nu_{\text{low}}$ is a valid negative bound.

Positive bound

Similarly, let’s find $\nu_{\text{up}}$ such that $h_{2}(\nu)\geq 0$ for $\nu\leq\nu_{\text{up}}$ . This time, we will bound $h_{2}$ from below and obtain

	$\displaystyle h_{2}\left(\nu\right)$	$\displaystyle=\sum_{j}e_{j}\max\left(\frac{-\beta_{j}^{t}-\nu e_{j}+\sqrt{% \left(\beta_{j}^{t}+\nu e_{j}\right)^{2}+4\alpha\zeta_{j}^{t}}}{2\alpha},% \epsilon\right)-1$
		$\displaystyle\geq\sum_{j}e_{j}\frac{-\beta_{j}^{t}-\nu e_{j}+\sqrt{\left(\beta% _{j}^{t}+\nu e_{j}\right)^{2}+4\alpha\zeta_{j}^{t}}}{2\alpha}-1$
		$\displaystyle\geq\frac{1}{2\alpha}\sum_{j}e_{j}\left(-\beta_{j}^{t}-\nu e_{j}% \right)-1$
		$\displaystyle=-\frac{1}{2\alpha}\left(\sum_{j}e_{j}\beta_{j}^{t}+\nu\sum_{j}e_% {j}^{2}\right)-1>0.$

We can, therefore, define

\nu_{\text{up}}=-\frac{2\alpha+\sum_{j}e_{j}\beta_{j}^{t}}{\sum_{j}e_{j}^{2}}.
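The two bounds of this subsection can be combined into a dichotomy as sketched below. This is a minimal illustration, assuming $e_{j}>0$ for all $j$ , $\frac{2\alpha}{me_{j}}\geq\epsilon$ and $\epsilon\|\bm{e}\|_{1}<1$ ; the function names (`h2`, `bounds`, `dichotomy`), the tolerance, and the demo coefficients are hypothetical. Since $h_{2}$ is decreasing, the bracket runs from $\nu_{\text{up}}$ (where $h_{2}\geq 0$ ) to $\nu_{\text{low}}$ (where $h_{2}\leq 0$ ).

```python
import math


def h2(nu, alpha, beta, zeta, e, eps):
    """h_2(nu) = sum_j e_j * max(positive QU root with shifted beta, eps) - 1."""
    total = 0.0
    for bj, zj, ej in zip(beta, zeta, e):
        s = bj + nu * ej
        xj = (-s + math.sqrt(s * s + 4.0 * alpha * zj)) / (2.0 * alpha)
        total += ej * max(xj, eps)
    return total - 1.0


def bounds(alpha, beta, zeta, e):
    """Return (nu_low, nu_up) with h2(nu_low) <= 0 <= h2(nu_up)."""
    m = len(e)
    nu_low = max(m * zj - alpha / (m * ej * ej) - bj / ej
                 for bj, zj, ej in zip(beta, zeta, e))
    nu_up = (-(2.0 * alpha + sum(ej * bj for ej, bj in zip(e, beta)))
             / sum(ej * ej for ej in e))
    return nu_low, nu_up


def dichotomy(alpha, beta, zeta, e, eps, tol=1e-12, max_iter=200):
    """Bisection on the decreasing function h2 between nu_up and nu_low."""
    nu_low, nu_up = bounds(alpha, beta, zeta, e)
    left, right = nu_up, nu_low  # h2(left) >= 0 >= h2(right)
    for _ in range(max_iter):
        mid = 0.5 * (left + right)
        if h2(mid, alpha, beta, zeta, e, eps) > 0.0:
            left = mid
        else:
            right = mid
        if right - left < tol:
            break
    return 0.5 * (left + right)


alpha_d, eps_d = 1.0, 1e-6
beta_d = [0.5, -0.2, 1.0]
zeta_d = [0.3, 0.8, 0.1]
e_d = [1.0, 1.0, 1.0]
nu_star = dichotomy(alpha_d, beta_d, zeta_d, e_d, eps_d)
x_star = [
    max((-(b + nu_star * ej) + math.sqrt((b + nu_star * ej) ** 2
                                         + 4.0 * alpha_d * z)) / (2.0 * alpha_d),
        eps_d)
    for b, z, ej in zip(beta_d, zeta_d, e_d)
]
print(nu_star, x_star, sum(ej * xj for ej, xj in zip(e_d, x_star)))
```

At the returned $\nu$ , the recovered $\bm{x}$ should satisfy $\bm{e}^{\top}\bm{x}=1$ up to the bisection tolerance.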

References

[1] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality. Mathematics of operations research, 35(2):438–457, 2010.
[2] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
[3] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
[4] Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12):4164–4169, 2004.
[5] Stefania Cacovich, Fabio Matteocci, Mojtaba Abdi-Jalebi, Samuel D Stranks, Aldo Di Carlo, Caterina Ducati, and Giorgio Divitini. Unveiling the chemical composition of halide perovskite films using multivariate statistical analyses. ACS Applied Energy Materials, 1(12):7174–7181, 2018.
[6] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE transactions on pattern analysis and machine intelligence, 33(8):1548–1560, 2010.
[7] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences, 92(3):708–721, 2009.
[8] Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Hierarchical als algorithms for nonnegative matrix and 3d tensor factorization. In International Conference on Independent Component Analysis and Signal Separation, pages 169–176. Springer, 2007.
[9] Yu-Hong Dai and Yaxiang Yuan. A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on optimization, 10(1):177–182, 1999.
[10] Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu. Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis. Neural computation, 21(3):793–830, 2009.
[11] Dan Fu, Gary Holtom, Christian Freudiger, Xu Zhang, and Xiaoliang Sunney Xie. Hyperspectral imaging with stimulated raman scattering by chirped femtosecond lasers. The Journal of Physical Chemistry B, 117(16):4634–4640, 2013.
[12] Nicolas Gillis. The why and how of nonnegative matrix factorization. In Regularization, Optimization, Kernels, and Support Vector Machines, pages 275–310. Chapman and Hall/CRC, 2014.
[13] Prem Gopalan, Jake M Hofman, and David M Blei. Scalable recommendation with poisson factorization. arXiv preprint arXiv:1311.1704, 2013.
[14] Filip Hanzely, Peter Richtarik, and Lin Xiao. Accelerated bregman proximal gradient methods for relatively smooth convex optimization. Computational Optimization and Applications, 79:405–440, 2021.
[15] Niao He, Zaid Harchaoui, Yichen Wang, and Le Song. Fast and simple optimization for poisson likelihood models. arXiv preprint arXiv:1608.01264, 2016.
[16] Le Thi Khanh Hien and Nicolas Gillis. Algorithms for nonnegative matrix factorization with the kullback–leibler divergence. Journal of Scientific Computing, 87(3):1–32, 2021.
[17] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57, 1999.
[18] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1064–1072, 2011.
[19] BR Jany, Arkadiusz Janas, and Franciszek Krok. Retrieving the quantitative chemical information at nanoscale from scanning electron microscope energy dispersive x-ray measurements by machine learning. Nano letters, 17(11):6520–6525, 2017.
[20] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
[21] Ramakrishnan Kannan, AV Ievlev, Nouamane Laanait, Maxim A Ziatdinov, Rama K Vasudevan, Stephen Jesse, and Sergei V Kalinin. Deep data analysis via physically constrained linear unmixing: universal framework, domain examples, and a community-wide platform. Advanced Structural and Chemical Imaging, 4(1):1–20, 2018.
[22] Hideaki Kano, Hiroki Segawa, Masanari Okuno, Philippe Leproux, and Vincent Couderc. Hyperspectral coherent raman imaging–principle, theory, instrumentation, and applications to life sciences. Journal of Raman Spectroscopy, 47(1):116–123, 2016.
[23] Hyunsoo Kim and Haesun Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM journal on matrix analysis and applications, 30(2):713–730, 2008.
[24] Jingu Kim, Yunlong He, and Haesun Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014.
[25] Jingu Kim and Haesun Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011.
[26] Nikos Komodakis and Jean-Christophe Pesquet. Playing with duality: An overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine, 32(6):31–54, 2015.
[27] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[28] Paul G Kotula, Michael R Keenan, and Joseph R Michael. Automated analysis of sem x-ray spectral images: A powerful new microanalysis tool. Microscopy and Microanalysis, 9(1):1–17, 2003.
[29] Daniel Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 556–562. MIT Press, 2001.
[30] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
[31] Qiuwei Li, Zhihui Zhu, Gongguo Tang, and Michael B Wakin. Provable bregman-divergence based methods for nonconvex and non-lipschitz problems. arXiv preprint arXiv:1904.09712, 2019.
[32] Xinghua Li, Liyuan Wang, Qing Cheng, Penghai Wu, Wenxia Gan, and Lina Fang. Cloud removal in remote sensing images using nonnegative matrix factorization and error correction. ISPRS journal of photogrammetry and remote sensing, 148:103–113, 2019.
[33] Chih-Jen Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007.
[34] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10):2756–2779, 2007.
[35] Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
[36] Xiaoqiang Lu, Hao Wu, Yuan Yuan, Pingkun Yan, and Xuelong Li. Manifold regularized sparse nmf for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 51(5):2815–2826, 2012.
[37] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
[38] Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 198–207, 2005.
[39] Pentti Paatero. Least squares formulation of robust non-negative factor analysis. Chemometrics and intelligent laboratory systems, 37(1):23–35, 1997.
[40] Ioannis Panageas, Georgios Piliouras, and Xiao Wang. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. Advances in Neural Information Processing Systems, 32, 2019.
[41] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
[42] Clément W Royer, Michael O’Neill, and Stephen J Wright. A newton-cg algorithm with complexity guarantees for smooth unconstrained optimization. Mathematical Programming, 180(1):451–488, 2020.
[43] Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018.
[44] Joseph Salmon, Zachary Harmany, Charles-Alban Deledalle, and Rebecca Willett. Poisson noise reduction with non-local pca. Journal of mathematical imaging and vision, 48(2):279–294, 2014.
[45] Motoki Shiga, Kazuyoshi Tatsumi, Shunsuke Muto, Koji Tsuda, Yuta Yamamoto, Toshiyuki Mori, and Takayoshi Tanji. Sparse modeling of eels and edx spectral imaging data by nonnegative matrix factorization. Ultramicroscopy, 170:43–59, 2016.
[46] Ajit P Singh and Geoffrey J Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 650–658, 2008.
[47] Paris Smaragdis and Judith C Brown. Non-negative matrix factorization for polyphonic music transcription. In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No. 03TH8684), pages 177–180. IEEE, 2003.
[48] Dennis L Sun and Cedric Fevotte. Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 6201–6205. IEEE, 2014.
[49] Leo Taslaman and Björn Nilsson. A framework for regularized non-negative matrix factorization, with application to the analysis of gene expression data. PloS one, 7(11):e46331, 2012.
[50] Marc Teboulle and Yakov Vaisbourd. Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM Journal on Imaging Sciences, 13(1):381–421, 2020.
[51] Adrien Teurtrie, Nathanaël Perraudin, Thomas Holvoet, Hui Chen, Duncan TL Alexander, Guillaume Obozinski, and Cécile Hébert. espm: A python library for the simulation of stem-edxs datasets. Ultramicroscopy, page 113719, 2023.
[52] Adrien Teurtrie, Nathanaël Perraudin, Thomas Holvoet, Hui Chen, Duncan TL Alexander, Guillaume Obozinski, and Cécile Hébert. From stem-edxs data to phase separation and quantification using physics-guided nmf. To appear, 2024.
[53] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109:475–494, 2001.
[54] Musundi B Wabuyele, Fei Yan, Guy D Griffin, and Tuan Vo-Dinh. Hyperspectral surface-enhanced raman imaging of labeled silver nanoparticles in single cells. Review of scientific instruments, 76(6):063710, 2005.
[55] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.
[56] Felipe Yanez and Francis Bach. Primal-dual algorithms for non-negative matrix factorization with the kullback-leibler divergence. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2257–2261. IEEE, 2017.
[57] Andrew B Yankovich, Chenyu Zhang, Albert Oh, Thomas JA Slater, Feridoon Azough, Robert Freer, Sarah J Haigh, Rebecca Willett, and Paul M Voyles. Non-rigid registration and non-local principle component analysis to improve electron microscopy spectrum images. Nanotechnology, 27(36):364001, 2016.
[58] Minchao Ye, Yuntao Qian, and Jun Zhou. Multitask sparse nonnegative matrix factorization for joint spectral–spatial hyperspectral imagery denoising. IEEE Transactions on Geoscience and Remote Sensing, 53(5):2621–2639, 2014.
[59] Chenyu Zhang, Rungang Han, Anru R Zhang, and Paul M Voyles. Denoising atomic resolution 4d scanning transmission electron microscopy data with tensor singular value decomposition. Ultramicroscopy, 219:113123, 2020.
[60] Changzhong Zou and Youshen Xia. Restoration of hyperspectral image contaminated by poisson noise using spectral unmixing. Neurocomputing, 275:430–437, 2018.

