Efficient algorithms for regularized Poisson Non-negative Matrix Factorization

Nathanaël Perraudin, Swiss Data Science Center, EPFL and ETH Zürich, Andreasstrasse 5, Zürich, 8050, Zürich, Switzerland
Adrien Teutrie, Unité Matériaux et Transformations, UMR-CNRS 8207, Université de Lille, Cité scientifique, Bâtiment C6, Villeneuve d’Ascq, 59655, Nord, France
Cécile Hébert, Electron Spectrometry and Microscopy Laboratory, Institute of Physics, EPFL, Bâtiment PH, Station 3, Lausanne, 1015, Vaud, Switzerland
Guillaume Obozinski, Swiss Data Science Center, EPFL and ETH Zürich, Andreasstrasse 5, Zürich, 8050, Zürich, Switzerland

Abstract

We consider the regularized Poisson Non-negative Matrix Factorization (NMF) problem, encompassing various regularization terms such as Lipschitz and relatively smooth functions, alongside linear constraints. This problem is relevant to numerous Machine Learning applications, particularly physical linear unmixing problems. A notable challenge arises from the main loss term in the Poisson NMF problem being a KL divergence, which is not gradient Lipschitz, rendering traditional gradient descent-based approaches inefficient. In this contribution, we explore the use of Block Successive Upper-bound Minimization (BSUM) to overcome this challenge. We build appropriate majorizing functions for Lipschitz and relatively smooth functions, and show how to introduce linear constraints into the problem. This results in two novel algorithms for regularized Poisson NMF. We conduct numerical simulations to showcase the effectiveness of our approach.

Disclaimer

This document is a technical report and has not undergone peer review. The findings and conclusions presented herein are solely based on the authors’ research and analysis. We apologize for any potential errors or shortcomings in the content.

1 Introduction

The problem of factorizing a matrix $\bm{Y}\approx\bm{W}\bm{H}$ into non-negative components $\bm{W}\geq 0,\bm{H}\geq 0$ is central in many Machine Learning (ML) applications [47, 10, 4]. The motivation for performing such a factorization is that $\bm{Y}$ is often associated with a probability density of the form $P_{\bm{Y}}(\bm{W},\bm{H})=\tilde{P}_{\bm{Y}}(\bm{W}\bm{H})$. Typically, the optimal decomposition is found by minimizing the negative log-likelihood of that distribution:

$$\operatorname*{minimize}_{\bm{W}\geq 0,\,\bm{H}\geq 0}\; -\sum_{i,j}\log\left(P_{\bm{Y}}\left(\bm{W},\bm{H}\right)\right) \tag{1}$$

If $\bm{Y}$ is assumed to be perturbed with Normal noise, we obtain a Gaussian distribution, i.e. $P_{\bm{Y}}\left(\bm{W},\bm{H}\right)\propto e^{-\|\bm{W}\bm{H}-\bm{Y}\|^{2}}$, and we end up with the classic non-negative matrix factorization (NMF) problem [29, 30], where the quadratic function $\|\bm{W}\bm{H}-\bm{Y}\|^{2}$ is minimized. For other distribution families, and in particular exponential families, a $\log$ term often appears. In particular, the Poisson negative log-likelihood model [29] leads to a loss of the form

$$\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right):=-\left\langle\bm{Y},\log\left(\bm{W}\bm{H}\right)\right\rangle+\left\langle\mathbf{1},\bm{W}\bm{H}\right\rangle\propto-\sum_{i,j}\log\left(P_{\bm{Y}}\left(\bm{W},\bm{H}\right)\right), \tag{2}$$

where the inner product over matrices is the Frobenius inner product defined as $\left\langle\bm{A},\bm{B}\right\rangle=\sum_{ij}a_{ij}b_{ij}$.
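To make the loss (2) concrete, the following minimal NumPy sketch evaluates it and checks that it differs from the exact Poisson negative log-likelihood only by the term $\sum_{i,j}\log(y_{ij}!)$, which does not depend on $(\bm{W},\bm{H})$. The function names, matrix sizes, and random data are illustrative assumptions, not part of the original text.

```python
import math
import numpy as np

def poisson_nmf_loss(Y, W, H):
    """Loss (2): -<Y, log(WH)> + <1, WH>, with Frobenius inner products."""
    WH = W @ H
    return float(-np.sum(Y * np.log(WH)) + np.sum(WH))

def poisson_neg_log_likelihood(Y, W, H):
    """Exact Poisson negative log-likelihood, including the log(y!) term."""
    WH = W @ H
    log_factorial = np.vectorize(math.lgamma)(Y + 1.0)  # log(y!) = lgamma(y + 1)
    return float(np.sum(WH - Y * np.log(WH) + log_factorial))

# Hypothetical toy data: strictly positive factors and Poisson counts.
rng = np.random.default_rng(0)
W = rng.uniform(0.5, 2.0, size=(6, 3))
H = rng.uniform(0.5, 2.0, size=(3, 5))
Y = rng.poisson(W @ H).astype(float)

# The gap between the two quantities is a constant that depends on Y only.
constant = float(np.sum(np.vectorize(math.lgamma)(Y + 1.0)))
gap = poisson_neg_log_likelihood(Y, W, H) - poisson_nmf_loss(Y, W, H)
```

Since the gap is independent of the factors, minimizing (2) and minimizing the exact negative log-likelihood yield the same minimizers.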

Regularized Poisson Non-negative Matrix Factorization

In many problems (see, for example, [51, 50, 21, 52]), additional prior information about the matrices $\bm{W},\bm{H}$ is known. For example, it might be known that the columns of $\bm{H}$ are smooth, or that the rows of $\bm{W}$ are sparse. One might also be interested in normalizing to unity the columns of $\bm{H}$ or the rows of $\bm{W}$ because they might quantify physical quantities for which normalization is necessary. For example, in the analysis of hyperspectral imaging data, the images $\bm{H}$ are assumed to be smooth and the components $\bm{W}$ sum to one [51, 52]. Typically, this information can be encoded via an extra regularization term $R\left(\bm{W},\bm{H}\right)$ and/or additional constraints $\bm{W}\in\mathcal{C}_{1},\bm{H}\in\mathcal{C}_{2}$. This leads to the general optimization problem we solve in this contribution:

$$\begin{split}\operatorname*{minimize}_{\bm{W},\bm{H}}\quad&\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right){\color[rgb]{.75,.5,.25}+R_{W}\left(\bm{W}\right)+R_{H}\left(\bm{H}\right),}\\ \text{subject to}\quad&{\color[rgb]{.5,0,.5}\bm{W}\geq\epsilon,\ \bm{H}\geq\epsilon},\quad{\color[rgb]{0,1,1}\bm{e}_{H}^{\top}\bm{H}=\bm{1}\text{ or }\bm{W}\bm{e}_{W}=\bm{1}^{\top}}\end{split} \tag{3}$$

where $\epsilon>0$. The different colors emphasize the changes compared to the traditional problem of [29]. First, in violet, we slightly simplify the problem by imposing strict positivity. While this is not strictly necessary (the case $\epsilon=0$ could be handled with an approach similar to [33]), this assumption significantly simplifies our analysis. We believe that handling $\epsilon=0$ is an unnecessary complication, as the results with $\epsilon$ close to machine precision will be practically identical. Second, in light blue, we consider the case where the constraints are linear, specifically $\bm{e}_{H}^{\top}\bm{H}=\bm{1}$ or $\bm{W}\bm{e}_{W}=\bm{1}^{\top}$. In general, it is only meaningful to use one of the constraints, as either one already fixes the relative scale between $\bm{W}$ and $\bm{H}$. We note that this includes the simplex constraint when $\bm{e}=\bm{1}$. Third, in brown, we consider regularizations $R_{W}\left(\bm{W}\right)+R_{H}\left(\bm{H}\right)$ of the form:

$$r\left(\bm{x}\right)=s_{L}\left(\bm{x}\right)+s_{R}\left(\bm{x}\right)+\sum_{j=1}^{n}s_{C}\left(x_{j}\right), \tag{4}$$

where $\bm{x}$ is a row of $\bm{W}$ or a column of $\bm{H}$, i.e., $R_{W}\left(\bm{W}\right)=\sum_{i}r_{W}\left(\bm{w}_{i}\right)$ and $R_{H}\left(\bm{H}\right)=\sum_{j}r_{H}\left(\bm{h}_{j}\right)$.

  1. The term $s_{L}$ is assumed to be $\sigma_{L}$ gradient Lipschitz, i.e., there exists a $\sigma_{L}$ such that

    $$\|\nabla s_{L}\left(\bm{x}\right)-\nabla s_{L}\left(\bm{y}\right)\|_{2}\leq\sigma_{L}\|\bm{x}-\bm{y}\|_{2}\quad\text{for all }\bm{x},\bm{y}\in\mathcal{C}. \tag{5}$$

    This condition implies the following quadratic upper bound:

    $$s_{L}\left(\bm{y}\right)\leq s_{L}\left(\bm{x}\right)+\left\langle\nabla s_{L}\left(\bm{x}\right),\bm{y}-\bm{x}\right\rangle+\frac{\sigma_{L}}{2}\|\bm{x}-\bm{y}\|_{2}^{2}\quad\text{for all }\bm{x},\bm{y}\in\mathcal{C}.$$

    In this contribution, we will consider in particular $s_{L}\left(\bm{x}\right)=\bm{x}^{\top}\Delta\bm{x}$, where $\Delta$ is the Laplacian operator, favoring smoothness in the columns of $\bm{H}$. In this case $\sigma_{L}=2\lambda_{\text{max}}(\Delta)$.

  2. The term $s_{R}\left(\bm{x}\right)$ is assumed to be $\sigma_{R}$ relatively smooth with respect to $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$. Relative smoothness is a generalization of Lipschitz smoothness, and is defined as follows [35, Definition 1.1]:

    $$f\left(\bm{y}\right)\leq f\left(\bm{x}\right)+\left\langle\nabla f\left(\bm{x}\right),\bm{y}-\bm{x}\right\rangle+\sigma_{R}\mathcal{B}_{\kappa}\left(\bm{y},\bm{x}\right) \tag{6}$$

    where $\mathcal{B}_{\kappa}$ is the Bregman divergence [3] associated with $\kappa$ [35, equation 7]:

    $$\mathcal{B}_{\kappa}\left(\bm{y},\bm{x}\right):=\kappa\left(\bm{y}\right)-\kappa\left(\bm{x}\right)-\left\langle\nabla\kappa\left(\bm{x}\right),\bm{y}-\bm{x}\right\rangle\quad\text{for all }\bm{x},\bm{y}\in\mathcal{C}.$$

    While gradient Lipschitz functions can be upper bounded with quadratic functions, we observe from (6) that relative smoothness allows us to upper bound a function with the Bregman divergence of a function $\kappa$. This allows us to use a much wider range of functions to regularize our problem, in particular functions that are not gradient Lipschitz. As an example, let us consider $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$, with its Bregman divergence:

    $$\mathcal{B}_{\kappa}\left(\bm{y},\bm{x}\right)=\sum_{i}\left(\frac{y_{i}}{x_{i}}-\log\left(\frac{y_{i}}{x_{i}}\right)-1\right)$$

    One can observe that the objective function (2) is relatively smooth with respect to $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$. This term could also be used to introduce soft constraints such as a log-barrier: $s_{R}\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}-\epsilon\bm{1}\right)$ for $\bm{x}>\epsilon$ and $+\infty$ otherwise.

  3. Finally, $s_{C}\left(x\right)$ is a smooth point-wise concave function (i.e. $-s_{C}$ is convex). A typical example of this regularization is $s_{C}\left(x\right)=\log\left(x+\alpha^{-1}\right)$, which favors sparsity in the vector $\bm{x}$ without penalizing large values too heavily. Its slope starts at $\alpha$ for $x=0$ and tends to $0$ for $x\rightarrow\infty$.
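The three building blocks of (4) can be checked numerically. The sketch below is only illustrative: the path-graph Laplacian, the vector length, and the value of $\alpha$ are assumptions chosen for the example. It verifies the quadratic upper bound for $s_{L}(\bm{x})=\bm{x}^{\top}\Delta\bm{x}$ with $\sigma_{L}=2\lambda_{\text{max}}(\Delta)$, the closed form of the Bregman divergence of $\kappa$, and the slope of $s_{C}$ at zero.

```python
import numpy as np

def path_laplacian(n):
    """Combinatorial Laplacian of a path graph on n nodes (illustrative choice)."""
    lap = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    lap[0, 0] = lap[-1, -1] = 1.0
    return lap

def s_L(x, lap):
    """Smoothness regularizer x^T Delta x."""
    return float(x @ lap @ x)

def kappa(x):
    """Reference function kappa(x) = -sum(log(x))."""
    return float(-np.sum(np.log(x)))

def bregman_kappa(y, x):
    """Closed form of the Bregman divergence of kappa."""
    ratio = y / x
    return float(np.sum(ratio - np.log(ratio) - 1.0))

def s_C(x, alpha):
    """Concave sparsity-promoting term log(x + 1/alpha)."""
    return np.log(x + 1.0 / alpha)

rng = np.random.default_rng(0)
n = 8
lap = path_laplacian(n)
sigma_L = 2.0 * np.linalg.eigvalsh(lap).max()
x = rng.uniform(0.1, 2.0, n)
y = rng.uniform(0.1, 2.0, n)

# Quadratic upper bound of s_L at x, evaluated at y (gradient is 2 * Delta * x).
quad_bound = s_L(x, lap) + (2.0 * lap @ x) @ (y - x) + 0.5 * sigma_L * np.sum((y - x) ** 2)
bound_gap = quad_bound - s_L(y, lap)  # non-negative if the bound holds

# Bregman divergence from its definition; the gradient of kappa is -1/x.
bregman_from_definition = kappa(y) - kappa(x) - (-1.0 / x) @ (y - x)

# Finite-difference slope of s_C at x = 0, expected to be close to alpha.
alpha = 5.0
step = 1e-7
slope_at_zero = (s_C(step, alpha) - s_C(0.0, alpha)) / step
```

Replacing the path-graph Laplacian by any symmetric positive semi-definite matrix leaves the first check unchanged, since only $\lambda_{\text{max}}$ enters the bound.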

Our approach can handle general regularizations of the form $R\left(\bm{W},\bm{H}\right)$. However, for simplicity of notation, we restrict ourselves to separable regularizations.

The fidelity term $\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right)$ is convex in $\bm{H}$ for fixed $\bm{W}$ and convex in $\bm{W}$ for fixed $\bm{H}$, but it is not jointly convex. We note that, depending on the regularization and the constraints, multiple equivalent scaled solutions (stationary points) could exist, i.e., $\bm{W}\bm{H}=\bm{W}^{\prime}\bm{H}^{\prime}$ for $\bm{W}^{\prime}=\alpha^{-1}\bm{W}$ and $\bm{H}^{\prime}=\alpha\bm{H}$. However, this is generally not the case with additional constraints or regularizations.
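The scaling ambiguity is easy to observe numerically. In the following sketch (toy data and the choice $\alpha=3.7$ are assumptions), the fidelity term is unchanged under $\bm{W}^{\prime}=\alpha^{-1}\bm{W}$, $\bm{H}^{\prime}=\alpha\bm{H}$, whereas a regularizer on $\bm{H}$, here the concave term $\log(x+\alpha_{C}^{-1})$ from item 3, does change, which is why regularization typically removes the ambiguity.

```python
import numpy as np

def poisson_nmf_loss(Y, W, H):
    """Loss (2): -<Y, log(WH)> + <1, WH>."""
    WH = W @ H
    return float(-np.sum(Y * np.log(WH)) + np.sum(WH))

def log_sparsity(H, alpha_c=5.0):
    """Sum of log(h + 1/alpha_c) over all entries (illustrative regularizer)."""
    return float(np.sum(np.log(H + 1.0 / alpha_c)))

rng = np.random.default_rng(1)
W = rng.uniform(0.5, 2.0, size=(6, 3))
H = rng.uniform(0.5, 2.0, size=(3, 5))
Y = rng.poisson(W @ H).astype(float)

alpha = 3.7
W_scaled, H_scaled = W / alpha, alpha * H

loss_original = poisson_nmf_loss(Y, W, H)
loss_scaled = poisson_nmf_loss(Y, W_scaled, H_scaled)  # same product WH, same loss
reg_original = log_sparsity(H)
reg_scaled = log_sparsity(H_scaled)                    # differs from reg_original
```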

Why is this problem challenging?

In general, Poisson Non-negative Matrix Factorization, i.e. minimizing $\mathcal{L}_{\bm{Y}}\left(\bm{W},\bm{H}\right)$, is challenging because the loss is not gradient Lipschitz despite being differentiable for $\bm{W},\bm{H}>0$. (Strictly speaking, $\mathcal{L}_{\bm{Y}}$ is gradient Lipschitz on the feasible set because of the constraint $\bm{W},\bm{H}\geq\epsilon$. However, in practice, $\epsilon$ is chosen to be so small that the resulting Lipschitz constant is too large to be useful.) In practice, this implies that there exists no fixed learning rate that ensures convergence of gradient descent and that line search would have to be used. For that reason, solving the more general problem (3) is a difficult task, and to the best of our knowledge, there are no existing algorithms that can be directly applied to it. Although there exist many algorithms to solve the traditional Poisson NMF problem [16, 30, 29, 24, 12], none of them focuses on the regularized case (see the related work in Section 2). Our main contribution is to fill this gap by providing multiple algorithms that minimize (3) for a wide range of regularizations $R\left(\bm{W},\bm{H}\right)$ described by (4) and some linear constraints.
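A one-dimensional analogue illustrates the difficulty. For the scalar function $f(x)=x-y\log(x)$, the curvature $f^{\prime\prime}(x)=y/x^{2}$ is unbounded as $x\rightarrow 0$, so a step size that is safe in one region can increase the loss in another. The values of $y$, the starting point, and the step size below are assumptions chosen for illustration.

```python
import numpy as np

y = 5.0  # a single observed count (illustrative value)

def f(x):
    """Scalar Poisson loss x - y * log(x); its minimizer is x = y."""
    return x - y * np.log(x)

def grad(x):
    return 1.0 - y / x

def curvature(x):
    return y / x ** 2

# The curvature grows without bound as x approaches zero.
xs = np.array([1.0, 1e-2, 1e-4, 1e-8])
curvatures = curvature(xs)

# A fixed-step gradient step from a small starting point overshoots the minimizer
# so far that the loss increases.
step_size = 0.1
x0 = 1e-2
x1 = x0 - step_size * grad(x0)
loss_before, loss_after = f(x0), f(x1)
```

Shrinking `step_size` does not solve the problem in general: for any fixed value, a starting point sufficiently close to zero produces the same overshoot.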

Our approach

A natural approach to minimize (3) is to optimize over one of the variables $\bm{W},\bm{H}$ at a time. For example, Block Coordinate Descent (BCD) can be expressed as

$$\bm{W}^{t+1}\leftarrow\operatorname*{minimize}_{\bm{W}\in\mathcal{C}_{\bm{W}}}\;\mathcal{L}\left(\bm{W},\bm{H}^{t}\right), \tag{7}$$
$$\bm{H}^{t+1}\leftarrow\operatorname*{minimize}_{\bm{H}\in\mathcal{C}_{\bm{H}}}\;\mathcal{L}\left(\bm{W}^{t+1},\bm{H}\right). \tag{8}$$

This type of iterative scheme ensures that the loss does not increase between iterations and has been successfully used for the L2 case. However, in the Poisson case, there is no closed-form solution for problems (7) and (8), making this approach generally computationally expensive.

Fortunately, in practice, one does not need to find the global minima of (7) and (8) at each iteration. Instead, using a Block Successive Upper-bound Minimization (BSUM) algorithm [41], it is sufficient to minimize approximations of $\mathcal{L}$ which are locally tight upper bounds of $\mathcal{L}$. To use BSUM efficiently, these approximation functions need to have three properties: (1) satisfy the hypotheses of the BSUM Theorem [41, Theorem 2], (2) be as tight as possible, and (3) be easy to optimize, i.e. lead to a closed-form solution for each subproblem.
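As a point of reference for the upper-bound idea, and not as one of the algorithms derived in this report, the classic multiplicative updates for unregularized Poisson (KL) NMF [30, 29] can be read as block-wise minimization of a separable upper bound that is tight at the current iterate, with a closed-form minimizer. The sketch below assumes toy data, no regularization, and no linear constraint; the clipping at `eps` only mimics the constraint $\bm{W},\bm{H}\geq\epsilon$.

```python
import numpy as np

def poisson_nmf_loss(Y, W, H):
    """Loss (2): -<Y, log(WH)> + <1, WH>."""
    WH = W @ H
    return float(-np.sum(Y * np.log(WH)) + np.sum(WH))

def multiplicative_sweep(Y, W, H, eps=1e-12):
    """One sweep over both blocks; each half-step is the closed-form minimizer
    of a separable upper bound of the loss in that block."""
    # Update W with H fixed: numerator (Y / WH) H^T, denominator 1 H^T.
    W = W * ((Y / (W @ H)) @ H.T) / H.sum(axis=1)[None, :]
    W = np.maximum(W, eps)
    # Update H with the new W fixed: numerator W^T (Y / WH), denominator W^T 1.
    H = H * (W.T @ (Y / (W @ H))) / W.sum(axis=0)[:, None]
    H = np.maximum(H, eps)
    return W, H

# Hypothetical toy problem.
rng = np.random.default_rng(0)
m, n, k = 12, 15, 3
Y = rng.poisson(rng.uniform(1.0, 3.0, (m, k)) @ rng.uniform(1.0, 3.0, (k, n))).astype(float)

W = rng.uniform(0.5, 1.5, (m, k))
H = rng.uniform(0.5, 1.5, (k, n))
losses = [poisson_nmf_loss(Y, W, H)]
for _ in range(50):
    W, H = multiplicative_sweep(Y, W, H)
    losses.append(poisson_nmf_loss(Y, W, H))
```

The recorded losses form a non-increasing sequence, which is the behavior that properties (1) to (3) above are meant to guarantee while keeping every sub-step cheap.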

Our contributions can be summarized as follows. We show how regularized Poisson NMF can be efficiently solved using BSUM. We derive tight upper bounds for multiple regularizers and compare our approach with traditional algorithms. We also propose a simple way to introduce linear constraints into the problem and suggest using line search to build even tighter upper bounds. Finally, we propose multiple algorithms for regularized Poisson NMF and conduct numerical simulations to demonstrate the effectiveness of our approach.

Outline of this contribution

In Section 2, we provide a review of the literature. In Section 3, we clarify the notation and provide the necessary definitions for the BSUM Theorem [41, Theorem 2] that will be used for the convergence of our algorithm. In Section 4, we develop convenient approximations of the objective and regularization functions leading to sub-problems with closed-form solutions. In Section 4.3, we explore how to modify the optimization scheme to introduce generalized simplex constraints. In Section 5, we present our algorithms. Section 6 provides numerical applications of our algorithm, and Section 7 concludes this contribution.

2 Related work

Applications of Poisson Distribution Likelihood Maximization

Maximizing the likelihood of a Poisson distribution arises in various applications, each of which leads to a problem of the form (3).

Many such applications arise in the domain of physical constrained linear unmixing problems [21]. Some noteworthy instances encompass: 1. Scanning transmission electron microscopy (STEM) [52, 45, 5, 19], 2. Hyperspectral Raman and optical imaging [22, 54, 11], 3. Tensor SVD applied to denoise atomic-resolution 4D scanning transmission electron microscopy [59], and 4. Non-local Poisson PCA denoising [44, 57]. It is noteworthy that many of these applications predominantly employ the L2 case, which offers a comparatively simpler solution. As a result, data is often renormalized to convert Poisson distributions into Gaussian distributions [28]. Nevertheless, the efficacy of these applications could be significantly enhanced by the development of algorithms tailored explicitly for the Poisson case.

Furthermore, Problem (3) also surfaces in hyperspectral image denoising, where noise is assumed to follow a Poisson distribution [60, 58]. In the domain of text mining, the Poisson distribution assumption is frequently utilized for modeling word occurrences based on latent variables such as categories, leading to the problem formulation depicted by (3) [17, 38]. Additionally, within the context of recommender systems, several matrix factorization problems can be reformulated into the structure of (3) [46, 13, 27].

Other Optimization Approaches

The literature offers a limited number of optimization methods suitable for addressing Problem (3). This limitation arises because many optimization techniques require a continuously differentiable objective with a Lipschitz gradient. Examples of such techniques include gradient descent [40], perturbed gradient descent [20], nonlinear conjugate gradient method [9], various proximal point minimization algorithms [26], and second-order methods like the Newton-CG algorithms [42, 43].

One potentially attractive direction is the use of Proximal Alternating (Linearized) Minimization (PALM) [2] or Proximal Alternating Minimization (PAM) [1]. These algorithms are designed to solve problems of the form:

$$\operatorname*{minimize}_{\bm{W},\bm{H}}\;R_{W}\left(\bm{W}\right)+R_{H}\left(\bm{H}\right)+\mathcal{L}\left(\bm{W},\bm{H}\right)$$

PALM employs a Gauss-Seidel iteration scheme, consisting of the following sub-problems:

$$\begin{aligned}\bm{W}^{k+1}&=\operatorname*{arg\,min}_{\bm{W}}\;R_{W}\left(\bm{W}\right)+\mathcal{L}\left(\bm{W},\bm{H}^{k}\right)+c_{W}\left\|\bm{W}-\bm{W}^{k}\right\|_{2}^{2}\\ \bm{H}^{k+1}&=\operatorname*{arg\,min}_{\bm{H}}\;R_{H}\left(\bm{H}\right)+\mathcal{L}\left(\bm{W}^{k+1},\bm{H}\right)+c_{H}\left\|\bm{H}-\bm{H}^{k}\right\|_{2}^{2}\end{aligned}$$

Unfortunately, PALM requires the objective function $\mathcal{L}$ to possess a Lipschitz gradient, which is not the case in our scenario. Additionally, the Gauss-Seidel iterations generally lack a closed-form solution, resulting in a slow algorithm with sub-iterations.

To overcome the non-Lipschitz gradient issue, Bregman gradient descent (B-GD) [31, Algorithm 1.1] can be considered. This type of algorithm has been extended to alternating minimization [31, Algorithm 1.3 and 1.4]. Such an approach can be adapted to our case, as the objective function (3) exhibits relative smoothness for most regularization scenarios (see (6)). However, this optimization scheme involves non-tight majorization functions, leading to slow convergence, as discussed in Section 5.2.
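To illustrate what a Bregman gradient step looks like with $\kappa(\bm{x})=-\bm{1}^{\top}\log(\bm{x})$, the sketch below applies it to a single block: a vector $\bm{x}$ is fitted with a fixed matrix $\bm{A}$ playing the role of the other factor. Minimizing the right-hand side of (6) over $\bm{y}$ gives the closed form $\bm{x}^{+}=\bm{x}/(1+\bm{x}\odot\nabla f(\bm{x})/\sigma)$. The constant $\sigma=\|\bm{y}\|_{1}$ is an assumption imported from the Bregman-gradient literature rather than from this report, and the data are synthetic.

```python
import numpy as np

def kl_loss(y, A, x):
    """Single-block Poisson loss: sum(Ax - y * log(Ax))."""
    Ax = A @ x
    return float(np.sum(Ax - y * np.log(Ax)))

def kl_grad(y, A, x):
    """Gradient of kl_loss with respect to x."""
    return A.T @ (1.0 - y / (A @ x))

def bregman_step(x, grad, sigma):
    """Minimizer over u of <grad, u> + sigma * B_kappa(u, x), kappa = -sum(log).
    Stationarity reads grad + sigma * (1/x - 1/u) = 0."""
    return x / (1.0 + x * grad / sigma)

# Hypothetical toy problem.
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 2.0, size=(20, 4))
y = rng.poisson(A @ rng.uniform(0.5, 2.0, size=4)).astype(float)

sigma = y.sum()  # assumed relative-smoothness constant (L1 norm of y)
x = np.ones(4)
losses = [kl_loss(y, A, x)]
for _ in range(200):
    x = bregman_step(x, kl_grad(y, A, x), sigma)
    losses.append(kl_loss(y, A, x))
```

The iterates stay strictly positive and the loss never increases, but because $\sigma$ scales with the total count, the steps are conservative, which is consistent with the slow convergence mentioned above.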

Given the presence of two blocks of variables, Block Coordinate Descent (BCD) algorithms [53] naturally emerge as a potential solution. In fact, previous work [24] demonstrates that many existing approaches can be viewed as BCD schemes. However, a primary challenge with BCD is that it requires full minimization of the sub-problems, which is difficult in the Poisson case. In the L2 case, by contrast, the subproblems often have closed-form solutions. To address this concern, we explore BSUM algorithms [41] in this study, a generalization of BCD that avoids the requirement for full minimization of the subproblems and instead minimizes upper bounds of the objective function.

Non-Negative Matrix Factorization

Non-Negative Matrix Factorization (NMF) algorithms have been extensively studied, and for a comprehensive review, one can refer to [12]. Initially formulated as Positive Matrix Factorization for the Gaussian (L2-NMF) case by [39], NMF has seen numerous algorithmic developments. In the L2 case, popular approaches include the Alternating Nonnegative Least Squares (ANLS) framework [23, 25, 34] and the Hierarchical Alternating Least Squares (HALS) method [7, 8]. As for the Poisson case, known as KL NMF, the first algorithm using Multiplicative Updates (MU) was proposed by [30], with later demonstrations of its convergence provided in [29] for both Poisson and Gaussian cases. A more rigorous convergence analysis is presented in [33].

Considering the specific problem of Poisson KL-NMF, there have been a few notable contributions. For instance, [48] employed the Alternating Direction Method of Multipliers (ADMM) with the variable change $X=WH$. The Primal Dual algorithm, based on the framework from Chambolle-Pock, was explored by [56]. Moreover, [16] conducted a comparative study of various optimization algorithms for KL NMF, including MU [29], ADMM [48], Primal Dual [56], and Cyclic Coordinate Descent Method [18]. They also introduced three new algorithms for KL-NMF, namely a Block Mirror Descent method, a Scalar Newton-type algorithm, and a Hybrid SN-MU algorithm.

As mentioned in the introduction, few contributions address the problem of regularized NMF, and many of them focus on the L2 case. [24] demonstrated how many existing works can be cast as Block Coordinate Descent (BCD) problems, allowing the derivation of MU update rules for different regularizers, such as L1 for sparsity. However, their work is limited to L2-NMF. Xu et al. [55] proposed a general optimization scheme for block multiconvex optimization using block coordinate descent, which can accommodate regularization on each block. Although applied to L2-NMF, such an approach may lead to algorithms with sub-iterations. [50] addressed L2-NMF with sparsity constraints. Additionally, [49] introduced a framework for handling L2-NMF with Lipschitz regularizers, akin to our term $s_L$. Other forms of regularization have also been explored, such as graph-based regularization [6] or simplex constraints [17].

Regarding the specific problem of regularized KL loss, [15] provided a notable contribution. However, their work focused solely on the subproblems of the NMF problem, rather than addressing the NMF problem itself. Notably, one could potentially employ a similar approach to solve the subproblems of (3) using BCD. Nonetheless, this would result in a less efficient algorithm with sub-iterations.

3 Preliminaries

3.1 Notation

We reserve bold letters for matrices (uppercase) and vectors (lowercase), e.g., $\bm{A}, \bm{a}$. We use $\bm{a}_i$ to refer to the $i$-th numbered vector, and $a_i$ to denote the $i$-th element of vector $\bm{a}$. The $j$-th element of vector $\bm{a}_i$, or the element at the $i$-th row and $j$-th column of matrix $\bm{A}$, is denoted as $a_{ij}$.

$\bm{A} \geq 0$ and $\bm{a} \geq 0$ indicate that all entries of matrix $\bm{A}$ or vector $\bm{a}$ are greater than or equal to $0$, i.e., $a_{ij} \geq 0$ for all $i, j$. $\bm{A}^\top, \bm{a}^\top$ represent the transposes of $\bm{A}$ and $\bm{a}$, respectively. We use $\bm{A}^t$ to denote $\bm{A}$ at step $t$, and $\bm{A}^{t\top}$ to denote the transpose of $\bm{A}^t$. $\circ$ and $\oslash$ denote elementwise multiplication (also known as the Hadamard product) and division for matrices, respectively. For example, $[\bm{A} \circ \bm{B}]_{ij} = a_{ij} b_{ij}$ and $[\bm{A} \oslash \bm{B}]_{ij} = \frac{a_{ij}}{b_{ij}}$.
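As a small illustration of this notation, the following NumPy sketch contrasts the elementwise operations $\circ$ and $\oslash$ with the ordinary matrix product. The matrices and variable names are arbitrary choices made for this example only.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[2.0, 4.0], [6.0, 8.0]])

hadamard = A * B          # [A ∘ B]_ij = a_ij * b_ij
elementwise_div = A / B   # [A ⊘ B]_ij = a_ij / b_ij
matmul = A @ B            # ordinary matrix product, shown for contrast
```

Note that `A * B` and `A @ B` differ in general, which is why the two products are given distinct symbols.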

As mentioned in the introduction, the matrix to be factorized is generally denoted as $\bm{Y}$, with $\bm{Y} \approx \bm{W}\bm{H}$, where $\bm{W}$ and $\bm{H}$ are its factors. We will use $\bm{w}, \bm{h}$ as the vectorized versions of $\bm{W}, \bm{H}$, while $\bm{w}_i, \bm{h}_i$ will denote the $i$-th column of $\bm{W}, \bm{H}$. $x, \bm{x}$ are general variables that can replace either $\bm{w}$ or $\bm{h}$. We use $\mathcal{L}(\bm{W}, \bm{H})$ as the general loss function, and $f(\bm{x})$ is broadly used to denote a multivariate scalar function.

Finally, we commonly employ calligraphic notation for variable domains. Let $\mathcal{X}$ serve as a generic domain for the variable $\bm{x}$. In practice, it is frequently defined by the constraint $\bm{x} \geq \epsilon$, specifically as $\mathcal{X} = \left\{\bm{x} \in \mathbb{R}^m \mid \bm{x} \geq \epsilon\right\}$. Additionally, we use $\mathcal{C}_w$ and $\mathcal{C}_h$ to represent the domains of $\bm{w}$ and $\bm{h}$, respectively.

3.2 Definitions

In this contribution, we consider the loss function $\mathcal{L}(\bm{w}, \bm{h})$ with two blocks of variables: $\bm{w} \in \mathcal{C}_w$ and $\bm{h} \in \mathcal{C}_h$, where $\mathcal{C}_w \subset \mathbb{R}^{m_w}$ and $\mathcal{C}_h \subset \mathbb{R}^{m_h}$ are both non-empty convex sets. Here, $\bm{w}$ and $\bm{h}$ correspond to the vectorized versions of the two matrices $\bm{W}$ and $\bm{H}$, respectively. Therefore, $\mathcal{C}_w$ and $\mathcal{C}_h$ often correspond to the sets $\bm{w} \geq \epsilon$ and $\bm{h} \geq \epsilon$ with $\epsilon > 0$. Let us use $\bm{z} = [\bm{w}, \bm{h}]$ to denote all the variables. We have $\bm{z} \in \mathcal{C} = \mathcal{C}_w \times \mathcal{C}_h \subset \mathbb{R}^m = \mathbb{R}^{m_w} \times \mathbb{R}^{m_h}$, where the total dimension of the problem is $m = m_w + m_h$.

Definition 1 (Directional derivative).

Let $\mathcal{L}: \mathcal{C} \rightarrow \mathbb{R}$ be a scalar function, where $\mathcal{C} \subset \mathbb{R}^m$ is a convex set. The directional derivative of $\mathcal{L}$ at point $\bm{z}$ in the direction $\bm{d}$ is defined by

$$\mathcal{L}'(\bm{z}; \bm{d}) \coloneqq \lim_{\lambda \downarrow 0} \frac{\mathcal{L}(\bm{z} + \lambda \bm{d}) - \mathcal{L}(\bm{z})}{\lambda}.$$

Note that when $\mathcal{L}$ is differentiable, $\mathcal{L}'(\bm{z}; \bm{d}) = \nabla \mathcal{L}(\bm{z}) \bm{d}^\top$ since $\bm{d}$ and $\bm{z}$ are row vectors. In this contribution, almost all functions are differentiable on the domain of interest.
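The identity between the directional derivative and the gradient can be checked numerically. The sketch below is only an illustration: it uses the toy function $(wh - 2)^2$ discussed later in this section, and the evaluation point, direction, and step size `lam` are arbitrary choices of ours.

```python
import numpy as np

def loss(z):
    """Toy two-variable loss (w * h - 2)^2."""
    w, h = z
    return (w * h - 2.0) ** 2

def grad(z):
    """Analytic gradient of the toy loss."""
    w, h = z
    r = 2.0 * (w * h - 2.0)
    return np.array([r * h, r * w])

def directional_derivative(f, z, d, lam=1e-6):
    """One-sided finite-difference estimate of f'(z; d)."""
    return (f(z + lam * d) - f(z)) / lam

z = np.array([1.0, 3.0])
d = np.array([0.5, -1.0])
numeric = directional_derivative(loss, z, d)   # limit definition, small lam
analytic = grad(z) @ d                         # gradient formula
```

For a differentiable function the two values agree up to the finite-difference error, which shrinks with `lam`.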

Definition 2 (Coordinatewise Minimum).

The point $\bm{z} = [\bm{w}, \bm{h}] \in \mathcal{C}$ is a coordinatewise minimum of a function $\mathcal{L}$ if

$$\mathcal{L}(\bm{w} + \bm{d}_w, \bm{h}) \geq \mathcal{L}(\bm{w}, \bm{h}) \quad \forall \bm{d}_w \in \mathbb{R}^{m_w} \quad \text{with} \quad \bm{w} + \bm{d}_w \in \mathcal{C}_w,$$
$$\mathcal{L}(\bm{w}, \bm{h} + \bm{d}_h) \geq \mathcal{L}(\bm{w}, \bm{h}) \quad \forall \bm{d}_h \in \mathbb{R}^{m_h} \quad \text{with} \quad \bm{h} + \bm{d}_h \in \mathcal{C}_h.$$

A coordinatewise minimum is a natural termination point for an alternating minimization algorithm. However, a coordinatewise minimum is not necessarily a local minimum, as it only guarantees minimality along each block of variables and not in all directions. Figure 1 (a) provides a counterexample illustrating this.
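To make this distinction concrete, the sketch below evaluates a simple two-variable function along each block and along the diagonal. The function $x^2 + y^2 - 4xy$ is a hypothetical choice made for this illustration and is not necessarily the one plotted in Figure 1.

```python
import numpy as np

def f(x, y):
    # Hypothetical illustration, not necessarily the function plotted in Figure 1.
    return x**2 + y**2 - 4.0 * x * y

t = np.linspace(-1.0, 1.0, 201)
along_x = f(t, 0.0)      # vary the first block only: equals t^2
along_y = f(0.0, t)      # vary the second block only: equals t^2
along_diag = f(t, t)     # move both blocks together: equals -2 t^2

origin = f(0.0, 0.0)
is_coordinatewise_min = bool(np.all(along_x >= origin) and np.all(along_y >= origin))
is_min_along_diagonal = bool(np.all(along_diag >= origin))
```

Here the origin minimizes the function along each block separately, yet the function decreases along the diagonal, so the origin is not a local minimum.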

Another significant concept is the notion of a stationary point, where the directional derivative is non-negative in all feasible directions.

Definition 3 (Stationary Points of a function).

Let $\mathcal{L}: \mathcal{C} \rightarrow \mathbb{R}$ be a scalar function, where $\mathcal{C} \subset \mathbb{R}^m$ is a convex set. A point $\bm{z}$ is a stationary point of $\mathcal{L}$ if

$$\mathcal{L}'(\bm{z}; \bm{d}) \geq 0 \quad \forall \bm{d} \mid \bm{z} + \bm{d} \in \mathcal{C}.$$

We emphasize that a stationary point is not equivalent to a strict local minimum, as there might be directions where the directional derivative equals 0. For example, in the simple function $f([\bm{w}, \bm{h}]) = (wh - 2)^2$, the point $[\bm{w}, \bm{h}] = [\sqrt{2}, \sqrt{2}]$ has a zero derivative in the direction $[1, -1]$, which corresponds to rescaling the solution as $[\alpha, 1/\alpha]$. Even worse, a stationary point is not necessarily a local minimum, even if it is a coordinatewise minimum, as shown in Figure 1 (a). Here, in the diagonal directions, the directional derivative equals 0 but the function is concave in this direction.

For the Poisson loss function, there is at least a continuous set of local minima corresponding to rescaling the solution. This is illustrated in Figure 1 (b). Note that the introduction of regularization or constraints can lead to strict local minima, as shown in Figure 1 (c).
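The rescaling ambiguity can be verified numerically. The sketch below assumes the standard Poisson negative log-likelihood $\sum_{ij} \left([\bm{W}\bm{H}]_{ij} - y_{ij} \log [\bm{W}\bm{H}]_{ij}\right)$, written up to a constant in $\bm{Y}$. The function name `poisson_loss`, the matrix sizes, and the scaling factors are our own illustrative choices.

```python
import numpy as np

def poisson_loss(Y, W, H):
    """Poisson negative log-likelihood of Y given W @ H, up to a constant in Y."""
    X = W @ H
    return float(np.sum(X - Y * np.log(X)))

rng = np.random.default_rng(0)
W = rng.uniform(0.5, 2.0, size=(6, 3))
H = rng.uniform(0.5, 2.0, size=(3, 5))
Y = rng.poisson(W @ H).astype(float)

scales = np.array([0.1, 3.0, 7.5])   # one positive factor per component
W_scaled = W * scales                # rescale the columns of W
H_scaled = H / scales[:, None]       # apply the inverse scaling to the rows of H

base = poisson_loss(Y, W, H)
rescaled = poisson_loss(Y, W_scaled, H_scaled)
```

Since the product $\bm{W}\bm{H}$ is unchanged by the rescaling, both loss values coincide although the factors differ, which is why an unregularized minimum is never isolated.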

In this contribution, we prove convergence to a coordinatewise minimum that is also a stationary point. To accomplish this, we will consider a class of functions that are regular at their coordinatewise minima.

Definition 4 (Regularity of a function at a point).

The function $\mathcal{L}: \mathcal{C} \rightarrow \mathbb{R}$ is said to be regular at the point $\bm{z} \in \mathcal{C}$ if $\mathcal{L}'(\bm{z}; \bm{d}) \geq 0$ for all $\bm{d} = [\bm{d}_w, \bm{d}_h] \in \mathbb{R}^m$ such that $\mathcal{L}'(\bm{z}; [\bm{d}_w, \bm{0}]) \geq 0$ and $\mathcal{L}'(\bm{z}; [\bm{0}, \bm{d}_h]) \geq 0$.

Lemma 1.

Continuously differentiable functions are regular at their coordinatewise minima.

The proof is provided in Appendix A.1. Lemma 1 plays a crucial role, as it ensures that the coordinatewise minimum we converge to in Theorem 1 is also a stationary point. In this work, we assume the regularizer to be continuously differentiable on the domain.
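Regularity can fail without differentiability. As a hedged illustration, the sketch below uses the non-smooth function $|w - h| - \tfrac{1}{2}(w + h)$, a hypothetical example of ours that does not appear elsewhere in this report, and estimates its directional derivatives at the origin.

```python
def f(w, h):
    # Hypothetical non-smooth function, chosen only to illustrate Definition 4.
    return abs(w - h) - 0.5 * (w + h)

def directional_derivative(d_w, d_h, lam=1e-6):
    """One-sided finite-difference estimate of f'((0, 0); [d_w, d_h])."""
    return (f(lam * d_w, lam * d_h) - f(0.0, 0.0)) / lam

# Directions that move a single block of variables.
block_w = [directional_derivative(s, 0.0) for s in (1.0, -1.0)]
block_h = [directional_derivative(0.0, s) for s in (1.0, -1.0)]
# A direction that moves both blocks at once.
joint = directional_derivative(1.0, 1.0)
```

All single-block directional derivatives are non-negative, so the origin is a coordinatewise minimum, but the derivative along $[1, 1]$ is negative. This function is therefore not regular at the origin, and the origin is not a stationary point.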

Figure 1: (a) A function where the point $(0, 0)$ is a coordinatewise minimum along the $x$ and $y$ directions but is not a local minimum. (b) The blue line represents the set of all global minima for the one-dimensional case of (2), where all coordinatewise minima are also stationary points. (c) The effect of adding a quadratic regularization term to (2), ensuring the uniqueness of the global minimum.

3.3 Approximation functions

In order to facilitate optimization algorithms, it is beneficial to work with approximation functions that majorize or approximate the objective function at a given point. One commonly used class of approximation functions is known as first-order majorization functions. These functions provide a convenient framework for constructing surrogates and facilitating optimization. We adopt the definition of first-order majorization functions from [41].

Definition 5.

[41, Assumption 1] A function $g(\bm{x}, \bm{x}^t)$ is said to be a first-order majorization of $f$ at the point $\bm{x}^t$ if it satisfies the following properties:

A.1  $g(\bm{x}, \bm{x}^t) \geq f(\bm{x}) \quad \forall \bm{x}, \bm{x}^t \in \mathcal{X}$,
A.2  $g(\bm{x}^t, \bm{x}^t) = f(\bm{x}^t) \quad \forall \bm{x}^t \in \mathcal{X}$,
A.3  $g'(\bm{x}, \bm{x}^t; \bm{d})\big|_{\bm{x} = \bm{x}^t} = f'(\bm{x}^t; \bm{d}) \quad \forall \bm{d} \text{ such that } \bm{x}^t + \bm{d} \in \mathcal{X}$,
A.4  $g(\bm{x}, \bm{x}^t)$ is continuous in $(\bm{x}, \bm{x}^t)$.

It is worth noting that for continuously differentiable functions, the third statement can be equivalently expressed as $\nabla_{\bm{x}} g(\bm{x}^t, \bm{x}^t) = \nabla f(\bm{x}^t)$. The definition of first-order majorization functions resembles the concept of surrogate functions introduced in [37, Definition 2.2]; the additional requirement for a surrogate function is that $g(\bm{x}, \bm{x}^t) - f(\bm{x})$ is $L$-gradient Lipschitz as defined in (5). Importantly, all majorization functions defined in the following Section 4.1 satisfy this condition and can thus also serve as surrogate functions.
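As a minimal sketch of Definition 5, the code below checks properties A.1 to A.3 numerically for the standard quadratic upper bound of a function whose gradient is Lipschitz. The test function (a sum of softplus terms, whose gradient has Lipschitz constant $1/4$), the domain $\mathcal{X} = \mathbb{R}^4$, and all names are assumptions made for this example; they are not the majorizers constructed in Section 4.

```python
import numpy as np

L_SMOOTH = 0.25  # Lipschitz constant of the gradient of the softplus sum below

def f(x):
    return float(np.sum(np.logaddexp(0.0, x)))   # sum_i log(1 + exp(x_i))

def grad_f(x):
    return 1.0 / (1.0 + np.exp(-x))              # elementwise sigmoid

def g(x, x_t):
    """Quadratic upper bound of f built around x_t."""
    diff = x - x_t
    return f(x_t) + grad_f(x_t) @ diff + 0.5 * L_SMOOTH * diff @ diff

def grad_g(x, x_t):
    """Gradient of g with respect to its first argument."""
    return grad_f(x_t) + L_SMOOTH * (x - x_t)

rng = np.random.default_rng(0)
x_t = rng.normal(size=4)
samples = rng.normal(scale=3.0, size=(1000, 4))

a1_holds = all(g(x, x_t) >= f(x) - 1e-12 for x in samples)        # A.1: majorization
a2_gap = abs(g(x_t, x_t) - f(x_t))                                 # A.2: equality at x_t
a3_gap = float(np.max(np.abs(grad_g(x_t, x_t) - grad_f(x_t))))     # A.3: matching gradients
```

Property A.4 holds here because `g` is a composition of continuous functions of both arguments. A random check of this kind cannot prove A.1, but it is a cheap way to catch a wrong constant or sign when designing a majorizer.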

Conveniently, majorization functions can be built term by term, leveraging their additivity property. This property allows us to combine multiple majorization functions to obtain a new majorization function.

Lemma 2.

First-order majorization functions are additive. If $g_1(\bm{x}, \bm{x}^t)$ and $g_2(\bm{x}, \bm{x}^t)$ majorize $f_1(\bm{x})$ and $f_2(\bm{x})$ at $\bm{x}^t$, respectively, then $g_1(\bm{x}, \bm{x}^t) + g_2(\bm{x}, \bm{x}^t)$ majorizes $f(\bm{x}) = f_1(\bm{x}) + f_2(\bm{x})$ at $\bm{x}^t$.

Proof.

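Each property of Definition 5 is preserved under addition: the inequality A.1 and the equality A.2 can be summed term by term, the directional derivative in A.3 is linear in the function, and the sum of two continuous functions is continuous (A.4). ∎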

Lemma 2 provides a valuable tool for constructing majorization functions by combining simpler majorization functions. Additionally, when proving that a function is majorizing, it is often unnecessary to explicitly demonstrate the equality of partial derivatives or gradients at $\bm{x}^t$. Instead, in the case of differentiable functions, it is typically sufficient to establish the first two properties (A.1 and A.2). According to [41, Proposition 1], properties A.3 and A.4 follow as a consequence. Intuitively, one can observe that the continuity of the gradient ensures that the majorization function $g$ shares its tangent space with $f$ at the point $\bm{x}^t$.

3.4 Two Blocks Successive Minimization (TBSUM)

The TBSUM algorithm is designed to solve the following problem:

$$\begin{split} \operatorname*{minimize}_{\bm{h}, \bm{w}} \quad & \mathcal{L}(\bm{w}, \bm{h}) \\ \text{such that} \quad & \bm{h} \in \mathcal{C}_h, \ \bm{w} \in \mathcal{C}_w \end{split} \tag{9}$$

It relies on two first-order majorizing functions: $g_w(\bm{w}, \bm{w}^t, \bm{h}^t)$ and $g_h(\bm{h}, \bm{h}^t, \bm{w}^t)$, which majorize $\mathcal{L}(\bm{w}, \bm{h})$ at $(\bm{w}^t, \bm{h}^t)$ for all $\bm{w}^t \in \mathcal{C}_w$ and $\bm{h}^t \in \mathcal{C}_h$. The construction of these functions will be presented in Section 4. The TBSUM algorithm, outlined in Algorithm 1, alternates between minimizing $g_w$ and $g_h$. It is assumed that the subproblem solutions are unique. Theorem 1 establishes the convergence of the TBSUM algorithm, which is a variant of the algorithm presented in [41, Theorem 2a] adapted for solving the specific problem at hand.

Algorithm 1 TBSUM: Two-Block Successive Minimization Algorithm
1: Initialize the variables to a feasible point $\bm{w}^0 \in \mathcal{C}_w$, $\bm{h}^0 \in \mathcal{C}_h$, and set $t = 0$
2: repeat
3:     $\bm{w}^{t+1} \leftarrow \operatorname*{arg\,min}_{\bm{w} \in \mathcal{C}_w} g_w(\bm{w}, \bm{w}^t, \bm{h}^t)$
4:     $\bm{h}^{t+1} \leftarrow \operatorname*{arg\,min}_{\bm{h} \in \mathcal{C}_h} g_h(\bm{h}, \bm{h}^t, \bm{w}^{t+1})$
5:     $t \leftarrow t + 1$
6: until some convergence criterion is met
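The alternating structure of Algorithm 1 can be sketched as follows. This is an illustration under stated assumptions, not the regularized algorithms developed in this report: as a stand-in for the two arg-min steps we use the classical multiplicative updates for unregularized KL-NMF, which minimize the well-known Jensen-type majorizer of the Poisson loss, clipped to the feasible set $\geq \epsilon$. The value of `EPS`, the stopping rule, and all function names are choices of ours.

```python
import numpy as np

EPS = 1e-8  # lower bound defining the feasible sets C_w and C_h (assumed value)

def poisson_loss(Y, W, H):
    X = W @ H
    return float(np.sum(X - Y * np.log(X)))

def update_w(Y, W, H):
    """Minimizer over W >= EPS of the classical Jensen majorizer (H fixed)."""
    ratio = Y / (W @ H)
    return np.maximum(W * (ratio @ H.T) / H.sum(axis=1), EPS)

def update_h(Y, W, H):
    """Minimizer over H >= EPS of the classical Jensen majorizer (W fixed)."""
    ratio = Y / (W @ H)
    return np.maximum(H * (W.T @ ratio) / W.sum(axis=0)[:, None], EPS)

def tbsum(Y, W, H, n_iter=200, tol=1e-10):
    losses = [poisson_loss(Y, W, H)]
    for _ in range(n_iter):
        W = update_w(Y, W, H)    # line 3 of Algorithm 1
        H = update_h(Y, W, H)    # line 4, using the freshly updated W
        losses.append(poisson_loss(Y, W, H))
        # Stop when the relative decrease of the loss becomes negligible.
        if losses[-2] - losses[-1] <= tol * abs(losses[-2]):
            break
    return W, H, losses

rng = np.random.default_rng(0)
Y = rng.poisson(rng.uniform(0.5, 2.0, (20, 3)) @ rng.uniform(0.5, 2.0, (3, 15))).astype(float)
W0 = rng.uniform(0.5, 2.0, (20, 3))
H0 = rng.uniform(0.5, 2.0, (3, 15))
W, H, losses = tbsum(Y, W0, H0)
```

Because each step minimizes an upper bound that is tight at the current iterate, the recorded losses should be non-increasing, which is a useful sanity check for any implementation of this scheme.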
Theorem 1 (Convergence of TBSUM Algorithm 1).

Let $g_w(\bm{w}, \bm{w}^t, \bm{h}^t)$ and $g_h(\bm{h}, \bm{h}^t, \bm{w}^{t+1})$ be two quasi-convex first-order majorizing functions of $\mathcal{L}(\bm{w}, \bm{h})$ at $(\bm{w}^t, \bm{h}^t)$, for all $(\bm{w}^t, \bm{h}^t) \in \mathcal{C}_w \times \mathcal{C}_h$. Assume furthermore that the two subproblems in the TBSUM Algorithm 1 have unique solutions for any points $\bm{w}^t \in \mathcal{C}_w$, $\bm{h}^t \in \mathcal{C}_h$. Then, every limit point $\bm{z} = [\bm{w}, \bm{h}]$ of the iterates generated by the TBSUM Algorithm 1 is a coordinatewise minimum of (9). In addition, if $\mathcal{L}$ is regular at $\bm{z} \in \mathcal{C}$, then $\bm{z}$ is a stationary point of (9).

4 Subproblem minimization

In this section, we focus on constructing the appropriate majorization functions $g_w(\bm{w}, \bm{w}^t, \bm{h}^t)$ and $g_h(\bm{h}, \bm{h}^t, \bm{w}^{t+1})$ for our problem (3). Since we consider the same type of regularization for $\bm{w}$ and $\bm{h}$, both subfunctions have the same form.

Practically, the loss function can be rewritten as

$$\mathcal{L}_{\bm{Y}}(\bm{W}, \bm{H}) + R(\bm{W}, \bm{H}) = \sum_j f_w(\bm{w}_j) = \sum_i f_h(\bm{h}_i)$$

where $\bm{w}_{j}$ is the $j^{th}$ row of $\bm{W}$ and $\bm{h}_{i}$ is the $i^{th}$ column of $\bm{H}$. The functions $f_{w}$ and $f_{h}$ have the form:

$$
f\left(\bm{x}\right)={\color[rgb]{0.0,0.5,0.0}-\sum_{i=1}^{m}\left(b_{i}\log\left(\bm{a}_{i}^{\top}\bm{x}\right)-\bm{a}_{i}^{\top}\bm{x}\right)}{\color[rgb]{0,0,1}+s_{L}\left(\bm{x}\right)}{\color[rgb]{1,.5,0}+s_{R}\left(\bm{x}\right)}{\color[rgb]{.75,0,.25}+\sum_{j=1}^{n}s_{C}\left(x_{j}\right)}.
\tag{10}
$$

Therefore, in this section, our objective is to find majorization functions for (10). Once this is done, we will provide closed-form solutions for steps 1 and 2 of Algorithm 1. It is worth noting that each term of (10) can be handled separately using the additivity property of majorizing functions (Lemma 2).

4.1 Majorizing functions

The following four lemmas provide majorizing functions for the different terms of our objective function. Proofs are provided in Appendix A.2.

In order to develop an efficient algorithm, our objective is to identify majorizing functions that result in sub-problems with tractable closed-form solutions. Often, this can be accomplished under two conditions: (1) all the majorizing functions are of the same form, and (2) the majorization function is separable with respect to the variables $\bm{x}$, i.e., $g(\bm{x})=\sum_{i}g_{i}(x_{i})$. Within the scope of this contribution, we consider two forms of majorizing functions: quadratic, $g(x)=a+bx+cx^{2}$, and logarithmic, $g(x)=a+bx-c\log(x)$.
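To illustrate why these two forms are convenient, consider the scalar case on $x>0$. The coefficients below are generic placeholders rather than quantities defined in this report, and the sign conditions are assumptions made for this sketch only.

```latex
% Logarithmic form, assuming b > 0 and c > 0:
g(x) = a + b x - c \log(x), \qquad
g'(x) = b - \frac{c}{x} = 0
\;\Longrightarrow\; x^{\star} = \frac{c}{b}.

% Sum of a quadratic and a logarithmic form, assuming c_2 > 0 and c_1 > 0:
g(x) = a + b x + c_2 x^{2} - c_1 \log(x), \qquad
g'(x) = b + 2 c_2 x - \frac{c_1}{x} = 0
\;\Longrightarrow\; 2 c_2 x^{2} + b x - c_1 = 0
\;\Longrightarrow\; x^{\star} = \frac{-b + \sqrt{b^{2} + 8 c_1 c_2}}{4 c_2}.
```

In both cases the minimizer is positive and available in closed form, and separability means the same scalar computation can be applied to each coordinate independently.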

First, we propose a majorization scheme for the logarithmic term $\log\left(\bm{a}_{i}^{\top}\bm{x}\right)$ in the objective function (10). We utilize a widely used majorization technique based on the concavity of the logarithm function. This technique has been employed in the original work by Lee and Seung [29] as well as in many EM (Expectation-Maximization) schemes.

Lemma 3 (Log majorization).

Assuming $\bm{a}\circ\bm{x}>0$ for $\bm{x}\in\mathcal{C}$, let us define $q_{j}=\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}$ for $\bm{x}^{t}\in\mathcal{C}$. Then $g\left(\bm{x},\bm{x}^{t}\right)=-\sum_{j}q_{j}\log\left(\frac{a_{j}x_{j}}{q_{j}}\right)$ is a first-order majorizing function of $f\left(\bm{x}\right)=-\log\left(\bm{a}^{\top}\bm{x}\right)=-\log\left(\sum_{j}a_{j}x_{j}\right)$.
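The inequality in Lemma 3 follows from Jensen's inequality applied to the concave logarithm. The following is a minimal numerical sanity check in plain Python, not part of the original report; the function names, the vector `a`, the reference point `x_t`, and the sampling range are all illustrative assumptions.

```python
import math
import random


def neg_log_linear(a, x):
    """f(x) = -log(a^T x)."""
    return -math.log(sum(a_j * x_j for a_j, x_j in zip(a, x)))


def log_majorizer(a, x, x_t):
    """g(x, x_t) = -sum_j q_j log(a_j x_j / q_j), with q_j built from x_t."""
    total = sum(a_j * xt_j for a_j, xt_j in zip(a, x_t))
    q = [a_j * xt_j / total for a_j, xt_j in zip(a, x_t)]
    return -sum(q_j * math.log(a_j * x_j / q_j) for q_j, a_j, x_j in zip(q, a, x))


# Illustrative (assumed) positive data.
a = [1.0, 2.0, 0.5]
x_t = [0.3, 1.2, 2.0]

# Gap g(x, x_t) - f(x) at random positive points; it should never be negative.
rng = random.Random(0)
gaps = []
for _ in range(1000):
    x = [rng.uniform(0.01, 5.0) for _ in a]
    gaps.append(log_majorizer(a, x, x_t) - neg_log_linear(a, x))

# The majorizer touches f at the reference point x_t.
gap_at_reference = log_majorizer(a, x_t, x_t) - neg_log_linear(a, x_t)
```

The gap is non-negative everywhere and vanishes at $\bm{x}=\bm{x}^{t}$, which is the tangency condition required of a first-order majorizing function.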

We now proceed to majorize the different terms of the regularisation function: $s_{L}\left(\bm{x}\right)$, $s_{R}\left(\bm{x}\right)$, and $s_{C}\left(x_{j}\right)$. We can majorize any gradient-Lipschitz function using the following lemma.

Lemma 4 (Lipschitz-majorization).

Let $s_{L}\left(\bm{x}\right)$ be a gradient-Lipschitz function with constant $\sigma_{L}$ over the domain $\bm{x}\in\mathcal{C}$. Then the functions

$$
g_{1}\left(\bm{x},\bm{x}^{t}\right)=s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\sigma_{L}\left\|\bm{x}-\bm{x}^{t}\right\|_{2}^{2}
\tag{11}
$$

and

$$
g_{2}\left(\bm{x},\bm{x}^{t}\right)=s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+2\sigma_{L}\left(\max_{j}x_{j}^{t}\right)\sum_{j}\left(x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j}}\right)-x_{j}^{t}+x_{j}\right)
\tag{12}
$$

are first-order majorizing functions at $\bm{x}^{t}\in\mathcal{C}$.

We note that (11) (quadratic majorisation) is tighter than (12) (logarithmic majorisation). However, the looser majorisation function is needed to obtain a closed-form solution for the MU (see Section 4).

Next, the term that is relatively smooth can be majorized using the following lemma.

Lemma 5 (Relative smoothness majorization).

Assume that $s_{R}\left(\bm{x}\right)$ is a $\sigma_{R}$-relatively smooth function with respect to $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$ for $\bm{x}\in\mathcal{C}\subset\mathbb{R}_{+}^{n}$. Then the function

$$
g\left(\bm{x},\bm{x}^{t}\right)=s_{R}\left(\bm{x}^{t}\right)+\left\langle \nabla s_{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle +\sigma_{R}\sum_{i=1}^{n}\left(\frac{x_{i}}{x_{i}^{t}}-\log\left(\frac{x_{i}}{x_{i}^{t}}\right)-1\right)
\tag{13}
$$

is a first-order majorizing function of $s_{R}\left(\bm{x}\right)$ for $\bm{x}^{t}\in\mathcal{C}$.
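As a brief sanity check, the last term of (13) can be read as $\sigma_{R}$ times the Bregman divergence generated by $\kappa$. The sketch below assumes the standard definition $D_{\kappa}(\bm{x},\bm{x}^{t})=\kappa(\bm{x})-\kappa(\bm{x}^{t})-\langle\nabla\kappa(\bm{x}^{t}),\bm{x}-\bm{x}^{t}\rangle$; the symbol $D_{\kappa}$ is introduced here for illustration only.

```latex
\begin{aligned}
D_{\kappa}\left(\bm{x},\bm{x}^{t}\right)
 &= -\sum_{i=1}^{n}\log x_{i} + \sum_{i=1}^{n}\log x_{i}^{t}
    + \sum_{i=1}^{n}\frac{x_{i}-x_{i}^{t}}{x_{i}^{t}} \\
 &= \sum_{i=1}^{n}\left(\frac{x_{i}}{x_{i}^{t}}
    - \log\left(\frac{x_{i}}{x_{i}^{t}}\right) - 1\right).
\end{aligned}
```

For instance, with the hypothetical choice $s_{R}=\sigma_{R}\,\kappa$, substituting $\nabla s_{R}(\bm{x}^{t})=-\sigma_{R}/\bm{x}^{t}$ (elementwise) into (13) gives $g(\bm{x},\bm{x}^{t})=s_{R}(\bm{x})$, so in that special case the majorizer is exact.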

Lemma 6 (Concave majorisation).

Let $s\left(x\right)$ be a concave function defined on $x\in\mathcal{C}\subset\mathbb{R}$. Then its linear approximation at the point $x^{t}$,

$$
g\left(x,x^{t}\right)=s\left(x^{t}\right)+\frac{\partial s\left(x^{t}\right)}{\partial x}\left(x-x^{t}\right),
\tag{14}
$$

is a first-order majorizing function for $x^{t}\in\mathcal{C}$.
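Lemma 6 states that the tangent line of a concave function lies above the function. A minimal numerical illustration follows, using the concave penalty $s_{C}(x)=\log(1+x)$ as an assumed example; this penalty, the function names, the reference point, and the grid are hypothetical choices and are not taken from this report.

```python
import math


def s_C(x):
    """Illustrative concave penalty (an assumption): log(1 + x)."""
    return math.log(1.0 + x)


def ds_C(x):
    """Derivative of the illustrative penalty."""
    return 1.0 / (1.0 + x)


def tangent_majorizer(x, x_t):
    """Linear approximation of s_C at x_t, as in (14)."""
    return s_C(x_t) + ds_C(x_t) * (x - x_t)


x_t = 0.7
grid = [0.01 * k for k in range(1001)]  # points in [0, 10]

# Gap between the tangent line and the function; it should be non-negative.
gaps = [tangent_majorizer(x, x_t) - s_C(x) for x in grid]
gap_at_reference = tangent_majorizer(x_t, x_t) - s_C(x_t)
```

Because the majorizer is linear in $x$, a concave term only contributes its derivative at $x^{t}$ to the resulting update.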

4.2 Subproblem updates

Now that we have defined majorizing functions for each term of (10), we can apply the additivity property of Lemma 2 to obtain a general majorizing function for $f(\bm{x})$:

$$
\begin{aligned}
g\left(\bm{x},\bm{x}^{t}\right)={}&{\color[rgb]{0.0,0.5,0.0}-\sum_{i=1}^{m}\left(b_{i}\sum_{j}q_{ij}\log\left(\frac{a_{ij}x_{j}}{q_{ij}}\right)-\bm{a}_{i}^{\top}\bm{x}\right)}\\
&{\color[rgb]{0,0,1}+s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+2\sigma_{L}\left(\max_{j}x_{j}^{t}\right)\sum_{j}\left(x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j}}\right)-x_{j}^{t}+x_{j}\right)}\\
&{\color[rgb]{1,.5,0}+s_{R}\left(\bm{x}^{t}\right)+\left\langle \nabla s_{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle +\sigma_{R}\sum_{j}\left(\frac{x_{j}}{x_{j}^{t}}-\log\left(\frac{x_{j}}{x_{j}^{t}}\right)-1\right)}\\
&{\color[rgb]{.75,0,.25}+\sum_{j=1}^{n}\left(s_{C}\left(x_{j}^{t}\right)+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x_{j}}\left(x_{j}-x_{j}^{t}\right)\right)}
\end{aligned}
\tag{15}
$$

where $q_{ij}=\frac{a_{ij}x_{j}^{t}}{\sum_{k}a_{ik}x_{k}^{t}}$. We use the colors green, blue, orange, and purple to denote and keep track of the dependencies of the different terms in (10). Finding the local optimum of the majorizing function will provide us with an update for Algorithm 1.

Proposition 1 (Generalized MU for (10)).

Assuming $\bm{x}^{t},\bm{x},\bm{b},\bm{A}>0$, the first-order majorizing function defined in (15) is strictly convex, and its global minimum $\bm{x}^{t+1}$ is given by

$$
x_{j}^{t+1}=x_{j}^{t}\frac{\alpha_{j}^{t}}{\beta_{j}^{t}}
\tag{16}
$$

where

$$
\alpha_{j}^{t}={\color[rgb]{0.0,0.5,0.0}\sum_{i}b_{i}\frac{a_{ij}}{\sum_{k}a_{ik}x_{k}^{t}}}{\color[rgb]{0,0,1}+2\left(\max_{i}x_{i}^{t}\right)\sigma_{L}}{\color[rgb]{1,.5,0}+\frac{\sigma_{R}}{x_{j}^{t}}}\quad\text{and}
\tag{17}
$$

$$
\beta_{j}^{t}={\color[rgb]{0.0,0.5,0.0}\sum_{i}a_{ij}}{\color[rgb]{0,0,1}+\nabla_{x_{j}}s_{L}\left(\bm{x}^{t}\right)+2\left(\max_{i}x_{i}^{t}\right)\sigma_{L}}{\color[rgb]{1,.5,0}+\nabla_{x_{j}}s_{R}\left(\bm{x}^{t}\right)+\frac{\sigma_{R}}{x_{j}^{t}}}{\color[rgb]{.75,0,.25}+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x}}.
\tag{18}
$$

The proof is provided in Appendix A.3.
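The update (16)-(18) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `mu_step` and `objective`, the toy data `A` and `b`, and the log-barrier regularizer $s_{R}(\bm{x})=-\lambda\sum_{j}\log x_{j}$ (relatively smooth with $\sigma_{R}=\lambda$) are all hypothetical choices. The data term is taken to be the Poisson negative log-likelihood $\sum_{i}\left(\bm{a}_{i}^{\top}\bm{x}-b_{i}\log(\bm{a}_{i}^{\top}\bm{x})\right)$, and the sketch assumes that $\beta_{j}^{t}$ stays positive.

```python
import math


def objective(A, b, x, lam):
    """Poisson data term plus the assumed log-barrier regularizer."""
    data = 0.0
    for a_i, b_i in zip(A, b):
        ax = sum(a_ik * x_k for a_ik, x_k in zip(a_i, x))
        data += ax - b_i * math.log(ax)
    return data - lam * sum(math.log(x_j) for x_j in x)


def mu_step(A, b, x, sigma_L=0.0, grad_sL=None,
            sigma_R=0.0, grad_sR=None, grad_sC=None):
    """One generalized multiplicative update x_j <- x_j * alpha_j / beta_j."""
    n = len(x)
    zeros = [0.0] * n
    g_L = grad_sL(x) if grad_sL else zeros
    g_R = grad_sR(x) if grad_sR else zeros
    g_C = grad_sC(x) if grad_sC else zeros
    Ax = [sum(a_ik * x_k for a_ik, x_k in zip(a_i, x)) for a_i in A]
    x_max = max(x)
    x_new = []
    for j in range(n):
        # Data contributions to the numerator and denominator.
        data_num = sum(b_i * a_i[j] / ax_i for a_i, b_i, ax_i in zip(A, b, Ax))
        data_den = sum(a_i[j] for a_i in A)
        alpha = data_num + 2.0 * x_max * sigma_L + sigma_R / x[j]
        beta = (data_den + g_L[j] + 2.0 * x_max * sigma_L
                + g_R[j] + sigma_R / x[j] + g_C[j])
        x_new.append(x[j] * alpha / beta)
    return x_new


# Toy problem (assumed values).
A = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
b = [2.0, 1.0, 3.0]
lam = 0.1

x = [1.0, 1.0]
history = [objective(A, b, x, lam)]
for _ in range(20):
    x = mu_step(A, b, x, sigma_R=lam,
                grad_sR=lambda v: [-lam / v_j for v_j in v])
    history.append(objective(A, b, x, lam))
```

With all regularization arguments left at their defaults, `mu_step` reduces to the unregularized multiplicative rule. In this toy setting the recorded objective values are non-increasing, as expected from a majorization-minimization step.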

Generalization of the traditional MU Rule

We note that (16) serves as a generalization of the original Multiplicative Update (MU) rule presented in [29]. Removing the regularization terms (blue, orange, and purple) results in precisely the MU rule as outlined in [29].

Connection with (Block) Mirror Descent [16, Algorithm 1]

Another interesting observation is that the majorization of the relatively smooth term is done similarly to a Bregman proximal method algorithm [14]. Since the objective function $-\sum_{i=1}^{m}\left(b_{i}\log\left(\bm{a}_{i}^{\top}\bm{x}\right)-\bm{a}_{i}^{\top}\bm{x}\right)$ is relatively smooth, one could drop all terms except for $s_{R}$ and optimize using Block Bregman Proximal Gradient (BBPG) [50]. This would result in an algorithm very similar to Block Mirror Descent (BMD), which has recently been proposed for solving Poisson NMF [16]. Nevertheless, we advise against this solution, as discussed further in Section 5.2.

Alternative majorizing function and Quadratic Update (QU)

As shown experimentally in Section 6 and illustrated in Figure 2, having a majorization function as tight as possible leads to faster convergence of the algorithm. In Proposition 1, we deliberately chose to use a looser majorizing function for the term $s_{L}$ in order to recover an algorithm with a multiplicative update that generalizes the original approach from [29]. However, instead of using (12), one can also use (11) when constructing the majorizing function:

$$
\begin{aligned}
g\left(\bm{x},\bm{x}^{t}\right)={}&{\color[rgb]{0.0,0.5,0.0}-\sum_{i=1}^{m}\left(b_{i}\sum_{j}q_{ij}\log\left(\frac{a_{ij}x_{j}}{q_{ij}}\right)-\bm{a}_{i}^{\top}\bm{x}\right)}\\
&{\color[rgb]{0,0,1}+s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\sigma_{L}\left\|\bm{x}-\bm{x}^{t}\right\|_{2}^{2}}\\
&{\color[rgb]{1,.5,0}+s_{R}\left(\bm{x}^{t}\right)+\left\langle \nabla s_{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle +\sigma_{R}\sum_{j}\left(\frac{x_{j}}{x_{j}^{t}}-\log\left(\frac{x_{j}}{x_{j}^{t}}\right)-1\right)}\\
&{\color[rgb]{.75,0,.25}+\sum_{j=1}^{n}\left(s_{C}\left(x_{j}^{t}\right)+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x_{j}}\left(x_{j}-x_{j}^{t}\right)\right)}
\end{aligned}
\tag{19}
$$

which is also a strictly convex function.
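Because (19) is separable and combines a quadratic with a logarithmic term in each coordinate, setting $\partial g/\partial x_{j}=0$ and multiplying by $x_{j}$ yields a scalar quadratic equation whose positive root is the coordinate update. The sketch below illustrates this in plain Python under stated assumptions; it is not the authors' implementation. The names `qu_step` and `objective`, the toy data, and the ridge regularizer $s_{L}(\bm{x})=\frac{\lambda}{2}\|\bm{x}\|_{2}^{2}$ (gradient-Lipschitz with $\sigma_{L}=\lambda$) are hypothetical, the data term is taken to be $\sum_{i}\left(\bm{a}_{i}^{\top}\bm{x}-b_{i}\log(\bm{a}_{i}^{\top}\bm{x})\right)$, and $\sigma_{L}>0$ is required.

```python
import math


def objective(A, b, x, lam):
    """Poisson data term plus the assumed ridge regularizer (lam / 2) * ||x||^2."""
    data = 0.0
    for a_i, b_i in zip(A, b):
        ax = sum(a_ik * x_k for a_ik, x_k in zip(a_i, x))
        data += ax - b_i * math.log(ax)
    return data + 0.5 * lam * sum(x_j * x_j for x_j in x)


def qu_step(A, b, x, sigma_L, grad_sL,
            sigma_R=0.0, grad_sR=None, grad_sC=None):
    """One quadratic update: positive root of alpha*x^2 + beta*x - zeta = 0."""
    n = len(x)
    zeros = [0.0] * n
    g_L = grad_sL(x)
    g_R = grad_sR(x) if grad_sR else zeros
    g_C = grad_sC(x) if grad_sC else zeros
    Ax = [sum(a_ik * x_k for a_ik, x_k in zip(a_i, x)) for a_i in A]
    alpha = 2.0 * sigma_L  # coefficient of x_j^2; must be positive
    x_new = []
    for j in range(n):
        beta = (sum(a_i[j] for a_i in A) + g_L[j] - 2.0 * sigma_L * x[j]
                + g_R[j] + sigma_R / x[j] + g_C[j])
        zeta = x[j] * sum(b_i * a_i[j] / ax_i
                          for a_i, b_i, ax_i in zip(A, b, Ax)) + sigma_R
        root = math.sqrt(beta * beta + 4.0 * alpha * zeta)
        x_new.append((-beta + root) / (2.0 * alpha))
    return x_new


# Toy problem (assumed values).
A = [[1.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
b = [2.0, 1.0, 3.0]
lam = 0.5

x = [1.0, 1.0]
history = [objective(A, b, x, lam)]
for _ in range(20):
    x = qu_step(A, b, x, sigma_L=lam,
                grad_sL=lambda v: [lam * v_j for v_j in v])
    history.append(objective(A, b, x, lam))
```

Since the constant term of the quadratic is negative whenever the data are positive, the selected root is positive, so the iterates remain in the positive orthant without an explicit projection.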

Proposition 2 (QU for (10)).

Assuming $\bm{x}^t, \bm{x}, \bm{b}, \bm{A} > 0$, the first-order majorizing function defined in Equation (19) is strictly convex, and its global minimum $\bm{x}^{t+1}$ is given by

\[
x_j^{t+1} = \frac{-\beta_j^t + \sqrt{\left(\beta_j^t\right)^2 + 4\alpha\zeta_j^t}}{2\alpha} \tag{20}
\]

where

\[
\begin{aligned}
\alpha &= 2\sigma_L, \\
\beta_j^t &= \sum_i a_{ij} + \nabla_{x_j} s_L\left(\bm{x}^t\right) - 2\sigma_L x_j^t + \nabla_{x_j} s_R\left(\bm{x}^t\right) + \frac{\sigma_R}{x_j^t} + \frac{\partial s_C\left(x_j^t\right)}{\partial x_j}, \\
\zeta_j^t &= \sum_i b_i \frac{a_{ij} x_j^t}{\sum_k a_{ik} x_k^t} + \sigma_R.
\end{aligned} \tag{21}
\]

The proof is provided in Appendix A.3. Together, these two propositions yield the update rules for our MU and QU algorithms detailed in Section 5. We also note that, under the appropriate assumptions, the update rules (16) and (20) preserve the positivity of the variable $\bm{x}$. However, since we also want to handle additional constraints, we develop a more rigorous approach in the next section.
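As an illustration of the quadratic update (20) with the coefficients (21), the following minimal sketch applies it to a toy problem. It assumes a single ridge regularizer $s_L(\bm{x}) = \frac{\lambda}{2}\|\bm{x}\|^2$ with $\sigma_L = \lambda/2$ (which matches $\alpha = 2\sigma_L$ in (21)) and $s_R = s_C = 0$; the function names `objective` and `qu_step`, the random data, and the value of `lam` are hypothetical choices made for this example only.

```python
import numpy as np

def objective(A, b, x, lam):
    """f(x) = sum_i [(Ax)_i - b_i log (Ax)_i] + (lam / 2) * ||x||^2."""
    Ax = A @ x
    return np.sum(Ax - b * np.log(Ax)) + 0.5 * lam * np.sum(x**2)

def qu_step(A, b, x, lam):
    """One quadratic update (20) using the coefficients of (21)."""
    sigma_L = 0.5 * lam      # assumption: ridge term majorized by sigma_L * ||x - x^t||^2
    grad_sL = lam * x        # gradient of the ridge term at x^t
    alpha = 2.0 * sigma_L
    # s_R = s_C = 0 in this sketch, so their contributions to beta and zeta vanish.
    beta = A.sum(axis=0) + grad_sL - 2.0 * sigma_L * x
    zeta = x * (A.T @ (b / (A @ x)))
    return (-beta + np.sqrt(beta**2 + 4.0 * alpha * zeta)) / (2.0 * alpha)

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(20, 4))   # strictly positive A
b = rng.uniform(0.5, 5.0, size=20)        # strictly positive b
x = np.ones(4)                            # strictly positive starting point
lam = 0.5

values = [objective(A, b, x, lam)]
for _ in range(50):
    x = qu_step(A, b, x, lam)
    values.append(objective(A, b, x, lam))
```

Because each step minimizes a majorizing function that touches the objective at the current iterate, the recorded `values` should be non-increasing, and the iterates should stay strictly positive as long as $\zeta_j^t > 0$.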

4.3 Generalized simplex constraint

We need to handle two constraints: 1. the linear constraint $\bm{x} \geq \epsilon$, and 2. the scale constraint $\bm{e}^\top\bm{x} = 1$, where $\bm{e} \geq 0$. The first keeps the variable non-negative, typically with a small, strictly positive $\epsilon$, while the second sets the scale of one of the variables ($\bm{W}$ or $\bm{H}$) in the factorization problem. Furthermore, in the case $\bm{e} = \bm{1}$, the simplex constraint is recovered. It turns out that the update rules (16) and (20) can be adapted to handle these constraints with only minor changes. The optimization problem we actually want to solve becomes:

\[
\operatorname*{minimize}_{\bm{x} \geq \epsilon} \; f\left(\bm{x}\right) \quad \text{such that} \quad \bm{e}^\top\bm{x} = 1,
\]

where $f$ is given in (10).

To solve this problem, we use the KKT approach, i.e., we look for points that satisfy the Karush-Kuhn-Tucker (KKT) conditions:

1. Stationarity: $\nabla_{\bm{x}} L\left(\dot{\bm{x}}, \nu, \bm{\mu}\right) = \bm{0}$,
2. Primal feasibility: $\bm{e}^\top\dot{\bm{x}} - 1 = 0$ and $\dot{\bm{x}} \geq \epsilon$,
3. Dual feasibility: $\bm{\mu} \geq \bm{0}$,
4. Complementary slackness: $\bm{\mu}^\top\left(-\dot{\bm{x}} + \epsilon\bm{1}\right) = 0$,

where the Lagrangian is defined as:

\[
L\left(\bm{x}, \nu, \bm{\mu}\right) = f\left(\bm{x}\right) + \nu\left(\bm{e}^\top\bm{x} - 1\right) + \bm{\mu}^\top\left(-\bm{x} + \epsilon\bm{1}\right).
\]

We follow the same method as developed in Section 4.2, except that we majorize the Lagrangian $L\left(\bm{x}, \nu, \bm{\mu}\right)$. The resulting first-order majorizing function is given by:

\[
g'\left(\bm{x}, \bm{x}^t, \nu, \bm{\mu}\right) = g\left(\bm{x}, \bm{x}^t\right) + \nu\left(\bm{e}^\top\bm{x} - 1\right) + \bm{\mu}^\top\left(-\bm{x} + \epsilon\bm{1}\right)
\]

where $g\left(\bm{x}, \bm{x}^t\right)$ is given in (15) or (19). Repeating the development of Section 4.2 (and the proofs of Appendix A.3), we end up with an update that is very similar to (16) or (20). In the MU case, we obtain:

\[
x_j^{t+1} = x_j^t \frac{\alpha_j}{\beta_j + \nu e_j - \mu_j},
\]

where the only difference is the two additional terms $\nu e_j$ and $-\mu_j$ in the denominator ($\alpha_j$ and $\beta_j$ remain identical). For the QU, we keep the same update rule (20), where only $\beta_j^t$ is replaced by:

\[
\beta_j^{t\prime} = \beta_j^t + \nu e_j - \mu_j.
\]

This update rule ensures the first of the KKT conditions (stationarity). We now find $\nu, \bm{\mu}$ such that the second KKT condition (primal feasibility) holds. It turns out that $\bm{\mu}$ does not need to be computed explicitly. In the MU case, $\mu_j$ is selected to be large enough such that

\[
x_j^{t+1} = \max\left(x_j^t \frac{\alpha_j}{\beta_j + \nu e_j}, \epsilon\right) = \frac{x_j^t \alpha_j}{\min\left(\frac{x_j^t \alpha_j}{\epsilon}, \beta_j + \nu e_j\right)}. \tag{22}
\]

In the QU case, we obtain

\[
\begin{aligned}
x_j^{t+1} &= \max\left(\frac{-\left(\beta_j^t + \nu e_j\right) + \sqrt{\left(\beta_j^t + \nu e_j\right)^2 + 4\alpha\zeta_j^t}}{2\alpha}, \epsilon\right) \\
&= \frac{-\min\left(\frac{\zeta_j^t}{\epsilon} - \epsilon\alpha, \beta_j^t + \nu e_j\right) + \sqrt{\left(\min\left(\frac{\zeta_j^t}{\epsilon} - \epsilon\alpha, \beta_j^t + \nu e_j\right)\right)^2 + 4\alpha\zeta_j^t}}{2\alpha}
\end{aligned} \tag{23}
\]
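The equivalence between the $\max$ and $\min$ forms of the QU update can be checked with a short argument, sketched here under the assumptions $\alpha > 0$ and $\zeta_j^t > 0$. The function $r$ is notation introduced only for this remark.

```latex
\[
r(\beta) = \frac{-\beta + \sqrt{\beta^2 + 4\alpha\zeta_j^t}}{2\alpha},
\qquad
r'(\beta) = \frac{1}{2\alpha}\left(\frac{\beta}{\sqrt{\beta^2 + 4\alpha\zeta_j^t}} - 1\right) < 0 .
\]
% r(beta) is the positive root of alpha x^2 + beta x - zeta_j^t = 0, hence
\[
r(\beta) = \epsilon
\;\Longleftrightarrow\;
\alpha\epsilon^2 + \beta\epsilon - \zeta_j^t = 0
\;\Longleftrightarrow\;
\beta = \frac{\zeta_j^t}{\epsilon} - \epsilon\alpha .
\]
% Since r is strictly decreasing, clipping the value at epsilon is the same
% as capping the argument at the threshold:
\[
\max\left(r(\beta), \epsilon\right) = r\left(\min\left(\frac{\zeta_j^t}{\epsilon} - \epsilon\alpha, \beta\right)\right).
\]
```

The same reasoning applies to the MU case: for a positive denominator, $x_j^t\alpha_j/\beta$ is decreasing in $\beta$ and equals $\epsilon$ at $\beta = x_j^t\alpha_j/\epsilon$, which gives the threshold appearing in (22).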

Note that dual feasibility and complementary slackness could be verified, but we leave them out for simplicity. We then need to find the value of $\nu$ such that $\bm{e}^\top\bm{x} = 1$, which is equivalent to searching for

\[
h_1\left(\nu\right) = \sum_j e_j \frac{x_j^t \alpha_j}{\min\left(\frac{x_j^t \alpha_j}{\epsilon}, \beta_j^t + \nu e_j\right)} - 1 = 0 \tag{24}
\]

Similarly, for the quadratic update of (20), we search for $\nu$ that satisfies

\[
h_2\left(\nu\right) = \sum_j e_j \frac{-\min\left(\frac{\zeta_j^t}{\epsilon} - \epsilon\alpha, \beta_j^t + \nu e_j\right) + \sqrt{\left(\min\left(\frac{\zeta_j^t}{\epsilon} - \epsilon\alpha, \beta_j^t + \nu e_j\right)\right)^2 + 4\alpha\zeta_j^t}}{2\alpha} - 1 = 0. \tag{25}
\]

There is no closed-form solution for $\nu$; however, its value can be found with a simple dichotomy (bisection) search. Bounds for initializing the dichotomy are computed in Appendix B.
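The dichotomy on $h_1$ can be sketched as follows for the MU case. This is a minimal illustration, not the procedure of Appendix B: the bracket used here (a lower end at the point where a denominator of (22) vanishes, and an upper end found by doubling) is an assumption made for this example, as are the function names `mu_update` and `find_nu` and the toy values. The sketch further assumes $x_j^t\alpha_j > 0$, $\beta_j > 0$ whenever $e_j = 0$, and $\epsilon \sum_j e_j < 1$, so that a root exists.

```python
import numpy as np

def mu_update(x, alpha, beta, e, nu, eps):
    """Clipped multiplicative update (22) for a given dual variable nu."""
    return np.maximum(x * alpha / (beta + nu * e), eps)

def find_nu(x, alpha, beta, e, eps, tol=1e-14, max_iter=200):
    """Bisection on h_1 of (24), which is non-increasing in nu."""
    def h1(nu):
        return e @ mu_update(x, alpha, beta, e, nu, eps) - 1.0

    active = e > 0
    # Assumed bracket: just above `lo` all denominators are positive and h_1 is large.
    lo = np.max(-beta[active] / e[active])
    width = 1.0
    while h1(lo + width) > 0:      # enlarge until h_1 becomes non-positive
        width *= 2.0
    hi = lo + width
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if h1(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo <= tol * max(1.0, abs(hi)):
            break
    return hi

# Toy values (hypothetical): current iterate and MU coefficients.
x = np.array([0.2, 0.3, 0.5])
alpha = np.array([1.0, 2.0, 0.5])
beta = np.array([1.0, 1.0, 1.0])
e = np.array([1.0, 2.0, 0.5])
eps = 1e-6

nu = find_nu(x, alpha, beta, e, eps)
x_new = mu_update(x, alpha, beta, e, nu, eps)
```

After the search, `x_new` should satisfy $\bm{e}^\top\bm{x} = 1$ up to the bisection tolerance while remaining $\geq \epsilon$. The QU case works the same way with $h_2$ of (25) in place of $h_1$.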

Case $\epsilon = 0$

Most of our reasoning relies on the fact that $x_i > 0$ and, therefore, on the domain constraint $\epsilon > 0$. We have found that setting $\epsilon$ to a small non-zero value works well in practice. However, our approach can likely be generalized to the case where $\epsilon = 0$, following the approach of [33, Section 4], which studies the unregularized Poisson NMF case.

5 Algorithms for Poisson matrix factorisation

Equipped with the update rules developed in the previous Sections 4 and 4.3, we are ready to tackle the general problem of this contribution (shown here with the linear constraint on $\bm{H}$; by symmetry, a similar algorithm can be developed with the constraint on $\bm{W}$), which consists of minimizing (2):

\[
\begin{aligned}
\dot{\bm{W}}, \dot{\bm{H}} = \operatorname*{arg\,min}_{\bm{W}, \bm{H}} \quad & -\left\langle \bm{Y}, \log\left(\bm{W}\bm{H}\right)\right\rangle + \left\langle \mathbf{1}, \bm{W}\bm{H}\right\rangle + R_W\left(\bm{W}\right) + R_H\left(\bm{H}\right) \\
\text{such that} \quad & \bm{W} \geq \epsilon, \; \bm{H} \geq \epsilon, \; \bm{e}^\top\bm{H} = \bm{1}
\end{aligned} \tag{26}
\]

We first observe that with respect to each variable $\bm{W}, \bm{H}$, the problem is separable by column/row. For example, given $\bm{W}$, finding the optimal $\bm{H}$ can be done for each column:

\[
\dot{\bm{h}}_i = \operatorname*{arg\,min}_{\bm{h}_i} \; -\left\langle \bm{y}_i, \log\left(\bm{W}\bm{h}_i\right)\right\rangle + \left\langle \mathbf{1}, \bm{W}\bm{h}_i\right\rangle + r_H\left(\bm{h}_i\right) \quad \text{such that} \quad \bm{h}_i \geq \epsilon, \; \bm{e}^\top\bm{h}_i = 1
\]

We therefore apply the TBSUM Algorithm 1, where all rows of $\bm{W}$ and all columns of $\bm{H}$ are updated independently, and obtain Algorithms 2 and 3. Note that the function $\max(\cdot, \epsilon)$ ensures that the solutions are $\geq \epsilon$ (non-negativity).

Algorithm 2 MU Algorithm for Regularized Poisson NMF
1: Initialize the variables $\bm{W}^0 \geq \epsilon$, $\bm{H}^0 \geq \epsilon$ such that $\bm{e}^\top\bm{H}^0 = \bm{1}$, and set $t = 0$.
2: while the convergence criterion is not met do
3:     for each row $\bm{w}_i^{t\top}$ of $\bm{W}^t$ do
4:         Compute $\bm{\alpha}_i^{t\top}$ and $\bm{\beta}_i^{t\top}$ using (17) and (18).
5:         Update using the MU rule (22): $w_{ij}^{t+1} \leftarrow \max\left(w_{ij}^t \frac{\alpha_{ij}^t}{\beta_{ij}^t}, \epsilon\right)$.
6:     end for
7:     for each column $\bm{h}_i^t$ of $\bm{H}^t$ do
8:         Compute $\bm{\alpha}_i^t$ and $\bm{\beta}_i^t$ using (17) and (18).
9:         Find the dual variable $\nu_i$ by dichotomy on the function (24) (set $\nu_i = 0$ if no constraint is present).
10:         Update using the MU rule (22): $h_{ij}^{t+1} \leftarrow \max\left(h_{ij}^t \frac{\alpha_{ij}^t}{\beta_{ij}^t + \nu_i e_j}, \epsilon\right)$.
11:     end for
12:     $t \leftarrow t + 1$.
13: end while
Algorithm 3 QU Algorithm for Regularized Poisson NMF
1: Initialize the variables $\bm{W}^0 \geq \epsilon$, $\bm{H}^0 \geq \epsilon$ such that $\bm{e}^\top\bm{H}^0 = \bm{1}$, and set $t = 0$.
2: while the convergence criterion is not met do
3:     for each row $\bm{w}_i^{t\top}$ of $\bm{W}^t$ do
4:         Compute $\alpha^t$, $\bm{\beta}_i^{t\top}$, and $\bm{\zeta}_i^{t\top}$ using (21).
5:         Update using the QU rule (23): $w_{ij}^{t+1} \leftarrow \max\left(\frac{-\beta_{ij}^t + \sqrt{\left(\beta_{ij}^t\right)^2 + 4\alpha^t\zeta_{ij}^t}}{2\alpha^t}, \epsilon\right)$.
6:     end for
7:     for each column $\bm{h}_i^t$ of $\bm{H}^t$ do
8:         Compute $\alpha^t$, $\bm{\beta}_i^t$, and $\bm{\zeta}_i^t$ using (21).
9:         Find the dual variable $\nu_i$ by dichotomy on the function (25) (set $\nu_i = 0$ if no constraint is present).
10:         Update using the QU rule (23): $h_{ij}^{t+1} \leftarrow \max\left(\frac{-\left(\beta_{ij}^t + \nu_i e_j\right) + \sqrt{\left(\beta_{ij}^t + \nu_i e_j\right)^2 + 4\alpha^t\zeta_{ij}^t}}{2\alpha^t}, \epsilon\right)$.
11:     end for
12:     $t \leftarrow t + 1$.
13: end while
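To make the structure of the alternating updates concrete, here is a minimal vectorized sketch of the QU iteration in a simplified setting. It is not the full Algorithm 3: it assumes no scale constraint (so $\nu_i = 0$, as allowed in step 9) and a ridge regularizer $\frac{\lambda}{2}\left(\|\bm{W}\|_F^2 + \|\bm{H}\|_F^2\right)$ with $\sigma_L = \lambda/2$, for which (21) gives $\alpha = \lambda$ and the regularizer terms in $\beta$ cancel. The names `objective`, `qu_root`, and `qu_poisson_nmf`, the random initialization, and the relative-decrease stopping rule are hypothetical choices for this example.

```python
import numpy as np

def objective(Y, W, H, lam):
    """Poisson loss (up to constants) plus the assumed ridge regularizer."""
    WH = W @ H
    return np.sum(WH - Y * np.log(WH)) + 0.5 * lam * (np.sum(W**2) + np.sum(H**2))

def qu_root(alpha, beta, zeta, eps):
    """Quadratic update (23) with nu = 0, clipped at eps."""
    root = (-beta + np.sqrt(beta**2 + 4.0 * alpha * zeta)) / (2.0 * alpha)
    return np.maximum(root, eps)

def qu_poisson_nmf(Y, k, lam=0.1, eps=1e-8, max_iter=200, rtol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    W = rng.uniform(0.5, 1.5, size=(n, k))   # strictly positive initialization
    H = rng.uniform(0.5, 1.5, size=(k, m))
    alpha = lam                               # alpha = 2 * sigma_L with sigma_L = lam / 2
    history = [objective(Y, W, H, lam)]
    for _ in range(max_iter):
        # Rows of W: beta is the vector of row sums of H, shared by all rows.
        beta_W = H.sum(axis=1)[None, :]
        zeta_W = W * ((Y / (W @ H)) @ H.T)
        W = qu_root(alpha, beta_W, zeta_W, eps)
        # Columns of H: beta is the vector of column sums of W, shared by all columns.
        beta_H = W.sum(axis=0)[:, None]
        zeta_H = H * (W.T @ (Y / (W @ H)))
        H = qu_root(alpha, beta_H, zeta_H, eps)
        history.append(objective(Y, W, H, lam))
        if history[-2] - history[-1] <= rtol * abs(history[-2]):
            break
    return W, H, history

# Synthetic Poisson data (hypothetical sizes).
rng = np.random.default_rng(1)
Y = rng.poisson(rng.uniform(0.5, 2.0, (30, 3)) @ rng.uniform(0.5, 2.0, (3, 20))).astype(float)
W, H, history = qu_poisson_nmf(Y, k=3)
```

Each block update minimizes a separable majorizer subject to the bound $\geq \epsilon$, so the recorded `history` should be non-increasing. Adding the scale constraint would require a per-column search for $\nu_i$ before updating $\bm{H}$.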

Convergence

The two update rules in steps 5 and 10 correspond to minimizing first-order strictly convex majorization functions. As a result, we can apply Theorem 1 to guarantee convergence towards a coordinate-wise minimum. It is important to note that this coordinate-wise minimum is also a stationary point, given that the objective function remains regular for any point $\bm{W} \in \mathcal{C}_W, \bm{H} \in \mathcal{C}_H$.

5.1 Algorithm complexity

Let’s examine the complexity of both Algorithms 2 and 3 when considering $\bm{W}\in\mathbb{R}^{n\times k}$ and $\bm{H}\in\mathbb{R}^{k\times m}$. In each iteration, the following complexities are observed:
(a) Step 4 has a complexity of $\mathcal{O}\left(nmk\right)$.
(b) Step 5 has a complexity of $\mathcal{O}\left(nk\right)$.
(c) Step 8 has a complexity of $\mathcal{O}\left(nmk\right)$.
(d) Step 9 has a complexity of $\mathcal{O}\left(c_{d}km\right)$, where $c_{d}$ denotes the number of iterations performed by the dichotomy.
(e) Step 10 has a complexity of $\mathcal{O}\left(mk\right)$.
Thus, the overall complexity per iteration can be expressed as $\mathcal{O}\left(nmk\right)+\mathcal{O}\left(c_{d}km\right)=\mathcal{O}\left(\left(n+c_{d}\right)mk\right)$. This indicates that the computational complexity per iteration is linear with respect to the problem size, i.e., $nm$, multiplied by the number of components, i.e., $k$.

Impact of the dichotomy

When $n$ is small, the computational cost of the dichotomy in step 9 becomes dominant. Nevertheless, in general, for larger values of $n$, the impact of the dichotomy becomes negligible.
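
To make the orders of magnitude concrete, the sketch below compares the dichotomy term with the matrix-product terms of the complexity analysis. It is a back-of-the-envelope estimate only: the function name is hypothetical, all hidden constants are set to one, and the value `c_d=30` is an arbitrary assumption rather than a measured number of dichotomy iterations.

```python
def dichotomy_share(n, m, k, c_d):
    """Rough fraction of the per-iteration cost spent in the dichotomy.

    Counts steps 4 and 8 as n*m*k operations each and step 9 as c_d*k*m,
    ignoring the cheaper steps 5 and 10 and all constant factors.
    """
    products = 2 * n * m * k
    dichotomy = c_d * k * m
    return dichotomy / (products + dichotomy)


# Problem sizes with k = 3 components and m = 64**2 pixels.
for n in (25, 100, 500, 1000):
    print(n, round(dichotomy_share(n, m=64**2, k=3, c_d=30), 3))
```

Under this crude model the share equals $c_{d}/(2n+c_{d})$, so it decreases monotonically with $n$ and exceeds one half only when $c_{d}>2n$. Actual timings also depend on how efficiently each step is implemented.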

5.2 Tight Majorizing Functions

While we do not make any theoretical contributions concerning the speed of convergence of the algorithm, we want to emphasize the natural fact that tighter majorizing functions lead to faster convergence. Therefore, when evaluating an algorithm, we believe that the analysis of the underlying majorizing function is as insightful as the experimental evaluation. As an example, we could have used Block Mirror Descent to solve (3), as was done in [16, Algorithm 1]. This algorithm uses a Bregman Difference to create a majorization function for the subproblem. However, this would result in a much looser majorization function, which partly explains the slow convergence of this algorithm observed in [16]. This difference between majorization functions is exemplified in Figure 2.

Figure 2: Two first-order majorizing functions for the function $f(x_{0},x_{1})=\log(0.2x_{0}+0.8x_{1})$ are considered. We observe that Lemma 3 provides a tighter majorization of $f$ than Lemma 5, resulting in a better update $\bm{x}^{t+1}$.

Linesearch

By tightening the bounds we used to construct the surrogate function, we can develop a more efficient algorithm. Here, we apply a classic "linesearch" method to the function $s_{L}$. However, the same technique can be trivially applied to $s_{R}$ as well. First, in (12) or (11), replace the constant $\sigma_{L}$ with a parameter $\gamma$ and initialize it with $\sigma_{L}$. Second, at each iteration, update the parameter $\gamma$ according to the following rule:

$$\gamma^{t+1}=\begin{cases}\upsilon\gamma^{t}&\text{if }s_{L}\left(\bm{x}\right)\geq g\left(\bm{x},\bm{x}^{t},\gamma\right)\\ \frac{1}{\tau}\gamma^{t}&\text{otherwise.}\end{cases}$$

Here, $\upsilon$ and $\tau$ are two update rates that determine how fast $\gamma$ is updated. Choosing values that are too small for these parameters leads to an inefficient linesearch, while selecting values that are too large can result in strong oscillation patterns. Typical values for $\upsilon$ and $\tau$ range from 1.05 to 1.5. However, it is important to note that when using linesearch, we are not guaranteed to converge, as we might invalidate the assumptions of Theorem 1.
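
The update rule for $\gamma$ can be sketched as a small helper. This is an illustrative sketch under stated assumptions, not the espm code: the function name and arguments are hypothetical, the default rates of 1.2 are simply picked inside the typical range quoted above, and the point $\bm{x}$ at which $s_{L}$ and $g$ are compared is assumed to be the newly computed iterate.

```python
def update_gamma(gamma, s_value, g_value, upsilon=1.2, tau=1.2):
    """One application of the gamma update rule (hypothetical helper).

    gamma   : current value gamma^t
    s_value : s_L(x) at the evaluated point (assumed: the new iterate)
    g_value : surrogate value g(x, x^t, gamma) at the same point
    """
    if s_value >= g_value:
        # The surrogate does not lie above s_L at x: increase gamma.
        return upsilon * gamma
    # The surrogate still dominates s_L at x: relax gamma.
    return gamma / tau
```

With rates larger than one, $\gamma$ grows when the surrogate is violated at the evaluated point and shrinks otherwise, which is how the rule trades tightness against the risk of breaking the majorization.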

6 Numerical Simulation

Problem

In this section, we analyze the speed of convergence of Algorithms 2 and 3 through numerical simulations. As a regularizer for $\bm{H}$, we consider the Laplacian regularization $R(\bm{H},\lambda)=\frac{\lambda}{2}\text{tr}(\bm{H}^{T}\Delta\bm{H})$, where $\Delta$ represents the two-dimensional Laplacian applied to each of the $k$ rows of $\bm{H}\in\mathbb{R}^{k\times p^{2}}$, reshaped as $k$ images of size $p\times p$. Since $R(\bm{H},\lambda)$ can be trivially decreased by reducing the amplitude of $\bm{H}$, we add the simplex constraint $\bm{1}^{T}\bm{H}=\bm{1}$. This leads to the following optimization problem:

$$\begin{aligned}
\dot{\bm{W}},\dot{\bm{H}}= &\arg\min_{\bm{W},\bm{H}}-\left\langle\bm{Y},\log(\bm{W}\bm{H})\right\rangle+\left\langle\mathbf{1},\bm{W}\bm{H}\right\rangle+\frac{\lambda}{2}\text{tr}(\bm{H}^{T}\Delta\bm{H})\\
&\text{subject to }\bm{W}\geq\epsilon,\ \bm{H}\geq\epsilon,\ \bm{1}^{T}\bm{H}=\bm{1}
\end{aligned}$$
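
For readers who want to evaluate this objective numerically, here is a minimal NumPy sketch. It is not the espm implementation: both function names are hypothetical, the Laplacian is assumed to be the graph Laplacian of a 4-neighbour pixel grid with free boundaries, and it is stored as a dense $p^{2}\times p^{2}$ matrix acting on each flattened row of $\bm{H}$, which is only practical for small $p$ (with $p\geq 2$).

```python
import numpy as np


def grid_laplacian(p):
    """Dense graph Laplacian of a p x p pixel grid (assumed 4-neighbour, free boundaries)."""
    d = 2.0 * np.eye(p) - np.eye(p, k=1) - np.eye(p, k=-1)
    d[0, 0] = d[-1, -1] = 1.0  # boundary pixels have a single neighbour per axis
    eye = np.eye(p)
    return np.kron(d, eye) + np.kron(eye, d)


def objective(Y, W, H, lam, lap):
    """Poisson NMF loss plus Laplacian smoothness penalty on the rows of H."""
    WH = W @ H
    data_term = -np.sum(Y * np.log(WH)) + np.sum(WH)
    # Sum over rows h_i of h_i @ lap @ h_i, i.e. the trace of H lap H^T.
    smoothness = 0.5 * lam * np.sum((H @ lap) * H)
    return data_term + smoothness
```

The constraints $\bm{W}\geq\epsilon$, $\bm{H}\geq\epsilon$ and $\bm{1}^{T}\bm{H}=\bm{1}$ are not checked here; the function only evaluates the cost at a given feasible point.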

This particular problem can be applied in various domains, such as Non-Negative Matrix Factorization for hyperspectral images [36] and remote sensing [32] (see the Related Work in Section 2 for more references and applications). Our algorithms and regularizations were specifically developed for the espm Python package [51], where all algorithms and experiments can be found.

Dataset

We construct two datasets consisting of 50 randomly drawn samples. In the first dataset, both matrices $\bm{W}$ and $\bm{H}$ are randomly generated from a uniform distribution. In the second dataset, each column of $\bm{W}$ corresponds to the sum of Gaussian functions that are randomly centered and scaled. The matrix $\bm{H}$ represents random smooth images. This second dataset is created using the espm package [51], where the toy model is used for $\bm{W}$ and $\bm{H}$ is generated using the "laplacian" weight type. This dataset is chosen because it can benefit from the Laplacian regularization on $\bm{H}$.

Once $\bm{W}$ and $\bm{H}$ are generated, the noiseless matrix $\bm{Y}$ is obtained as $\bm{Y}=\bm{W}\bm{H}$. We introduce noise by independently sampling each element $\tilde{\bm{Y}}_{ij}\sim\frac{\text{Poisson}(\lambda\bm{Y}_{ij})}{\lambda}$, where $\lambda$ can be regarded as the noise control parameter. For all samples, we set $\bm{W}\in\mathbb{R}^{n\times k}$ and $\bm{H}\in\mathbb{R}^{k\times p^{2}}$ with $k=3$, $p=64$, and $n$ selected from the set $\{25,100,500,1000\}$. Thus, the images in the dataset have dimensions of $64\times 64$.
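
The first (uniform) dataset and the noise model can be reproduced in spirit with the short sketch below. It is an illustrative assumption-laden example rather than the script used for the experiments: the function name, the uniform range $[0,1)$, the seed handling and the default `noise_level=100.0` (playing the role of $\lambda$ in the noise model) are all choices made here for illustration.

```python
import numpy as np


def make_uniform_sample(n, k=3, p=64, noise_level=100.0, seed=0):
    """Draw one uniform sample and its Poisson-corrupted observation (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(size=(n, k))        # n x k factor
    H = rng.uniform(size=(k, p * p))    # k flattened p x p images
    Y = W @ H                           # noiseless data
    # Poisson counts with mean noise_level * Y, rescaled back to the scale of Y.
    Y_noisy = rng.poisson(noise_level * Y) / noise_level
    return W, H, Y, Y_noisy


W, H, Y, Y_noisy = make_uniform_sample(n=25)
print(Y.shape, Y_noisy.shape)
```

Since the variance of each noisy entry is $\bm{Y}_{ij}/\lambda$, larger values of the noise control parameter give observations closer to the noiseless matrix.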

Results

We compare the performance of Algorithm 2 (MU), Algorithm 3 (QU), Block Mirror Descent (similar to [16, Algorithm 1]) and the projected gradient algorithm applied to (3). Figure 3 displays the convergence curves for 1000 iterations and $n=25$. Although the overall complexity of all algorithms is the same, the time per iteration differs due to the different operations performed within each iteration and the time spent on the dichotomy to compute the dual variable $\nu$. Therefore, we provide the time in seconds for each algorithm in Figure 4 for various values of $n$. All results are averaged over 50 repetitions.

Figure 3: Convergence curves for 1000 iterations. We remove the minimum loss $\mathcal{L}\left(\dot{\bm{W}},\dot{\bm{H}}\right)$. Panel columns: smooth images samples; random uniform $\bm{W}$, $\bm{H}$ samples. Panel rows: noiseless; noisy.

Figure 4: Execution time for 100 iterations for different problem sizes. Here we fix the dimension of $\bm{H}$ to $3\times 64^{2}$ and vary $\bm{W}$ from $25\times 3$ to $1000\times 3$.

Discussion

Let’s discuss the results in more detail:
- Number of iterations: Figure 3 illustrates the convergence behavior of the QU and MU algorithms. QU converges faster per iteration than MU, which aligns with our expectations given the tighter majorizing function used in QU. However, the linesearch technique, while accelerating convergence, can lead to occasional instability, as indicated by occasional increases in the loss function. This observation supports our earlier discussion in Section 5.2. The challenge with Projected Gradient Descent is to find an initial learning rate that is neither too large, in which case the algorithm can diverge, nor too small, in which case the algorithm is slow. Overall, since the learning rate cannot be selected optimally, this algorithm is slower than QU and MU. The Block Mirror Descent algorithm is also slower than QU and MU, which is consistent with the results of [16] and can be explained by the looser majorizing function it uses.
- Time per iteration: Figure 4 presents the total time taken by the algorithms to complete 100 iterations. The results demonstrate that for small values of $n$, the computation of the $p^{2}$ dual variables during the dichotomy dominates the overall execution time. However, as $n$ increases, the time spent on the dichotomy becomes negligible in comparison. These findings align with the complexity per iteration discussed in Section 5.1.

7 Conclusion

This contribution is the first to address the Poisson NMF problem with general regularization terms, such as Lipschitz functions, relatively smooth functions, or those expressed as linear constraints. We introduce two new algorithms and demonstrate their convergence to a coordinate-wise minimum, which is also a stationary point. Emphasizing the impact of the majorizing function choice on convergence speed, we validate our findings through numerical simulations. In essence, we believe that this work serves as a helpful guide for developing efficient algorithms suited for regularized Poisson NMF problems.

Appendix A Proofs

In this Appendix, we provide the different proofs used in the paper.

A.1 Proof of Lemma 1

We note that this lemma and its proof likely exist in the literature, but we were unable to find a reference. See Lemma 1.

Proof.

If $\mathcal{L}$ is continuously differentiable, the directional derivative can be written as $\mathcal{L}^{\prime}(\bm{z};\bm{d})=\bm{d}^{\top}\nabla\mathcal{L}(\bm{z})$. At a coordinatewise minimum $\bm{z}$, we have by definition:

$$\mathcal{L}^{\prime}\left(\bm{z};\left[\bm{d}_{w},\bm{0}\right]\right)=\nabla\mathcal{L}\left(\bm{z}\right)\left[\bm{d}_{w},\bm{0}\right]^{\top}\geq 0$$

and

$$\mathcal{L}^{\prime}\left(\bm{z};\left[\bm{0},\bm{d}_{h}\right]\right)=\nabla\mathcal{L}\left(\bm{z}\right)\left[\bm{0},\bm{d}_{h}\right]^{\top}\geq 0.$$

Therefore,

$$\begin{aligned}
\mathcal{L}^{\prime}(\bm{z};\bm{d}) &=\nabla\mathcal{L}(\bm{z})[\bm{d}_{w},\bm{d}_{h}]^{\top}\\
&=\nabla\mathcal{L}(\bm{z})([\bm{d}_{w},\bm{0}]^{\top}+[\bm{0},\bm{d}_{h}]^{\top})\\
&=\nabla\mathcal{L}(\bm{z})[\bm{d}_{w},\bm{0}]^{\top}+\nabla\mathcal{L}(\bm{z})[\bm{0},\bm{d}_{h}]^{\top}\\
&\geq 0.
\end{aligned}$$

A.2 Proof of majorizing functions

See Lemma 3.

Proof.

First let us observe that

$$\begin{aligned}
g\left(\bm{x}^{t},\bm{x}^{t}\right) &=-\sum_{j}\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}\log\left(\frac{a_{j}x_{j}^{t}}{\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}}\right)\\
&=-\sum_{j}\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}\log\left(\sum_{k}a_{k}x_{k}^{t}\right)\\
&=-\log\left(\sum_{k}a_{k}x_{k}^{t}\right)=f\left(\bm{x}^{t}\right).
\end{aligned}$$

Moreover, we have $g\left(\bm{x},\bm{x}^{t}\right)\geq f\left(\bm{x}\right)$. This inequality follows from the convexity of the $-\log$ function (Jensen's inequality):

$$-\log\left(\sum_{j}a_{j}x_{j}\right)=-\log\left(\sum_{j}q_{j}\frac{a_{j}x_{j}}{u_{j}}\right)\leq-\sum_{j}q_{j}\log\left(\frac{a_{j}x_{j}}{u_{j}}\right)$$

where we set $u_{j}=q_{j}=\frac{a_{j}x_{j}^{t}}{\sum_{k}a_{k}x_{k}^{t}}$, so that the weights $q_{j}$ sum to one. Finally, by continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ properties of majorizing functions. ∎
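
The two properties established in this proof can be checked numerically. The sketch below is only a sanity check under stated assumptions: the function names are ours, the coefficients $a=(0.2,0.8)$ are borrowed from the example of Figure 2, and the reference point and sampling range are arbitrary.

```python
import numpy as np


def f(x, a):
    """f(x) = -log(sum_j a_j x_j)."""
    return -np.log(a @ x)


def g(x, x_t, a):
    """Surrogate -sum_j q_j log(a_j x_j / q_j), with q_j = a_j x_j^t / sum_k a_k x_k^t."""
    q = a * x_t / (a @ x_t)
    return -np.sum(q * np.log(a * x / q))


a = np.array([0.2, 0.8])
x_t = np.array([1.0, 2.0])
rng = np.random.default_rng(0)

# Tightness at the reference point: g(x^t, x^t) = f(x^t).
assert np.isclose(g(x_t, x_t, a), f(x_t, a))

# Majorization at random positive points: g(x, x^t) >= f(x).
for _ in range(1000):
    x = rng.uniform(0.01, 10.0, size=2)
    assert g(x, x_t, a) >= f(x, a) - 1e-12
```

Random sampling does not replace the proof; it merely guards against sign or indexing mistakes when implementing the surrogate.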

We now proceed to majorize $s_{L}\left(\bm{x}\right)$, $s_{R}\left(\bm{x}\right)$, and $s_{C}\left(x_{j}\right)$. See Lemma 4.

Proof.

First it can be trivially observed that

$$s_{L}\left(\bm{x}^{t}\right)=g_{1}\left(\bm{x}^{t},\bm{x}^{t}\right)=g_{2}\left(\bm{x}^{t},\bm{x}^{t}\right),$$

which satisfies the first property. We then take the first-order Taylor expansion of $s_{L}$ around $\bm{x}^{t}$ and find

$$s_{L}\left(\bm{x}\right)=s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\mathcal{R}\left(\bm{x},\bm{x}^{t}\right)$$

where $\mathcal{R}\left(\bm{x},\bm{x}^{t}\right)\leq\sigma_{L}\|\bm{x}-\bm{x}^{t}\|_{2}^{2}$ since the function $s_{L}$ is gradient Lipschitz with constant $\sigma_{L}$. Therefore, we have

$$\begin{aligned}
s_{L}\left(\bm{x}\right) &\leq s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+\sigma_{L}\|\bm{x}-\bm{x}^{t}\|_{2}^{2}=g_{1}\left(\bm{x},\bm{x}^{t}\right)\\
&\leq s_{L}\left(\bm{x}^{t}\right)+\left(\bm{x}-\bm{x}^{t}\right)^{\top}\nabla s_{L}\left(\bm{x}^{t}\right)+2\sigma_{L}\left(\max_{j}x_{j}^{t}\right)\left(\sum_{j}x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j}}\right)-x_{j}^{t}+x_{j}\right)\qquad (27)\\
&=g_{2}\left(\bm{x},\bm{x}^{t}\right),
\end{aligned}$$

where (27) will be shown later in this proof. By continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ properties of majorizing functions. We now need to prove (27) and reformulate it as

$$\|\bm{x}-\bm{x}^{t}\|_{2}^{2}\leq 2\left(\max_{j}x_{j}^{t}\right)\left(\sum_{j}x_{j}^{t}\log\left(\frac{x_{j}^{t}}{x_{j}}\right)-x_{j}^{t}+x_{j}\right)=2\left(\max_{j}x_{j}^{t}\right)D_{GKL}\left(\bm{x}\|\bm{x}^{t}\right).\qquad (28)$$

For simplicity, let us define the function

$$q\left(\bm{x}\right)=\sum_{j}x_{j}\log\left(x_{j}\right),$$

with the gradient $\nabla_{x_{j}}q\left(\bm{x}\right)=\log\left(x_{j}\right)+1$ and the Hessian

$$\bm{H}_{ij}^{q}\left(\bm{x}\right)=\frac{\partial^{2}q}{\partial x_{i}\partial x_{j}}\left(\bm{x}\right)=\begin{cases}\frac{1}{x_{j}}&\text{if }i=j\\ 0&\text{otherwise.}\end{cases}$$

Note that $q$ is a strictly convex function for $\bm{x}>0$. We expand the generalized KL divergence:

$$\begin{aligned}
D_{GKL}\left(\bm{x}\|\bm{x}^{t}\right) &= \sum_{j}x_{j}^{t}\log\left(x_{j}^{t}\right)-\sum_{j}x_{j}^{t}\log\left(x_{j}\right)-\sum_{j}x_{j}^{t}+\sum_{j}x_{j} \\
&= \sum_{j}x_{j}^{t}\log\left(x_{j}^{t}\right)-\sum_{j}x_{j}\log\left(x_{j}\right)-\sum_{j}\left(\log\left(x_{j}\right)+1\right)\left(x_{j}^{t}-x_{j}\right) \\
&= q\left(\bm{x}^{t}\right)-\left(q\left(\bm{x}\right)+\nabla q\left(\bm{x}\right)^{\top}\left(\bm{x}^{t}-\bm{x}\right)\right) \\
&= \frac{1}{2}\left(\bm{x}^{t}-\bm{x}\right)^{\top}\bm{H}^{q}\left(\tilde{\bm{x}}\right)\left(\bm{x}^{t}-\bm{x}\right),
\end{aligned} \tag{29}$$

where $\tilde{\bm{x}}$ is selected such that the last equality holds. Since $q$ is a strictly convex function, we know that $\tilde{\bm{x}}=\rho\bm{x}+\left(1-\rho\right)\bm{x}^{t}$ for some given $\rho\in[0,1]$. Now we bound the Hessian as

$$\bm{H}^{q}\left(\bm{x}\right)\geq\frac{1}{\max_{j}x_{j}}\bm{I}$$

and, introducing this inequality in (29), we obtain

$$D_{GKL}\left(\bm{x}\|\bm{x}^{t}\right)\geq\frac{1}{2\max_{j}x_{j}}\|\bm{x}-\bm{x}^{t}\|_{2}^{2},$$

which is equivalent to (28) and completes the proof. ∎
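As a numerical sanity check of the Taylor argument above, the following Python sketch evaluates both sides of the quadratic lower bound on random positive vectors. The helper names `gkl` and `quadratic_bound_holds` are illustrative and not part of the paper. The sketch assumes the conservative constant $\max\left(\max_{j}x_{j},\max_{j}x_{j}^{t}\right)$, which upper-bounds $\max_{j}\tilde{x}_{j}$ for any $\tilde{\bm{x}}$ on the segment between $\bm{x}$ and $\bm{x}^{t}$.

```python
import numpy as np

def gkl(x, xt):
    """Generalized KL term as written in (28): sum_j xt*log(xt/x) - xt + x."""
    return float(np.sum(xt * np.log(xt / x) - xt + x))

def quadratic_bound_holds(x, xt):
    """Check ||x - xt||^2 <= 2 * m * D_GKL with a conservative constant m."""
    # Assumption of this sketch: x_tilde lies on the segment [x, xt], so
    # max_j x_tilde_j <= max(max_j x_j, max_j xt_j).
    m = max(x.max(), xt.max())
    return bool(np.sum((x - xt) ** 2) <= 2.0 * m * gkl(x, xt) + 1e-12)

rng = np.random.default_rng(0)
checks = [
    quadratic_bound_holds(rng.uniform(0.01, 5.0, 8), rng.uniform(0.01, 5.0, 8))
    for _ in range(1000)
]
all_hold = all(checks)
```

The check is coordinate-wise in nature: each summand of the divergence equals $\frac{(x_{j}^{t}-x_{j})^{2}}{2\xi_{j}}$ for some $\xi_{j}$ between $x_{j}$ and $x_{j}^{t}$, which is why a constant dominating both vectors suffices.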

See 5

Proof.

The first property $g\left(\bm{x}^{t},\bm{x}^{t}\right)=s_{R}\left(\bm{x}^{t}\right)$ can be trivially verified. Then, by the definition of a relatively smooth function,

$$s_{R}\left(\bm{x}\right)\leq s_{R}\left(\bm{x}^{t}\right)+\left\langle\nabla s_{R}\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle+\sigma_{R}\mathcal{B}_{\kappa}\left(\bm{x},\bm{x}^{t}\right),$$

where

$$\mathcal{B}_{\kappa}\left(\bm{x},\bm{x}^{t}\right):=\kappa\left(\bm{x}\right)-\kappa\left(\bm{x}^{t}\right)-\left\langle\nabla\kappa\left(\bm{x}^{t}\right),\bm{x}-\bm{x}^{t}\right\rangle.$$

Given that $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$, we compute

$$\mathcal{B}_{\kappa}\left(\bm{x},\bm{x}^{t}\right)=\sum_{i=1}^{n}\left(\frac{x_{i}}{x_{i}^{t}}-\log\left(\frac{x_{i}}{x_{i}^{t}}\right)-1\right).$$

Finally, by continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ properties of majorizing functions. ∎
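For completeness, the Bregman divergence used in the proof above can be spelled out step by step. The derivation below is a sketch that uses only the definition of $\mathcal{B}_{\kappa}$ and the gradient $\nabla_{x_{i}}\kappa\left(\bm{x}\right)=-1/x_{i}$ of $\kappa\left(\bm{x}\right)=-\bm{1}^{\top}\log\left(\bm{x}\right)$:

```latex
\begin{aligned}
\mathcal{B}_{\kappa}\left(\bm{x},\bm{x}^{t}\right)
&= -\sum_{i=1}^{n}\log\left(x_{i}\right)
   +\sum_{i=1}^{n}\log\left(x_{i}^{t}\right)
   +\sum_{i=1}^{n}\frac{x_{i}-x_{i}^{t}}{x_{i}^{t}} \\
&= \sum_{i=1}^{n}\left(\frac{x_{i}}{x_{i}^{t}}
   -\log\left(\frac{x_{i}}{x_{i}^{t}}\right)-1\right).
\end{aligned}
```

Differentiating this expression with respect to $x_{i}$ gives $\frac{1}{x_{i}^{t}}-\frac{1}{x_{i}}$, which is the factor multiplying $\sigma_{R}$ in the stationarity conditions of Section A.3.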

See 6

Proof.

One can simply observe $s\left(x^{t}\right)=g\left(x^{t},x^{t}\right)$. Then, by concavity, we have

$$s\left(x_{i}\right)\leq s\left(x_{i}^{t}\right)+\frac{\partial s\left(x_{i}^{t}\right)}{\partial x_{i}}\left(x_{i}-x_{i}^{t}\right).$$

Finally, by continuity, we obtain the $3^{\text{rd}}$ and $4^{\text{th}}$ properties of majorizing functions. ∎
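The tangent-line bound used in this proof can be illustrated numerically. The sketch below assumes a hypothetical concave regularizer $s(x)=\log(1+x)$, chosen only for illustration and not taken from the paper, and checks on a grid that the tangent at $x^{t}$ lies above $s$ and touches it at $x^{t}$.

```python
import numpy as np

def tangent_majorizer(s, ds, xt):
    """Return g(x) = s(xt) + s'(xt) * (x - xt), the tangent of s at xt."""
    return lambda x: s(xt) + ds(xt) * (x - xt)

# Hypothetical concave function and its derivative (illustration only).
s = np.log1p
ds = lambda x: 1.0 / (1.0 + x)

xt = 0.8
g = tangent_majorizer(s, ds, xt)

grid = np.linspace(0.0, 10.0, 1001)
gap = g(grid) - s(grid)  # non-negative everywhere if g majorizes s
```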

A.3 Proof of subproblem updates

In this subsection, we present the proof of the subproblem updates used in the MU and QU algorithms. Let us start with the MU updates. See 1

Proof.

Assuming $a_{ij},b_{i}>0$, (10) is strictly convex, as the green term is strictly convex, and the remaining terms are convex. Consequently, (10) possesses a global minimum. To identify this minimum, we seek the stationary point $\nabla_{\bm{x}}g=\bm{0}$. Due to our meticulous selection of majorizing functions, this subproblem becomes separable. When computing the gradient with respect to the variable $x_{j}$, we obtain:

$$\begin{aligned}
\nabla_{x_{j}}g\left(\bm{x},\bm{x}^{t}\right) ={}& {\color[rgb]{0.0,0.5,0.0}-\frac{1}{x_{j}}\sum_{i}b_{i}\frac{a_{ij}x_{j}^{t}}{\sum_{k}a_{ik}x_{k}^{t}}+\sum_{i}a_{ij}} \\
& {\color[rgb]{0,0,1}+\nabla_{x_{j}}s_{L}\left(\bm{x}^{t}\right)+2\left(\max_{i}x_{i}^{t}\right)\sigma_{L}\left(1-\frac{x_{j}^{t}}{x_{j}}\right)} \\
& {\color[rgb]{1,.5,0}+\nabla_{x_{j}}s_{R}\left(\bm{x}^{t}\right)+\sigma_{R}\left(\frac{1}{x_{j}^{t}}-\frac{1}{x_{j}}\right)} \\
& {\color[rgb]{.75,0,.25}+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x}}=0
\end{aligned} \tag{30}$$

where we assume that $x_{j},x_{j}^{t}>0$ since $0\notin\mathcal{C}$. Multiplying (30) by $x_{j}$ and solving the resulting linear equation, we find the multiplicative update rule (16) for $x_{j}$. ∎
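To make the last step concrete, here is a minimal NumPy sketch that solves the stationarity condition (30) coordinate by coordinate. The function and argument names (`mu_step`, `grad_sL`, `grad_sR`, `grad_sC`) are hypothetical, and the closed form is obtained here by multiplying (30) by $x_{j}$; it should agree with (16) up to notation, but that is an assumption since the sketch is derived from (30) alone. The denominator is assumed to be positive.

```python
import numpy as np

def mu_step(A, b, xt, grad_sL, sigma_L, grad_sR, sigma_R, grad_sC):
    """Solve (30) for x > 0, coordinate-wise (sketch, assumes denom > 0)."""
    c = xt * (A.T @ (b / (A @ xt)))   # x_j^t * sum_i b_i a_ij / (A x^t)_i
    m = xt.max()                      # max_i x_i^t
    numer = c + 2.0 * m * sigma_L * xt + sigma_R
    denom = (A.sum(axis=0) + grad_sL + 2.0 * m * sigma_L
             + grad_sR + sigma_R / xt + grad_sC)
    return numer / denom

def stationarity_residual(x, A, b, xt, grad_sL, sigma_L, grad_sR, sigma_R, grad_sC):
    """Left-hand side of (30), term by term; zero at the minimizer."""
    c = xt * (A.T @ (b / (A @ xt)))
    m = xt.max()
    return (-c / x + A.sum(axis=0)                              # green term
            + grad_sL + 2.0 * m * sigma_L * (1.0 - xt / x)      # blue term
            + grad_sR + sigma_R * (1.0 / xt - 1.0 / x)          # orange term
            + grad_sC)                                          # purple term
```

With all regularization terms set to zero, `mu_step` reduces to the familiar multiplicative update $x_{j}=x_{j}^{t}\,\frac{\sum_{i}b_{i}a_{ij}/(\bm{A}\bm{x}^{t})_{i}}{\sum_{i}a_{ij}}$.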

The proof of the QU updates is similar to that of the MU updates, except that we need to solve a quadratic equation to obtain a closed-form solution. See 2

Proof.

Assuming $a_{ij},b_{i}>0$, Equation (19) is strictly convex. This convexity arises from the strict convexity of the green term, coupled with the convexity of the other terms. Consequently, (19) possesses a global minimum. To identify this minimum, we seek the stationary point by computing $\nabla_{\bm{x}}g=\bm{0}$:

$$\begin{aligned}
\nabla_{x_{j}}g\left(\bm{x},\bm{x}^{t}\right) ={}& {\color[rgb]{0.0,0.5,0.0}-\frac{1}{x_{j}}\sum_{i}b_{i}\frac{a_{ij}x_{j}^{t}}{\sum_{k}a_{ik}x_{k}^{t}}+\sum_{i}a_{ij}} \\
& {\color[rgb]{0,0,1}+\nabla_{x_{j}}s_{L}\left(\bm{x}^{t}\right)+2\sigma_{L}\left(x_{j}-x_{j}^{t}\right)} \\
& {\color[rgb]{1,.5,0}+\nabla_{x_{j}}s_{R}\left(\bm{x}^{t}\right)+\sigma_{R}\left(\frac{1}{x_{j}^{t}}-\frac{1}{x_{j}}\right)} \\
& {\color[rgb]{.75,0,.25}+\frac{\partial s_{C}\left(x_{j}^{t}\right)}{\partial x}}=0
\end{aligned} \tag{31}$$

We observe that, for $x_{j}>0$, (31) is equivalent to a separable quadratic equation, hence the name Quadratic Update (QU). It can be rewritten as

$$\alpha x_{j}^{2}+\beta_{j}^{t}x_{j}-\zeta_{j}^{t}=0,$$

where $\alpha$, $\beta_{j}^{t}$, and $\zeta_{j}^{t}$ are given in (21). Assuming $\zeta_{j}^{t}\neq 0$, we have $\zeta_{j}^{t}>0$ and $4\alpha\zeta_{j}^{t}>0$. Therefore, the previous quadratic equation has two real solutions. Since $\sqrt{\left(\beta_{j}^{t}\right)^{2}+4\alpha\zeta_{j}^{t}}>\left|\beta_{j}^{t}\right|$, they are of opposite sign. Due to the constraint $x_{j}\geq\epsilon>0$, we select the positive one, leading to the update rule of (20) for $x_{j}$. ∎
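The choice of root can be sketched in a few lines of NumPy. The function name `qu_positive_root` is hypothetical, and the coefficients are treated as plain inputs standing in for $\alpha$, $\beta_{j}^{t}$, and $\zeta_{j}^{t}$ of (21); the sketch assumes $\alpha>0$ and $\zeta_{j}^{t}>0$. It uses the algebraically equivalent form $\frac{2\zeta}{\beta+\sqrt{\beta^{2}+4\alpha\zeta}}$ of the positive root, a rearrangement chosen here because it avoids subtracting two nearly equal numbers when $\beta$ is large and positive.

```python
import numpy as np

def qu_positive_root(alpha, beta, zeta):
    """Positive root of alpha*x**2 + beta*x - zeta = 0 (alpha > 0, zeta > 0).

    Equivalent to (-beta + sqrt(beta**2 + 4*alpha*zeta)) / (2*alpha),
    written so that the denominator beta + sqrt(...) is always positive.
    """
    disc = np.sqrt(beta ** 2 + 4.0 * alpha * zeta)
    return 2.0 * zeta / (beta + disc)

# Illustrative coefficients (not from the paper), with beta of both signs.
alpha = 0.6
beta = np.array([-2.0, -0.1, 0.0, 0.5, 30.0])
zeta = np.array([0.3, 1.0, 2.0, 0.7, 0.05])
x = qu_positive_root(alpha, beta, zeta)
residual = alpha * x ** 2 + beta * x - zeta
```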

Appendix B Computation of Lower and Upper Bounds for Dichotomy

In Section 4.3, we introduced modifications to the MU and QU algorithms to incorporate the positivity constraint $\bm{x}\geq\epsilon$ and the linear constraint $\bm{e}^{\top}\bm{x}=1$. However, solving for the dual parameter $\nu$ in equations (24) for MU or (25) for QU is intractable. To address this, we propose using the dichotomy method to solve $h(\nu)=0$. This appendix therefore provides the computation of a lower bound $\nu_{\text{low}}$ and an upper bound $\nu_{\text{up}}$ such that $h(\nu_{\text{low}})<0$ and $h(\nu_{\text{up}})>0$. These bounds serve as convenient initializations for the dichotomy algorithm.
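The dichotomy step itself can be sketched generically as follows. This is a minimal illustration, not the authors' implementation: the function name `dichotomy`, the tolerance, and the iteration cap are assumptions, and the only requirement is a continuous function `h` together with one point where it is positive and one where it is negative.

```python
def dichotomy(h, nu_pos, nu_neg, tol=1e-10, max_iter=200):
    """Find nu with h(nu) close to 0, given h(nu_pos) > 0 and h(nu_neg) < 0."""
    if not (h(nu_pos) > 0 and h(nu_neg) < 0):
        raise ValueError("the two starting points must bracket a sign change")
    for _ in range(max_iter):
        mid = 0.5 * (nu_pos + nu_neg)
        value = h(mid)
        if value > 0:
            nu_pos = mid      # the root lies between mid and nu_neg
        elif value < 0:
            nu_neg = mid      # the root lies between nu_pos and mid
        else:
            return mid        # exact root found
        if abs(nu_pos - nu_neg) <= tol:
            break
    return 0.5 * (nu_pos + nu_neg)

# Toy decreasing function with root at nu = 1 (illustration only).
root = dichotomy(lambda nu: 1.0 / (1.0 + nu) - 0.5, nu_pos=0.0, nu_neg=10.0)
```

Tracking the two endpoints by the sign of `h` rather than by their numerical order keeps the routine valid whether `h` is increasing or decreasing.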

B.1 Case 1: MU

For MU, we aim to solve equation (24) for $\nu$:

$$h_{1}\left(\nu\right)=\sum_{j}e_{j}\frac{x_{j}^{t}\alpha_{j}}{\min\left(\frac{x_{j}^{t}\alpha_{j}}{\epsilon},\beta_{j}^{t}+\nu e_{j}\right)}-1=0.$$

Terms where $e_{j}=0$ can be ignored since they do not contribute to the sum. Assuming $\epsilon>0$, the function $h_{1}$ is well-defined for $\nu\in\mathbb{R}$. Since $x_{j}^{t}\alpha_{j}>0$, we have $h_{1}\left(\nu\right)=\frac{\|\bm{e}\|_{1}}{\epsilon}-1$ for $\nu\leq\nu_{\text{lim}}=\min_{j}\left(\frac{\frac{x_{j}^{t}\alpha_{j}}{\epsilon}-\beta_{j}^{t}}{e_{j}}\right)$, and $h_{1}$ is monotonically decreasing for $\nu\in\left[\nu_{\text{lim}},\infty\right)$. Assuming $\frac{\|\bm{e}\|_{1}}{\epsilon}-1\geq 0$ (which ensures the feasibility of the constraints $\bm{x}\geq\epsilon$ and $\bm{e}^{\top}\bm{x}=1$), we have $h_{1}(\nu_{\text{lim}})\geq 0$ and $\lim_{\nu\to\infty}h_{1}(\nu)=-1$. Thus, there exists exactly one root for the function $h_{1}$.

Negative bound

First, let us find $\nu_{\text{low}}$ such that $h_{1}(\nu)<0$ for $\nu_{\text{low}}\leq\nu<\infty$. We can bound $h_{1}$ as follows:

$$\begin{aligned}
h_{1}\left(\nu\right) &= \sum_{j}\frac{x_{j}^{t}\alpha_{j}}{\min\left(\frac{x_{j}^{t}\alpha_{j}}{\epsilon e_{j}},\frac{\beta_{j}^{t}}{e_{j}}+\nu\right)}-1 \\
&\leq n\frac{\max_{j}x_{j}^{t}\alpha_{j}}{\min_{j}\min\left(\frac{x_{j}^{t}\alpha_{j}}{\epsilon e_{j}},\frac{\beta_{j}^{t}}{e_{j}}+\nu\right)}-1<0.
\end{aligned}$$

Therefore, one possible bound is

$$\nu_{\text{low}}=n\max_{j}x_{j}^{t}\alpha_{j}-\min_{j}\frac{\beta_{j}^{t}}{e_{j}}.$$

Positive bound

Similarly, let’s find $\nu_{\text{up}}$ such that $h_1(\nu)\geq 0$ for $\nu_{\text{up}}\geq\nu\geq\nu_{\text{lim}}$. Note that $\nu_{\text{lim}}$ is not a good bound when $\epsilon$ is small. We can bound $h_1$ as follows:

$$\begin{aligned}
h_1(\nu) &= \sum_{j=1}^{n}\frac{x_j^t\alpha_j}{\min\left(\frac{x_j^t\alpha_j}{\epsilon e_j},\frac{\beta_j^t}{e_j}+\nu\right)}-1\\
&\geq \max_j\frac{x_j^t\alpha_j}{\min\left(\frac{x_j^t\alpha_j}{\epsilon e_j},\frac{\beta_j^t}{e_j}+\nu\right)}-1\\
&\geq \max_j\frac{x_j^t\alpha_j}{\frac{\beta_j^t}{e_j}+\nu}-1>0.
\end{aligned}$$

As a result, we have

$$\nu_{\text{up}}=\max_j\left(x_j^t\alpha_j-\frac{\beta_j^t}{e_j}\right).$$

To improve numerical stability, one could use $\nu_{\text{low}}^{\prime}=2n\max_j x_j^t\alpha_j-\min_j\frac{\beta_j^t}{e_j}$ and $\nu_{\text{up}}^{\prime}<\max_j\frac{x_j^t\alpha_j}{2}-\frac{\beta_j^t}{e_j}$.
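As an illustration, the two bounds of Case 1 can be used to bracket the root of $h_1$ before a one-dimensional search. The sketch below is a minimal pure-Python version and not the authors' implementation: the function names (`h1`, `bracket_h1`, `bisect_decreasing`), the choice of plain bisection, the tolerance and the toy data are all assumptions made for this example. It further assumes $e_j>0$, $x_j^t\alpha_j>0$ and $\epsilon e_j\leq 1/n$. Note that $\nu_{\text{up}}\leq\nu_{\text{low}}$: the subscripts refer to the sign of $h_1$, not to the ordering of the two values of $\nu$.

```python
def h1(nu, x, alpha, beta, e, eps):
    """Evaluate h_1(nu) = sum_j x_j a_j / min(x_j a_j / (eps e_j), b_j / e_j + nu) - 1."""
    total = 0.0
    for xj, aj, bj, ej in zip(x, alpha, beta, e):
        num = xj * aj
        total += num / min(num / (eps * ej), bj / ej + nu)
    return total - 1.0


def bracket_h1(x, alpha, beta, e):
    """Return (nu_up, nu_low) such that h1(nu_up) >= 0 >= h1(nu_low)."""
    n = len(x)
    prods = [xj * aj for xj, aj in zip(x, alpha)]
    ratios = [bj / ej for bj, ej in zip(beta, e)]
    nu_low = n * max(prods) - min(ratios)
    nu_up = max(p - r for p, r in zip(prods, ratios))
    return nu_up, nu_low


def bisect_decreasing(f, a, b, tol=1e-12, max_iter=200):
    """Bisection for a non-increasing f with f(a) >= 0 >= f(b) and a <= b."""
    for _ in range(max_iter):
        if b - a <= tol:
            break
        mid = 0.5 * (a + b)
        if f(mid) > 0.0:
            a = mid  # root lies to the right of mid
        else:
            b = mid  # root lies to the left of (or at) mid
    return 0.5 * (a + b)


# Toy data (hypothetical values chosen only for illustration).
x = [0.2, 0.3, 0.5]
alpha = [1.0, 2.0, 1.5]
beta = [1.0, 1.0, 1.0]
e = [1.0, 1.0, 1.0]
eps = 1e-6

nu_up, nu_low = bracket_h1(x, alpha, beta, e)
nu_star = bisect_decreasing(lambda nu: h1(nu, x, alpha, beta, e, eps), nu_up, nu_low)
```

In this toy example all ratios $\beta_j^t/e_j$ are equal to 1 and the $\epsilon$ clamp is inactive, so the root can be checked by hand: $\sum_j x_j^t\alpha_j/(1+\nu)=1$ gives $\nu\approx 0.55$.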

B.2 Case 2: QU

For QU, we aim to solve equation (25) for $\nu$:

$$h_2(\nu)=\sum_j e_j\max\left(\frac{-\beta_j^t-\nu e_j+\sqrt{\left(\beta_j^t+\nu e_j\right)^2+4\alpha\zeta_j^t}}{2\alpha},\epsilon\right)-1=0.$$

Let’s analyze the function $h_2$. The function $x\mapsto -x+\sqrt{x^2+\delta}$ is strictly decreasing over $\mathbb{R}$ for $\delta>0$, since its derivative $-1+\frac{x}{\sqrt{x^2+\delta}}$ is strictly negative for $\delta>0$. Therefore, each term of the sum is non-increasing in $\nu$, and so is $h_2$. Moreover, as soon as, for at least one $j$, we have

$$-\beta_j^t-\nu e_j+\sqrt{(\beta_j^t+\nu e_j)^2+4\alpha\zeta_j^t}\geq 2\alpha\epsilon,$$

the function $h_2$ is strictly decreasing. In the limit, we have $\lim_{\nu\to-\infty}h_2(\nu)=\infty$ and $\lim_{\nu\to\infty}h_2(\nu)=\epsilon\|\bm{e}\|_1-1$. Assuming $\epsilon\|\bm{e}\|_1<1$ (which ensures the feasibility of the constraints $\bm{x}\geq\epsilon$ and $\bm{e}^\top\bm{x}=1$), the second limit is negative, so the function $h_2$ has exactly one root.
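The monotonicity and the two limits of $h_2$ can be checked on a scalar function. The short derivation below is a sketch under the assumptions $e_j>0$, $\alpha>0$ and $\zeta_j^t>0$ (so that $\delta=4\alpha\zeta_j^t>0$ and $x=\beta_j^t+\nu e_j\to\pm\infty$ as $\nu\to\pm\infty$); the symbol $f$ is introduced here only for illustration.

```latex
\begin{align*}
f(x) &= -x+\sqrt{x^2+\delta}, \qquad
f'(x) = \frac{x}{\sqrt{x^2+\delta}}-1 < 0
  \quad\text{since } |x| < \sqrt{x^2+\delta},\\
f(x) &= \frac{\delta}{x+\sqrt{x^2+\delta}} \xrightarrow[x\to+\infty]{} 0,
  \qquad f(x) \xrightarrow[x\to-\infty]{} +\infty,\\
\lim_{\nu\to\infty} h_2(\nu)
  &= \sum_j e_j\max\left(0,\epsilon\right)-1
   = \epsilon\|\bm{e}\|_1-1.
\end{align*}
```

The second line uses the identity $\left(-x+\sqrt{x^2+\delta}\right)\left(x+\sqrt{x^2+\delta}\right)=\delta$, whose denominator is always positive.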

Negative bound

Let’s find $\nu_{\text{low}}$ such that $h_2(\nu)\leq 0$ for $\nu\geq\nu_{\text{low}}$. We start by bounding the term

$$-\beta_j^t-\nu e_j+\sqrt{\left(\beta_j^t+\nu e_j\right)^2+4\alpha\zeta_j^t}\leq\delta_j,$$

where $\delta_j>\epsilon>0$. We move $-\beta_j^t-\nu e_j$ to the right-hand side and square the inequality to remove the square root:

$$\left(\beta_j^t+\nu e_j\right)^2+4\alpha\zeta_j^t\leq\left(\beta_j^t+\nu e_j+\delta_j\right)^2.$$

Expanding the square, we can extract a bound on $\nu$ that ensures the inequality is satisfied for a chosen $\delta_j$:

$$\nu\geq\frac{4\alpha\zeta_j^t-\delta_j^2}{2\delta_j e_j}-\frac{\beta_j^t}{e_j}.$$

Let’s set $\delta_j=\frac{2\alpha}{me_j}$, where $m$ is the number of elements in the sum, and take the maximum over $j$ to obtain the bound:

$$\nu_{\text{low}}\geq\max_j\left(m\zeta_j^t-\frac{\alpha}{me_j^2}-\frac{\beta_j^t}{e_j}\right).$$

This bound is valid provided that $\delta_j=\frac{2\alpha}{me_j}\geq\epsilon$ for all $j$. Then, it can be verified that

$$\begin{aligned}
h_2\left(\nu_{\text{low}}\right) &= \sum_j e_j\max\left(\frac{-\beta_j^t-\nu_{\text{low}}e_j+\sqrt{\left(\beta_j^t+\nu_{\text{low}}e_j\right)^2+4\alpha\zeta_j^t}}{2\alpha},\epsilon\right)-1\\
&= \sum_j e_j\frac{-\beta_j^t-\nu_{\text{low}}e_j+\sqrt{\left(\beta_j^t+\nu_{\text{low}}e_j\right)^2+4\alpha\zeta_j^t}}{2\alpha}-1\\
&\leq \sum_j e_j\frac{\delta_j}{2\alpha}-1\\
&= \sum_{j=1}^{m}e_j\frac{\frac{2\alpha}{me_j}}{2\alpha}-1=0,
\end{aligned}$$

which proves that $\nu_{\text{low}}$ is a valid negative bound.
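For reference, the algebra leading from the squared inequality to the closed-form expression of $\nu_{\text{low}}$ can be spelled out as follows. This is a worked sketch assuming $e_j>0$ and $\alpha>0$, so that dividing by $2\delta_j e_j$ preserves the direction of the inequality.

```latex
\begin{align*}
\left(\beta_j^t+\nu e_j+\delta_j\right)^2-\left(\beta_j^t+\nu e_j\right)^2
  &= 2\delta_j\left(\beta_j^t+\nu e_j\right)+\delta_j^2 \;\geq\; 4\alpha\zeta_j^t\\
\iff\quad \nu &\geq \frac{4\alpha\zeta_j^t-\delta_j^2}{2\delta_j e_j}-\frac{\beta_j^t}{e_j},\\
\delta_j=\frac{2\alpha}{me_j}\;\Longrightarrow\;
\frac{4\alpha\zeta_j^t-\delta_j^2}{2\delta_j e_j}
  &= \frac{m}{4\alpha}\left(4\alpha\zeta_j^t-\frac{4\alpha^2}{m^2e_j^2}\right)
   = m\zeta_j^t-\frac{\alpha}{me_j^2}.
\end{align*}
```

The last line uses $2\delta_j e_j=4\alpha/m$ for the chosen $\delta_j$.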

Positive bound

Similarly, let’s find $\nu_{\text{up}}$ such that $h_2(\nu)\geq 0$ for $\nu\leq\nu_{\text{up}}$. This time, we will bound $h_2$ from below and obtain

$$\begin{aligned}
h_2(\nu) &= \sum_j e_j\max\left(\frac{-\beta_j^t-\nu e_j+\sqrt{\left(\beta_j^t+\nu e_j\right)^2+4\alpha\zeta_j^t}}{2\alpha},\epsilon\right)-1\\
&\geq \sum_j e_j\frac{-\beta_j^t-\nu e_j+\sqrt{\left(\beta_j^t+\nu e_j\right)^2+4\alpha\zeta_j^t}}{2\alpha}-1\\
&\geq \frac{1}{2\alpha}\sum_j e_j\left(-\beta_j^t-\nu e_j\right)-1\\
&= -\frac{1}{2\alpha}\left(\sum_j e_j\beta_j^t+\nu\sum_j e_j^2\right)-1>0.
\end{aligned}$$

We can, therefore, define

$$\nu_{\text{up}}=-\frac{2\alpha+\sum_j e_j\beta_j^t}{\sum_j e_j^2}.$$
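The two bounds of Case 2 can likewise be used to bracket the root of $h_2$. The following pure-Python sketch is only an illustration under stated assumptions and not the authors' code: the function names (`h2`, `bracket_h2`, `bisect_decreasing`), the use of bisection, the tolerance and the toy data are hypothetical. It assumes $\alpha>0$, $e_j>0$, $\zeta_j^t\geq 0$ and $\epsilon\leq\frac{1}{me_j}$ for all $j$. As in Case 1, $\nu_{\text{up}}\leq\nu_{\text{low}}$ because $h_2$ is non-increasing.

```python
import math


def h2(nu, alpha, beta, zeta, e, eps):
    """Evaluate h_2(nu) = sum_j e_j * max(x_j(nu), eps) - 1."""
    total = 0.0
    for bj, zj, ej in zip(beta, zeta, e):
        b = bj + nu * ej
        # Positive root of alpha*x^2 + b*x - zeta_j = 0.
        xj = (-b + math.sqrt(b * b + 4.0 * alpha * zj)) / (2.0 * alpha)
        total += ej * max(xj, eps)
    return total - 1.0


def bracket_h2(alpha, beta, zeta, e):
    """Return (nu_up, nu_low) such that h2(nu_up) >= 0 >= h2(nu_low)."""
    m = len(e)
    nu_low = max(
        m * zj - alpha / (m * ej * ej) - bj / ej
        for bj, zj, ej in zip(beta, zeta, e)
    )
    nu_up = -(2.0 * alpha + sum(ej * bj for ej, bj in zip(e, beta))) / sum(
        ej * ej for ej in e
    )
    return nu_up, nu_low


def bisect_decreasing(f, a, b, tol=1e-12, max_iter=200):
    """Bisection for a non-increasing f with f(a) >= 0 >= f(b) and a <= b."""
    for _ in range(max_iter):
        if b - a <= tol:
            break
        mid = 0.5 * (a + b)
        if f(mid) > 0.0:
            a = mid  # root lies to the right of mid
        else:
            b = mid  # root lies to the left of (or at) mid
    return 0.5 * (a + b)


# Toy data (hypothetical values chosen only for illustration).
alpha = 1.0
beta = [0.5, -0.2, 0.1]
zeta = [0.3, 0.4, 0.2]
e = [1.0, 1.0, 1.0]
eps = 1e-6

nu_up, nu_low = bracket_h2(alpha, beta, zeta, e)
nu_star = bisect_decreasing(lambda nu: h2(nu, alpha, beta, zeta, e, eps), nu_up, nu_low)
```

Bisection is used here only because it needs nothing beyond a valid bracket; any bracketing root finder could be substituted.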

References

  • [1] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the kurdyka-łojasiewicz inequality. Mathematics of operations research, 35(2):438–457, 2010.
  • [2] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1):459–494, 2014.
  • [3] Lev M Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR computational mathematics and mathematical physics, 7(3):200–217, 1967.
  • [4] Jean-Philippe Brunet, Pablo Tamayo, Todd R Golub, and Jill P Mesirov. Metagenes and molecular pattern discovery using matrix factorization. Proceedings of the national academy of sciences, 101(12):4164–4169, 2004.
  • [5] Stefania Cacovich, Fabio Matteocci, Mojtaba Abdi-Jalebi, Samuel D Stranks, Aldo Di Carlo, Caterina Ducati, and Giorgio Divitini. Unveiling the chemical composition of halide perovskite films using multivariate statistical analyses. ACS Applied Energy Materials, 1(12):7174–7181, 2018.
  • [6] Deng Cai, Xiaofei He, Jiawei Han, and Thomas S Huang. Graph regularized nonnegative matrix factorization for data representation. IEEE transactions on pattern analysis and machine intelligence, 33(8):1548–1560, 2010.
  • [7] Andrzej Cichocki and Anh-Huy Phan. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences, 92(3):708–721, 2009.
  • [8] Andrzej Cichocki, Rafal Zdunek, and Shun-ichi Amari. Hierarchical als algorithms for nonnegative matrix and 3d tensor factorization. In International Conference on Independent Component Analysis and Signal Separation, pages 169–176. Springer, 2007.
  • [9] Yu-Hong Dai and Yaxiang Yuan. A nonlinear conjugate gradient method with a strong global convergence property. SIAM Journal on optimization, 10(1):177–182, 1999.
  • [10] Cédric Févotte, Nancy Bertin, and Jean-Louis Durrieu. Nonnegative matrix factorization with the itakura-saito divergence: With application to music analysis. Neural computation, 21(3):793–830, 2009.
  • [11] Dan Fu, Gary Holtom, Christian Freudiger, Xu Zhang, and Xiaoliang Sunney Xie. Hyperspectral imaging with stimulated raman scattering by chirped femtosecond lasers. The Journal of Physical Chemistry B, 117(16):4634–4640, 2013.
  • [12] Nicolas Gillis. The why and how of nonnegative matrix factorization. In Regularization, Optimization, Kernels, and Support Vector Machines, pages 275–310. Chapman and Hall/CRC, 2014.
  • [13] Prem Gopalan, Jake M Hofman, and David M Blei. Scalable recommendation with poisson factorization. arXiv preprint arXiv:1311.1704, 2013.
  • [14] Filip Hanzely, Peter Richtarik, and Lin Xiao. Accelerated bregman proximal gradient methods for relatively smooth convex optimization. Computational Optimization and Applications, 79:405–440, 2021.
  • [15] Niao He, Zaid Harchaoui, Yichen Wang, and Le Song. Fast and simple optimization for poisson likelihood models. arXiv preprint arXiv:1608.01264, 2016.
  • [16] Le Thi Khanh Hien and Nicolas Gillis. Algorithms for nonnegative matrix factorization with the kullback–leibler divergence. Journal of Scientific Computing, 87(3):1–32, 2021.
  • [17] Thomas Hofmann. Probabilistic latent semantic indexing. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 50–57, 1999.
  • [18] Cho-Jui Hsieh and Inderjit S Dhillon. Fast coordinate descent methods with variable selection for non-negative matrix factorization. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1064–1072, 2011.
  • [19] BR Jany, Arkadiusz Janas, and Franciszek Krok. Retrieving the quantitative chemical information at nanoscale from scanning electron microscope energy dispersive x-ray measurements by machine learning. Nano letters, 17(11):6520–6525, 2017.
  • [20] Chi **, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, pages 1724–1732. PMLR, 2017.
  • [21] Ramakrishnan Kannan, AV Ievlev, Nouamane Laanait, Maxim A Ziatdinov, Rama K Vasudevan, Stephen Jesse, and Sergei V Kalinin. Deep data analysis via physically constrained linear unmixing: universal framework, domain examples, and a community-wide platform. Advanced Structural and Chemical Imaging, 4(1):1–20, 2018.
  • [22] Hideaki Kano, Hiroki Segawa, Masanari Okuno, Philippe Leproux, and Vincent Couderc. Hyperspectral coherent raman imaging–principle, theory, instrumentation, and applications to life sciences. Journal of Raman Spectroscopy, 47(1):116–123, 2016.
  • [23] Hyunsoo Kim and Haesun Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM journal on matrix analysis and applications, 30(2):713–730, 2008.
  • [24] **gu Kim, Yunlong He, and Haesun Park. Algorithms for nonnegative matrix and tensor factorizations: A unified view based on block coordinate descent framework. Journal of Global Optimization, 58(2):285–319, 2014.
  • [25] **gu Kim and Haesun Park. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing, 33(6):3261–3281, 2011.
  • [26] Nikos Komodakis and Jean-Christophe Pesquet. Playing with duality: An overview of recent primal? dual approaches for solving large-scale optimization problems. IEEE Signal Processing Magazine, 32(6):31–54, 2015.
  • [27] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
  • [28] Paul G Kotula, Michael R Keenan, and Joseph R Michael. Automated analysis of sem x-ray spectral images: A powerful new microanalysis tool. Microscopy and Microanalysis, 9(1):1–17, 2003.
  • [29] Daniel Lee and H. Sebastian Seung. Algorithms for non-negative matrix factorization. In T. Leen, T. Dietterich, and V. Tresp, editors, Advances in Neural Information Processing Systems, volume 13, pages 556–562. MIT Press, 2001.
  • [30] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
  • [31] Qiuwei Li, Zhihui Zhu, Gongguo Tang, and Michael B Wakin. Provable bregman-divergence based methods for nonconvex and non-lipschitz problems. arXiv preprint arXiv:1904.09712, 2019.
  • [32] Xinghua Li, Liyuan Wang, Qing Cheng, Penghai Wu, Wenxia Gan, and Lina Fang. Cloud removal in remote sensing images using nonnegative matrix factorization and error correction. ISPRS journal of photogrammetry and remote sensing, 148:103–113, 2019.
  • [33] Chih-Jen Lin. On the convergence of multiplicative update algorithms for nonnegative matrix factorization. IEEE Transactions on Neural Networks, 18(6):1589–1596, 2007.
  • [34] Chih-Jen Lin. Projected gradient methods for nonnegative matrix factorization. Neural computation, 19(10):2756–2779, 2007.
  • [35] Haihao Lu, Robert M Freund, and Yurii Nesterov. Relatively smooth convex optimization by first-order methods, and applications. SIAM Journal on Optimization, 28(1):333–354, 2018.
  • [36] Xiaoqiang Lu, Hao Wu, Yuan Yuan, **kun Yan, and Xuelong Li. Manifold regularized sparse nmf for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing, 51(5):2815–2826, 2012.
  • [37] Julien Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.
  • [38] Qiaozhu Mei and ChengXiang Zhai. Discovering evolutionary theme patterns from text: an exploration of temporal text mining. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 198–207, 2005.
  • [39] Pentti Paatero. Least squares formulation of robust non-negative factor analysis. Chemometrics and intelligent laboratory systems, 37(1):23–35, 1997.
  • [40] Ioannis Panageas, Georgios Piliouras, and Xiao Wang. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. Advances in Neural Information Processing Systems, 32, 2019.
  • [41] Meisam Razaviyayn, Mingyi Hong, and Zhi-Quan Luo. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
  • [42] Clément W Royer, Michael O’Neill, and Stephen J Wright. A newton-cg algorithm with complexity guarantees for smooth unconstrained optimization. Mathematical Programming, 180(1):451–488, 2020.
  • [43] Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM Journal on Optimization, 28(2):1448–1477, 2018.
  • [44] Joseph Salmon, Zachary Harmany, Charles-Alban Deledalle, and Rebecca Willett. Poisson noise reduction with non-local pca. Journal of mathematical imaging and vision, 48(2):279–294, 2014.
  • [45] Motoki Shiga, Kazuyoshi Tatsumi, Shunsuke Muto, Koji Tsuda, Yuta Yamamoto, Toshiyuki Mori, and Takayoshi Tanji. Sparse modeling of eels and edx spectral imaging data by nonnegative matrix factorization. Ultramicroscopy, 170:43–59, 2016.
  • [46] Ajit P Singh and Geoffrey J Gordon. Relational learning via collective matrix factorization. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 650–658, 2008.
  • [47] Paris Smaragdis and Judith C Brown. Non-negative matrix factorization for polyphonic music transcription. In 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No. 03TH8684), pages 177–180. IEEE, 2003.
  • [48] Dennis L Sun and Cédric Févotte. Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 6201–6205. IEEE, 2014.
  • [49] Leo Taslaman and Björn Nilsson. A framework for regularized non-negative matrix factorization, with application to the analysis of gene expression data. PloS one, 7(11):e46331, 2012.
  • [50] Marc Teboulle and Yakov Vaisbourd. Novel proximal gradient methods for nonnegative matrix factorization with sparsity constraints. SIAM Journal on Imaging Sciences, 13(1):381–421, 2020.
  • [51] Adrien Teurtrie, Nathanaël Perraudin, Thomas Holvoet, Hui Chen, Duncan TL Alexander, Guillaume Obozinski, and Cécile Hébert. espm: A Python library for the simulation of STEM-EDXS datasets. Ultramicroscopy, page 113719, 2023.
  • [52] Adrien Teurtrie, Nathanaël Perraudin, Thomas Holvoet, Hui Chen, Duncan TL Alexander, Guillaume Obozinski, and Cécile Hébert. From STEM-EDXS data to phase separation and quantification using physics-guided NMF. To appear, 2024.
  • [53] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109:475–494, 2001.
  • [54] Musundi B Wabuyele, Fei Yan, Guy D Griffin, and Tuan Vo-Dinh. Hyperspectral surface-enhanced Raman imaging of labeled silver nanoparticles in single cells. Review of scientific instruments, 76(6):063710, 2005.
  • [55] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.
  • [56] Felipe Yanez and Francis Bach. Primal-dual algorithms for non-negative matrix factorization with the Kullback-Leibler divergence. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2257–2261. IEEE, 2017.
  • [57] Andrew B Yankovich, Chenyu Zhang, Albert Oh, Thomas JA Slater, Feridoon Azough, Robert Freer, Sarah J Haigh, Rebecca Willett, and Paul M Voyles. Non-rigid registration and non-local principle component analysis to improve electron microscopy spectrum images. Nanotechnology, 27(36):364001, 2016.
  • [58] Minchao Ye, Yuntao Qian, and Jun Zhou. Multitask sparse nonnegative matrix factorization for joint spectral–spatial hyperspectral imagery denoising. IEEE Transactions on Geoscience and Remote Sensing, 53(5):2621–2639, 2014.
  • [59] Chenyu Zhang, Rungang Han, Anru R Zhang, and Paul M Voyles. Denoising atomic resolution 4D scanning transmission electron microscopy data with tensor singular value decomposition. Ultramicroscopy, 219:113123, 2020.
  • [60] Changzhong Zou and Youshen Xia. Restoration of hyperspectral image contaminated by Poisson noise using spectral unmixing. Neurocomputing, 275:430–437, 2018.