
Constructing structured tensor priors for Bayesian inverse problems

Funding: This publication is part of the project Sustainable learning for Artificial Intelligence from noisy large-scale data (with project number VI.Vidi.213.017), which is financed by the Dutch Research Council (NWO).

Kim Batselier, Delft Center for Systems and Control, Delft University of Technology, The Netherlands
Abstract

Specifying a prior distribution is an essential part of solving Bayesian inverse problems. The prior encodes a belief on the nature of the solution and this regularizes the problem. In this article we completely characterize a Gaussian prior that encodes the belief that the solution is a structured tensor. We first define the notion of $(\bm{A},\bm{b})$-constrained tensors and show that they describe a large variety of different structures such as Hankel, circulant, triangular, symmetric, and so on. Then we completely characterize the Gaussian probability distribution of such tensors by specifying its mean vector and covariance matrix. Furthermore, explicit expressions are proved for the covariance matrix of tensors whose entries are invariant under a permutation. These results unlock a whole new class of priors for Bayesian inverse problems. We illustrate how new kernel functions can be designed and efficiently computed and apply our results to two particular Bayesian inverse problems: completing a Hankel matrix from a few noisy measurements and learning an image classifier of handwritten digits. The effectiveness of the proposed priors is demonstrated for both problems. All applications have been implemented as reactive Pluto notebooks in Julia.

keywords:
Bayesian inverse problems, structured tensors, tensors, kernel methods
MSC codes:
15A29, 15A69, 62F15

1 Introduction

We consider a set of data samples $\{(\bm{x}_n, y_n) \,|\, \bm{x}_n \in \mathbb{R}^{D},\, y_n \in \mathbb{R}\}_{n=1}^{N}$ and the following linear forward model

\[ y_n = \langle \bm{\mathcal{P}}(\bm{x}_n), \bm{\mathcal{W}} \rangle + \epsilon_n. \tag{1} \]

Each scalar measurement $y_n$ is obtained from an inner product of a data-dependent tensor $\bm{\mathcal{P}}(\bm{x}_n) \in \mathbb{R}^{J_1 \times \cdots \times J_D}$ with a tensor of unknown latent variables $\bm{\mathcal{W}} \in \mathbb{R}^{J_1 \times \cdots \times J_D}$, corrupted by measurement noise $\epsilon_n$. Tensors in this context are $D$-dimensional arrays, with vectors $(D=1)$ and matrices $(D=2)$ being the most common cases. Vectorizing all tensors and collecting the measurements $y_1, \ldots, y_N$ into a vector $\bm{y} \in \mathbb{R}^{N}$ allows (1) to be rewritten into the linear system of equations

\[ \bm{y} = \bm{\Phi}(\bm{x})\,\bm{w} + \bm{\epsilon}. \tag{2} \]

Row $n$ of the matrix $\bm{\Phi}(\bm{x}) \in \mathbb{R}^{N \times J_1 \cdots J_D}$ contains the vectorization of the tensor $\bm{\mathcal{P}}(\bm{x}_n)$. For notational convenience the indication that $\bm{\Phi}$ depends on $\bm{x}$ is dropped from here on. The inverse problem consists of inferring the latent variables $\bm{w}$ from the noisy measurements $\bm{y}$. Inverse problems of this kind appear in many different application fields such as machine learning [6, 26, 27, 31, 32], control [2, 3, 22, 25], and signal processing [10, 13, 14, 15, 19, 20, 30]. In this article a Bayesian approach [1] is considered by assuming that $\bm{w}$ and $\bm{\epsilon}$ are random variables. The goal is then to infer the posterior distribution $p(\bm{w}|\bm{y})$ of $\bm{w}$ conditioned on the measurements $\bm{y}$ using Bayes' theorem

\[ p(\bm{w}|\bm{y}) = \frac{p(\bm{y}|\bm{w})\; p(\bm{w})}{p(\bm{y})}. \]

The distribution $p(\bm{w})$ is called the prior and encodes a belief on what $\bm{w}$ is before the measurements are known. The main contribution of this article is the complete characterization of a prior $p(\bm{w})$ that encodes the belief that the corresponding tensor $\bm{\mathcal{W}}$ is structured. A Gaussian distribution is assumed for the noise distribution $p(\bm{\epsilon}) = \mathcal{N}(\bm{0}, \bm{\Sigma})$ with mean vector $\bm{0}$ and covariance matrix $\bm{\Sigma}$, and likewise for the prior $p(\bm{w}) = \mathcal{N}(\bm{w}_0, \bm{P}_0)$. The linear forward model (2) combined with the Gaussian assumptions results in a Gaussian posterior $p(\bm{w}|\bm{y}) = \mathcal{N}(\bm{w}_+, \bm{P}_+)$ with mean vector $\bm{w}_+$ and covariance matrix $\bm{P}_+$

\begin{align}
\bm{w}_+ &= (\bm{P}_0^{-1} + \bm{\Phi}^T \bm{\Sigma}^{-1} \bm{\Phi})^{-1}\,(\bm{\Phi}^T \bm{\Sigma}^{-1} \bm{y} + \bm{P}_0^{-1} \bm{w}_0), \tag{3} \\
\bm{P}_+ &= (\bm{P}_0^{-1} + \bm{\Phi}^T \bm{\Sigma}^{-1} \bm{\Phi})^{-1}. \tag{4}
\end{align}
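As a minimal numerical sketch of (3) and (4), consider the special case of a single latent variable ($J_1 \cdots J_D = 1$) with noise covariance $\bm{\Sigma} = \sigma^2 \bm{I}$, so that every matrix inverse reduces to a scalar division. The function name `scalar_posterior` and its arguments are illustrative assumptions; they are not taken from the article or its Pluto notebooks, and Python is used here only for illustration.

```python
def scalar_posterior(phi, y, noise_var, w0, p0):
    """Posterior mean and variance from (3) and (4) for one latent variable.

    phi       : list of N regressors (the single column of Phi)
    y         : list of N measurements
    noise_var : sigma^2, assuming Sigma = sigma^2 * I (an assumption of this sketch)
    w0, p0    : prior mean and prior variance
    """
    # Posterior precision: P0^{-1} + Phi^T Sigma^{-1} Phi
    precision = 1.0 / p0 + sum(f * f for f in phi) / noise_var
    p_plus = 1.0 / precision
    # Posterior mean: P+ (Phi^T Sigma^{-1} y + P0^{-1} w0)
    w_plus = p_plus * (sum(f * t for f, t in zip(phi, y)) / noise_var + w0 / p0)
    return w_plus, p_plus


# Two measurements with unit regressors and unit noise variance.
w_plus, p_plus = scalar_posterior([1.0, 1.0], [1.0, 3.0], 1.0, w0=0.0, p0=1.0)
```

When the data lists are empty, both data-dependent sums vanish and the function returns the prior mean and variance unchanged.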

The role of the prior $p(\bm{w})$ can now be understood from (3) and (4). In the absence of data ($\bm{\Phi} = \bm{0}$ and $\bm{y} = \bm{0}$) the posterior equals the prior. In other words, the prior encodes a belief on what the solution $\bm{w}$ of (2) should be before any data is known. A natural question to ask is then what kind of prior to use. In this article we consider a prior encoding the belief that the tensor $\bm{\mathcal{W}}$ has a structure that is completely determined by a matrix $\bm{A} \in \mathbb{R}^{I \times J_1 \cdots J_D}$ and vector $\bm{b} \in \mathbb{R}^{I}$ such that

\[ \bm{A}\;\operatorname{vec}(\bm{\mathcal{W}}) = \bm{b}, \]

which we will refer to as $(\bm{A},\bm{b})$-constrained tensors. The contributions of this article are threefold.

  1. We show how the definition of $(\bm{A},\bm{b})$-constrained tensors is well-motivated since it encompasses a wide variety of relevant structured tensors. Examples are given for tensors with fixed entries, tensors with known sums of entries and symmetric, Hankel, Toeplitz, circulant, and triangular tensors.

  2. In Theorem 3.1 we completely characterize the mean vector $\bm{w}_0$ and covariance matrix $\bm{P}_0$ of the prior $p(\bm{w})$ for $(\bm{A},\bm{b})$-constrained tensors.

  3. In Theorems 4.6 and 5.1 we provide explicit expressions for $\bm{P}_0$ for $(\bm{A},\bm{b})$-constrained tensors whose entries remain invariant under a permutation $\bm{P}$. Such tensors will be called $\bm{P}$-invariant or skew-$\bm{P}$-invariant.

These three contributions are important because the prior mean $\bm{w}_0$ and covariance matrix $\bm{P}_0$ are necessary to solve the Bayesian inverse problem via equations (3) and (4). Contrary to most solution strategies for linear least squares problems, the matrix inverse of $\bm{P}_0^{-1} + \bm{\Phi}^T \bm{\Sigma}^{-1} \bm{\Phi}$ is explicitly required as it forms the posterior covariance. Also note that the dimension of the matrix to invert is $J_1 J_2 \ldots J_D \times J_1 J_2 \ldots J_D$, which limits the use of direct solvers to cases of small $J$ and $D$. Hybrid projection methods [7, 8] are a viable alternative for cases where $J$ and $D$ are prohibitively large. Another alternative is to solve the corresponding dual problem, which is described in terms of the so-called kernel matrix $\bm{\Phi}\,\bm{P}_0 \bm{\Phi}^T \in \mathbb{R}^{N \times N}$. This approach is commonly used in least-squares support vector machines [27] and Gaussian processes [32] and has a computational complexity of at least $O(N^2)$. When the tensor $\bm{\mathcal{P}}(\bm{x}_n)$ exhibits a low-rank structure, another way to lower the computational complexity of solving (3) is to impose a low-rank tensor structure on $\bm{w}_+$ and $\bm{P}_+$ [3, 21, 26]. Developing dedicated solution strategies for equations (3) and (4), however, lies outside the scope of this article.

1.1 Notation

Tensors in this article are multi-dimensional arrays with real entries. We denote scalars by italic letters $a, b, \ldots$, vectors by boldface italic letters $\bm{a}, \bm{b}, \ldots$, matrices by boldface capitalized italic letters $\bm{A}, \bm{B}, \ldots$ and higher-order tensors by boldface calligraphic italic letters $\bm{\mathcal{A}}, \bm{\mathcal{B}}, \ldots$. The vector $\bm{e}_{j_d} \in \mathbb{R}^{J_d}$ denotes a canonical basis vector that has a single nonzero unit entry at position $j_d$. The vector $\bm{1}_{J_d} \in \mathbb{R}^{J_d}$ denotes a vector of ones and $\bm{I}_{J_d} \in \mathbb{R}^{J_d \times J_d}$ is the identity matrix. The number of indices required to determine an entry of a tensor is called the order of the tensor. A $D$th order or $D$-way tensor is hence denoted $\bm{\mathcal{A}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_D}$. An index $j_d$ always satisfies $1 \leq j_d \leq J_d$, where $J_d$ is called the dimension of that particular mode. Tensor entries are denoted $w_{j_1, j_2, \cdots, j_D}$. The merger of a set of $d$ separate indices $j_1, j_2, \ldots, j_d$ is denoted by the single index

\[ \overline{j_1 j_2 \ldots j_d} = j_1 + (j_2 - 1)\,J_1 + \cdots + (j_d - 1)\,J_1 \cdots J_{d-1}. \]

For a tensor $\bm{\mathcal{W}}$ we will always denote its vectorization by $\bm{w} = \operatorname{vec}(\bm{\mathcal{W}})$. The square root matrix $\sqrt{\bm{P}}$ of $\bm{P}$ satisfies by definition $\bm{P} = \sqrt{\bm{P}}\,(\sqrt{\bm{P}})^T$.
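The merged index defined in this subsection can be sketched in a few lines of Python. The helper name `merged_index` is a hypothetical choice made for this illustration. Indices are 1-based as in the text, and the first index runs fastest, which corresponds to a column-major vectorization.

```python
def merged_index(indices, dims):
    """Return the 1-based merged index of the 1-based `indices`.

    indices : (j_1, ..., j_d)
    dims    : (J_1, ..., J_d), the dimension of each mode
    """
    if len(indices) != len(dims):
        raise ValueError("indices and dims must have the same length")
    merged, stride = 1, 1
    for j, J in zip(indices, dims):
        if not 1 <= j <= J:
            raise ValueError("each index must satisfy 1 <= j_d <= J_d")
        merged += (j - 1) * stride  # contributes (j_d - 1) * J_1 * ... * J_{d-1}
        stride *= J
    return merged


# Merged indices of the pairs (1,2), (1,3) and (2,3) for J_1 = J_2 = 3.
pairs = [merged_index(p, (3, 3)) for p in [(1, 2), (1, 3), (2, 3)]]
```

Enumerating all index tuples of a $3 \times 3 \times 3$ tensor with this helper visits each of the positions $1, \ldots, 27$ exactly once.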

2 $(\bm{A},\bm{b})$-constrained tensors

Before characterizing the prior $p(\bm{w})$ we first demonstrate the breadth of $(\bm{A},\bm{b})$-constrained tensors through eight examples. These examples demonstrate that the definition of $(\bm{A},\bm{b})$-constrained tensors is well-motivated in that it captures a wide variety of structured tensors.

2.1 Tensors with fixed entries

A tensor $\bm{\mathcal{W}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_D}$ with $I$ fixed entries can be described as $\bm{A}\,\bm{w} = \bm{b}$ where row $i$ of the matrix $\bm{A} \in \mathbb{R}^{I \times J_1 \cdots J_D}$ is a canonical basis vector $\bm{e}_{\overline{j_1 \cdots j_D}}$ that selects entry $w_{j_1, \ldots, j_D}$. The corresponding fixed numerical value of $w_{j_1, \ldots, j_D}$ is then given by $b_i$. Such fixed values are in practice usually zero, for example in triangular or banded matrices. Such structures can also be generalized to the tensor case.

Definition 2.1.

A tensor $\bm{\mathcal{W}} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_D}$ is lower (upper) triangular when $w_{j_1, j_2, \cdots, j_D} = 0$ holds for each consecutive index pair $j_d, j_{d+1}$ such that $j_d - j_{d+1} < (>)\, 0$.

The characterization of a lower (upper) triangular tensor as an $(\bm{A},\bm{b})$-constrained tensor is given in the following lemma.

Lemma 2.2.

Let $\bm{L}$ be the $J(J-1)/2 \times J^2$ matrix that has on each row a single unit entry for each particular occurrence of $j_1 - j_2 < (>)\, 0$. Lower (upper) triangular tensors are then described by

\[ \bm{A} = \begin{pmatrix} \bm{L} \otimes \bm{I}_J \otimes \cdots \otimes \bm{I}_J \\ \bm{I}_J \otimes \bm{L} \otimes \cdots \otimes \bm{I}_J \\ \vdots \\ \bm{I}_J \otimes \bm{I}_J \otimes \cdots \otimes \bm{L} \end{pmatrix} \in \mathbb{R}^{\frac{(D-1)(J-1)J^{D-1}}{2} \times J^{D}}, \]

and a vector $\bm{b} \in \mathbb{R}^{\frac{(D-1)(J-1)J^{D-1}}{2}}$ of zeros.

Proof 2.3.

The known fixed values of lower (upper) triangular tensors are zero and hence $\bm{b}$ is a vector of zeros. Each row of the matrix $\bm{A}$ has a single unit entry to select a particular tensor entry for which some consecutive indices $j_d, j_{d+1}$ satisfy $j_d - j_{d+1} < (>)\, 0$. A tensor with $D$ indices has $D-1$ consecutive index pairs and therefore $\bm{A}$ is partitioned into $D-1$ block rows. Each block row is a Kronecker product of $D-2$ identity matrices with $\bm{L}$. The Kronecker product of identity matrices generates all possible index combinations of $D-2$ index values. The $\bm{L}$ matrix factor in the Kronecker product adds the remaining 2 indices but only considers index pairs for which $j_d - j_{d+1} < (>)\, 0$.

The $\bm{A}$ matrix that describes tensors with known fixed entries in Lemma 2.2 is sparse and highly structured as demonstrated by the following example.

Example 2.4.

Consider a lower triangular tensor $\bm{\mathcal{W}} \in \mathbb{R}^{3 \times 3 \times 3}$. The condition $j_d - j_{d+1} < 0$ occurs in 3 cases $(j_d, j_{d+1}) \in \{(1,2), (1,3), (2,3)\}$. Defining the matrix $\bm{L} \in \mathbb{R}^{3 \times 9}$ with 3 nonzero entries

\[ l_{1,\overline{12}} = l_{2,\overline{13}} = l_{3,\overline{23}} = 1 \]

allows us to describe the desired $\bm{A}$ matrix as

\[ \bm{A} = \begin{pmatrix} \bm{I}_3 \otimes \bm{L} \\ \bm{L} \otimes \bm{I}_3 \end{pmatrix} \in \mathbb{R}^{18 \times 27}. \tag{5} \]

This particular sparse structure is exploited in Section 3 when a basis for the nullspace of $\bm{A}$ needs to be computed. Note that there are actually only 17 zero entries for which $j_d - j_{d+1} < 0$, which implies that the $\bm{A}$ matrix from equation (5) counts the case $j_1 = 1, j_2 = 2, j_3 = 3$ twice. This, however, does not negatively affect the resulting prior.
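The construction of Example 2.4 can be checked numerically with a short sketch. The helpers `kron`, `identity` and `strictly_less_selector` are hypothetical names written in plain Python for this illustration (matrices are lists of rows and column positions are 0-based internally); they are not taken from the article's Julia notebooks.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def strictly_less_selector(J):
    """Matrix L of Lemma 2.2 (lower triangular case): one row per pair
    (j1, j2) with j1 < j2 and a unit entry at merged index j1 + (j2 - 1) * J."""
    rows = []
    for j1 in range(1, J + 1):
        for j2 in range(j1 + 1, J + 1):
            row = [0] * (J * J)
            row[(j1 - 1) + (j2 - 1) * J] = 1  # 0-based position of the merged index
            rows.append(row)
    return rows


J = 3
L = strictly_less_selector(J)
# Stack the two block rows of equation (5).
A = kron(identity(J), L) + kron(L, identity(J))

# 0-based column selected by each row of A.
selected = [row.index(1) for row in A]
distinct = set(selected)
```

In this sketch `A` has 18 rows and 27 columns while `distinct` contains 17 positions. The repeated one is the merged index of $(j_1, j_2, j_3) = (1, 2, 3)$, in line with the remark that closes the example.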

2.2 Known sum of entries

Tensors for which the sum over all or only particular entries equals a known value are also quite common in applications. Stochastic tensors are a particular example [11, 18]. Knowing a particular sum of entries can be described as follows.

Lemma 2.5.

Tensors $\bm{\mathcal{W}} \in \mathbb{R}^{J_1 \times \cdots \times J_D}$ for which the sum over the entries of an index set $\mathcal{J}$ is a tensor $\bm{\mathcal{B}}$ are described by

\[ \bm{A}\;\bm{w} = \operatorname{vec}(\bm{\mathcal{B}}) \;\textrm{ with }\; \bm{A} = \bm{A}_D \otimes \cdots \otimes \bm{A}_1, \tag{6} \]

where each matrix $\bm{A}_d\;(d = 1, \ldots, D)$ in the Kronecker product is by definition

\[ \bm{A}_d = \begin{cases} \bm{1}_{J_d}^T & \text{if } j_d \in \mathcal{J}, \\ \bm{I}_{J_d} & \text{if } j_d \notin \mathcal{J}. \end{cases} \tag{7} \]

The Kronecker product in (6) has as its leftmost factor $d = D$ and runs towards $d = 1$ due to the opposite ordering of indices in the Kronecker product.

Proof 2.6.

With the definitions of the $\bm{A}_{d}$ matrices, the sum over the relevant entries of $\bm{\mathcal{W}}$ is written in terms of $n$-mode products [16, p. 460] as

$$\bm{\mathcal{W}}\times_{1}\bm{A}_{1}\times_{2}\cdots\times_{D}\bm{A}_{D}=\bm{\mathcal{B}}.$$

Using the vectorization operation this can be rewritten as

$$\left(\bm{A}_{D}\otimes\cdots\otimes\bm{A}_{1}\right)\bm{w}=\bm{b},$$

with $\bm{b}=\operatorname{vec}(\bm{\mathcal{B}})$, which completes the proof.

Example 2.7.

Let $\bm{W}\in\mathbb{R}^{2\times 3}$ be a matrix for which each row sum equals 1. Lemma 2.5 then implies that

$$\bm{A}=\begin{pmatrix}1&1&1\end{pmatrix}\otimes\begin{pmatrix}1&0\\0&1\end{pmatrix},\quad\bm{b}=\bm{1}_{2}.$$
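As a quick sanity check of Example 2.7, the following sketch builds $\bm{A}$ with a hand-written Kronecker product and applies it to the column-major vectorization of a matrix whose rows sum to 1. The helper names `kron`, `vec` and `matvec` and the test matrix are illustrative assumptions and are not taken from the accompanying notebooks.

```python
# Illustrative check of Example 2.7 (helper names and test matrix are hypothetical).

def kron(X, Y):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in xrow for y in yrow] for xrow in X for yrow in Y]

def vec(W):
    """Column-major vectorization, matching the vec operator used in the text."""
    return [W[i][j] for j in range(len(W[0])) for i in range(len(W))]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# A = A_2 (x) A_1 with A_2 = 1_3^T (sum over the column index) and A_1 = I_2.
A = kron([[1, 1, 1]], [[1, 0], [0, 1]])

# A 2 x 3 matrix whose rows both sum to 1.
W = [[1, -2, 2],
     [3, 0, -2]]

b = matvec(A, vec(W))
print(A)  # the 2 x 6 constraint matrix
print(b)  # the row sums of W
```

Because the row index is the first (fastest) index of the vectorization, the identity factor sits on the right of the Kronecker product and the all-ones row on the left.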

2.3 Eigenvector structure

Tensors whose vectorization is an eigenvector of a matrix $\bm{P}$ with eigenvalue $\lambda$ are described by the constraint $\bm{A}=\lambda\,\bm{I}-\bm{P}$ and $\bm{b}=\bm{0}$. An important structure in this article is obtained when $\bm{P}$ is a permutation matrix. Indeed, $\bm{P}\,\bm{w}=\bm{w}$ then implies that the entries of $\bm{\mathcal{W}}$ remain invariant under the permutation $\bm{P}$. The distinction between $\lambda=1$ and $\lambda=-1$ is made explicit through the following two definitions.

Definition 2.8.

Let $\bm{P}\in\mathbb{R}^{J^{D}\times J^{D}}$ be a permutation matrix. A $\bm{P}$-invariant tensor $\bm{\mathcal{W}}$ is defined by

$$\left(\bm{I}-\bm{P}\right)\bm{w}=\bm{0}\;\Leftrightarrow\;\bm{P}\,\bm{w}=\bm{w}.$$

Likewise, a skew-$\bm{P}$-invariant tensor $\bm{\mathcal{W}}$ satisfies by definition

$$\left(-\bm{I}-\bm{P}\right)\bm{w}=\bm{0}\;\Leftrightarrow\;\bm{P}\,\bm{w}=-\bm{w}.$$

In this way, every permutation matrix $\bm{P}$ defines a corresponding structured tensor. Next we discuss some prominent examples of $\bm{P}$-invariant tensor structures.

Definition 2.9.

(Cyclic Symmetric tensor [4]) The cyclic index shift permutation matrix $\bm{C}$ of a $D$-way tensor $\bm{\mathcal{W}}$ is the $J^{D}\times J^{D}$ permutation matrix

$$\bm{C}=\begin{pmatrix}\bm{I}(1:J^{D-1}:J^{D},:)\\ \bm{I}(2:J^{D-1}:J^{D},:)\\ \vdots\\ \bm{I}(J^{D-1}:J^{D-1}:J^{D},:)\end{pmatrix},$$

where $\bm{I}$ is the $J^{D}\times J^{D}$ identity matrix and Matlab colon notation is used to denote submatrices. A $\bm{C}$-invariant tensor $\bm{\mathcal{W}}$ is then called a cyclic symmetric tensor.

Defining the vector $\tilde{\bm{w}}:=\bm{C}\,\operatorname{vec}(\bm{\mathcal{W}})$, it can be verified that

$$\tilde{w}_{j_{D},j_{1},\ldots,j_{D-1}}=w_{j_{1},\ldots,j_{D-1},j_{D}}.$$

In other words, $\bm{C}$ performs a cyclic shift of the indices to the right. When $D=2$, $\bm{C}$ uniquely defines $J\times J$ symmetric matrices $\bm{W}$, since the cyclic index shift property implies that $\tilde{w}_{j_{2},j_{1}}=w_{j_{1},j_{2}}$ [29]. The case $D>2$ does not result in a fully symmetric tensor since, for example, the index permutation $j_{1},j_{2},j_{3}\rightarrow j_{1},j_{3},j_{2}$ required for full symmetry is not enforced by $\bm{C}$. $\bm{C}$-invariance is therefore a weaker constraint than full symmetry.
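The index shift can be checked on small cases. The sketch below assumes a cubical tensor of dimension $J$, 0-based indices, a column-major vectorization, and a stride of $J^{D-1}$ in the colon notation. The function names are hypothetical.

```python
# Sketch of the cyclic index shift permutation (0-based indices; names are hypothetical).
from itertools import product

def cyclic_shift_rows(J, D):
    """Row selection of C: rows s, s + J^(D-1), s + 2 J^(D-1), ... of the identity,
    for s = 0, ..., J^(D-1) - 1, mirroring the colon notation s:J^(D-1):J^D."""
    stride = J ** (D - 1)
    return [s + stride * t for s in range(stride) for t in range(J)]

def apply_permutation(rows, w):
    """Compute C w, where row i of C is row rows[i] of the identity."""
    return [w[r] for r in rows]

def linear_index(idx, J):
    """Column-major position of the entry with 0-based indices idx = (j1, ..., jD)."""
    return sum(j * J ** d for d, j in enumerate(idx))

# D = 2: C maps vec(W) to vec(W^T), so C-invariance is ordinary symmetry.
J = 3
W = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
w = [W[i][j] for j in range(J) for i in range(J)]    # vec(W)
w_t = [W[j][i] for j in range(J) for i in range(J)]  # vec(W^T)
C2 = cyclic_shift_rows(J, 2)
shifted = apply_permutation(C2, w)

# D = 3: check w~_{j3, j1, j2} = w_{j1, j2, j3} for every index triple.
J3 = 2
w3 = list(range(J3 ** 3))
w3_tilde = apply_permutation(cyclic_shift_rows(J3, 3), w3)
shift_ok = all(
    w3_tilde[linear_index((j3, j1, j2), J3)] == w3[linear_index((j1, j2, j3), J3)]
    for j1, j2, j3 in product(range(J3), repeat=3)
)
print(shifted == w_t, shift_ok)
```

For $D=2$ the permutation is the familiar commutation matrix, so a matrix is $\bm{C}$-invariant exactly when it equals its transpose.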

Definition 2.10.

(Symmetric tensor) Let $\bm{S}$ be the permutation matrix such that all entries of $\tilde{\bm{w}}:=\bm{S}\,\operatorname{vec}(\bm{\mathcal{W}})$ satisfy $\tilde{w}_{j_{1},\ldots,j_{D}}=w_{\pi(j_{1},\ldots,j_{D})}$, where $\pi(j_{1},\ldots,j_{D})$ is any permutation of the indices. An $\bm{S}$-invariant tensor $\bm{\mathcal{W}}$ is by definition a symmetric tensor.

Definition 2.11.

(Centrosymmetric tensor [4]) A $\bm{J}$-invariant tensor $\bm{\mathcal{W}}$, where $\bm{J}$ is the column-reversed identity matrix, is called a centrosymmetric tensor.

A centrosymmetric tensor $\bm{\mathcal{W}}$ satisfies

$$w_{j_{1},\ldots,j_{D}}=w_{J_{1}-j_{1}+1,\ldots,J_{D}-j_{D}+1}.$$

Probably the most famous tensor that exhibits centrosymmetry is the matrix-matrix multiplication tensor [9].

Definition 2.12.

(Hankel Tensor) Let $\bm{H}\in\mathbb{R}^{J^{D}\times J^{D}}$ be the permutation matrix that cyclically permutes all index tuples $(j_{1},\ldots,j_{D})$ that share the same index sum $j_{1}+\cdots+j_{D}$. An $\bm{H}$-invariant tensor $\bm{\mathcal{W}}$ is called a Hankel tensor.

The minimal index sum is $D=1+1+\cdots+1$ and the maximal index sum is $JD=J+J+\cdots+J$. This implies that $\bm{H}$ consists of $JD-D+1$ permutation cycles, one per index sum, so that the space of $\bm{H}$-invariant vectors, that is the nullspace of $\bm{I}-\bm{H}$, has dimension $JD-D+1$.
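The count $JD-D+1$ can be confirmed by brute-force enumeration. The sketch below groups all index tuples of a cubical $D$-way tensor by their index sum. Each group corresponds to one cycle and hence to one free entry of a Hankel tensor. The function name and the values of $J$ and $D$ are arbitrary choices made for illustration.

```python
# Enumerate index sums of a cubical D-way tensor of dimension J (illustrative helper names).
from itertools import product

def index_sum_classes(J, D):
    """Group all 1-based index tuples (j1, ..., jD) by their index sum."""
    classes = {}
    for idx in product(range(1, J + 1), repeat=D):
        classes.setdefault(sum(idx), []).append(idx)
    return classes

J, D = 4, 3
classes = index_sum_classes(J, D)
print(len(classes), J * D - D + 1)  # number of index sums versus the formula
print(min(classes), max(classes))   # smallest and largest index sum
```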

Definition 2.13.

(Toeplitz Tensor) Let $\bm{T}\in\mathbb{R}^{J^{D}\times J^{D}}$ be the permutation matrix that cyclically permutes all indices $j_{d}\mapsto j_{d}+1$, where $J_{d}+1\mapsto 1\;(d=1,\ldots,D)$. A $\bm{T}$-invariant tensor $\bm{\mathcal{W}}$ is called a Toeplitz tensor.

A special case of a Toeplitz tensor is a circulant tensor.

Definition 2.14.

(Circulant Tensor) Let $\bm{T}\in\mathbb{R}^{J^{D}\times J^{D}}$ be the permutation matrix that cyclically permutes all indices $j_{d}\mapsto\operatorname{mod}(j_{d}+1,J_{d})\neq 0$. If $\operatorname{mod}(j_{d}+1,J_{d})=0$ then $j_{d}\mapsto J_{d}\;(d=1,\ldots,D)$. A $\bm{T}$-invariant tensor $\bm{\mathcal{W}}$ is called a circulant tensor.

3 Full characterization of the prior distribution

In this section the Gaussian prior $p(\bm{w})$ for $(\bm{A},\bm{b})$-constrained tensors is fully characterized. We also discuss how a block-row partitioning of $\bm{A}$ allows the square root covariance matrix $\sqrt{\bm{P}_{0}}$ to be computed without explicitly constructing the matrix $\bm{A}$.

Theorem 3.1.

The Gaussian distribution $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ of $(\bm{A},\bm{b})$-constrained tensors is described by a mean vector $\bm{w}_{0}$ such that $\bm{A}\,\bm{w}_{0}=\bm{b}$ and by a covariance matrix $\bm{P}_{0}$ such that the columns of $\sqrt{\bm{P}_{0}}$ span the right nullspace of $\bm{A}$.

Proof 3.2.

Let $\bm{x}\in\mathbb{R}^{J_{1}\cdots J_{D}}$ be a sample of the standard normal distribution $\mathcal{N}(\bm{0},\bm{I})$. A sample $\bm{w}$ of the desired Gaussian distribution is then

$$\bm{w}=\bm{w}_{0}+\sqrt{\bm{P}_{0}}\,\bm{x},$$

where $\sqrt{\bm{P}_{0}}$ is the matrix square root of the covariance matrix $\bm{P}_{0}$. Any sample $\bm{w}$ being an $(\bm{A},\bm{b})$-constrained tensor implies

$$\bm{A}\,\bm{w}=\bm{A}\,\bm{w}_{0}+\bm{A}\,\sqrt{\bm{P}_{0}}\,\bm{x}=\bm{b}.\tag{8}$$

Equation (8) holds for all random samples $\bm{x}$ if and only if

$$\begin{aligned}\bm{A}\,\bm{w}_{0}&=\bm{b},\\ \bm{A}\,\sqrt{\bm{P}_{0}}&=\bm{0}.\end{aligned}$$

In other words, the mean $\bm{w}_{0}$ of the prior also has to satisfy the linear constraint and the columns of $\sqrt{\bm{P}_{0}}$ span the right nullspace of $\bm{A}$.
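The two conditions can be illustrated on the row-sum constraint of Example 2.7. In the sketch below the square-root factor is taken to be a hand-picked matrix `V` whose columns form a basis of the nullspace of $\bm{A}$, so that $\bm{P}_{0}=\bm{V}\bm{V}^{T}$. This particular basis, the mean `w0` and the random seed are assumptions made purely for illustration.

```python
# Sketch: samples w = w0 + V x satisfy A w = b when A w0 = b and A V = 0.
# V is a hand-picked nullspace basis for the row-sum constraint of Example 2.7 (an assumption).
import random

A = [[1, 0, 1, 0, 1, 0],
     [0, 1, 0, 1, 0, 1]]
b = [1, 1]

# Mean: vec of [[1, 0, 0], [1, 0, 0]], whose rows sum to 1.
w0 = [1, 1, 0, 0, 0, 0]

# Columns of V span the right nullspace of A (each column has zero row sums).
V = [[-1, 0, -1, 0],
     [0, -1, 0, -1],
     [1, 0, 0, 0],
     [0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1]]

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]      # x ~ N(0, I)
w = [m + d for m, d in zip(w0, matvec(V, x))]    # w = w0 + V x
residual = [r - bi for r, bi in zip(matvec(A, w), b)]
print(residual)  # should vanish up to rounding
```

Whatever value `x` takes, the perturbation `V x` stays inside the nullspace of `A`, so the constraint is carried entirely by the mean.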

3.1 Recursive nullspace computation

When the matrix $\bm{A}$ is too large to construct explicitly, it is beneficial to compute a basis for its right nullspace recursively. This is possible when considering a partitioning into $S$ block-rows $\bm{A}=\begin{pmatrix}\bm{A}_{1}^{T}&\bm{A}_{2}^{T}&\ldots&\bm{A}_{S}^{T}\end{pmatrix}^{T}$. Algorithm 1 recursively computes a basis for this nullspace without ever explicitly constructing $\bm{A}$, using Theorem 6.4.1 from [12, p. 329].

Algorithm 1 Compute a basis $\bm{V}_{2}$ for the nullspace of a block-row partitioned matrix $\bm{A}$
Require: $\bm{A}_{1},\bm{A}_{2},\ldots,\bm{A}_{S}$
  $\bm{V}_{2}\leftarrow\operatorname{null}(\bm{A}_{1})$
  for $s=2:S$ do
     $\bm{Z}_{s}\leftarrow\operatorname{null}(\bm{A}_{s}\,\bm{V}_{2})$
     $\bm{V}_{2}\leftarrow\bm{V}_{2}\,\bm{Z}_{s}$
  end for
  return $\bm{V}_{2}$
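A minimal sketch of Algorithm 1 follows, written in exact rational arithmetic so that the result can be checked without rounding. The `null` routine used here is a plain Gauss-Jordan elimination. That choice is an assumption made for self-containedness, and an SVD-based routine returning an orthonormal basis would be the more usual numerical choice. All function names are hypothetical.

```python
# Sketch of Algorithm 1 in exact rational arithmetic (helper names are hypothetical).
from fractions import Fraction

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def null(A):
    """Basis of the right nullspace of A, returned as the columns of a matrix.
    Uses Gauss-Jordan elimination; an SVD would give an orthonormal basis instead."""
    n = len(A[0])
    R = [[Fraction(v) for v in row] for row in A]
    pivots, r = [], 0
    for c in range(n):
        p = next((i for i in range(r, len(R)) if R[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column: it is a free variable
        R[r], R[p] = R[p], R[r]
        R[r] = [v / R[r][c] for v in R[r]]
        for i in range(len(R)):
            if i != r and R[i][c] != 0:
                R[i] = [vi - R[i][c] * vr for vi, vr in zip(R[i], R[r])]
        pivots.append(c)
        r += 1
    free = [c for c in range(n) if c not in pivots]
    basis = []
    for f in free:
        v = [Fraction(0)] * n
        v[f] = Fraction(1)
        for k, c in enumerate(pivots):
            v[c] = -R[k][f]
        basis.append(v)
    return [[vec[i] for vec in basis] for i in range(n)]

def recursive_null(blocks):
    """Algorithm 1: nullspace basis of the stacked blocks without forming the stack."""
    V2 = null(blocks[0])
    for A_s in blocks[1:]:
        Z_s = null(matmul(A_s, V2))
        V2 = matmul(V2, Z_s)
    return V2

# 2 x 3 matrices with zero row sums (A_1) and zero column sums (A_2).
A_1 = [[1, 0, 1, 0, 1, 0], [0, 1, 0, 1, 0, 1]]
A_2 = [[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0], [0, 0, 0, 0, 1, 1]]
V2 = recursive_null([A_1, A_2])
print(len(V2), len(V2[0]))  # size of the basis matrix
```

Each pass only needs the product $\bm{A}_{s}\bm{V}_{2}$, whose column count shrinks as constraints accumulate, which is what makes the block-row approach attractive when the stacked $\bm{A}$ is large.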

4 Explicit covariance matrix construction for permutation-invariant tensors

Computing the covariance matrix $\bm{P}_{0}$ via Theorem 3.1 requires a basis for the nullspace of $\bm{A}$. For $\bm{P}$-invariant tensors it is possible to derive an explicit formula for $\bm{P}_{0}$ as a function of the permutation matrix $\bm{P}$, which enables efficient sampling of the prior. Before we can state the main result in Theorem 4.6, we first need to discuss some facts about permutation matrices. An important concept tied to a permutation matrix is its order. Any permutation can be written as a product of disjoint cycles. Each cycle has a particular length, also called the order of the cycle. In this article $K$ will denote the least common multiple of all orders of disjoint cycles of a given permutation, which coincides with the order of the permutation matrix as defined next.

Definition 4.1.

The order $K\in\mathbb{N}$ of a permutation matrix $\bm{P}$ is defined as the smallest natural number such that $\bm{P}^{K}=\bm{I}$.

Skew-$\bm{P}$-invariant structures always have an even order $K$.

Lemma 4.2.

A skew-$\bm{P}$-invariant structure has an even order $K$.

Proof 4.3.

Skew-$\bm{P}$-invariance requires by definition that $\lambda=-1$. From $\bm{P}^{K}\,\bm{w}=\bm{I}\,\bm{w}=(-1)^{K}\,\bm{w}$ it follows for any nonzero skew-$\bm{P}$-invariant $\bm{w}$ that $(-1)^{K}=1$, so $K$ must be even.

Theorem 4.6 will express the desired covariance matrix $\bm{P}_{0}$ as a function of powers of the permutation matrix $\bm{P}$. The following two lemmas relating powers of permutation matrices are easily proved.

Lemma 4.4.

Let $\bm{P}$ be a permutation matrix of order $K$, then for any $1\leq k\leq K$:

$$\bm{P}^{k}=\bm{P}^{K+k}.\tag{9}$$

Lemma 4.5.

Let $\bm{P}$ be a permutation matrix of order $K$, then for any $1\leq k\leq K$:

$$\bm{P}^{K-k}=\left(\bm{P}^{k}\right)^{T}.\tag{10}$$

Lemma 4.4 follows from $\bm{P}^{K}=\bm{I}$. Lemma 4.5 follows from the orthogonality of permutation matrices and from the fact that powers of permutation matrices are still permutation matrices. We now have all ingredients to state the main result, which provides an analytic expression for the covariance matrix $\bm{P}_{0}$ as an average over powers of the permutation matrix $\bm{P}$.
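Before moving on, Lemmas 4.4 and 4.5 can be checked numerically on a small case. The sketch below uses a hypothetical permutation of five elements made of one 3-cycle and one 2-cycle, so that its order is $K=6$. The helper names are illustrative.

```python
# Numerical check of Lemmas 4.4 and 4.5 for a small permutation (names are hypothetical).

def perm_matrix(perm):
    """Permutation matrix with a one in position (i, perm[i])."""
    n = len(perm)
    return [[int(perm[i] == j) for j in range(n)] for i in range(n)]

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def powers(P, upto):
    """Return [P^0, P^1, ..., P^upto]."""
    n = len(P)
    out = [[[int(i == j) for j in range(n)] for i in range(n)]]
    for _ in range(upto):
        out.append(matmul(out[-1], P))
    return out

# One 3-cycle and one 2-cycle, so the order is K = lcm(3, 2) = 6.
P = perm_matrix([1, 2, 0, 4, 3])
identity = powers(P, 0)[0]
K = next(k for k in range(1, 100) if powers(P, k)[k] == identity)
Pk = powers(P, 2 * K)

lemma_4_4 = all(Pk[k] == Pk[K + k] for k in range(1, K + 1))
lemma_4_5 = all(Pk[K - k] == transpose(Pk[k]) for k in range(1, K + 1))
print(K, lemma_4_4, lemma_4_5)
```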

Theorem 4.6.

Let $\bm{P}$ be a permutation matrix of order $K$. The Gaussian distribution $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ of $\bm{P}$-invariant tensors is described by a mean vector $\bm{w}_{0}$ that is $\bm{P}$-invariant and covariance matrix

$$\bm{P}_{0}=\frac{\bm{P}+\bm{P}^{2}+\cdots+\bm{P}^{K}}{K}.\tag{11}$$
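As a small worked instance of (11), added here for illustration, take $D=2$ and $J=2$ with $\bm{P}$ equal to the cyclic index shift permutation $\bm{C}$ of Definition 2.9. This $\bm{P}$ swaps the two off-diagonal entries of a $2\times 2$ matrix and therefore has order $K=2$:

```latex
\bm{P}=\begin{pmatrix}1&0&0&0\\0&0&1&0\\0&1&0&0\\0&0&0&1\end{pmatrix},\qquad
\bm{P}_{0}=\frac{\bm{P}+\bm{P}^{2}}{2}=\frac{\bm{P}+\bm{I}}{2}
=\begin{pmatrix}1&0&0&0\\0&\tfrac{1}{2}&\tfrac{1}{2}&0\\0&\tfrac{1}{2}&\tfrac{1}{2}&0\\0&0&0&1\end{pmatrix}.
```

Applying this $\bm{P}_{0}$ to $\bm{x}=(x_{1},x_{2},x_{3},x_{4})^{T}$ gives $(x_{1},\tfrac{x_{2}+x_{3}}{2},\tfrac{x_{2}+x_{3}}{2},x_{4})^{T}$, which is the vectorization of a symmetric $2\times 2$ matrix.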

The $\bm{P}$-invariance of the mean $\bm{w}_{0}$ follows directly from Theorem 3.1. The proof of Theorem 4.6 therefore requires showing that $\bm{P}_{0}$ in (11) is the desired covariance matrix. A matrix $\bm{P}_{0}$ is a covariance matrix if it satisfies the following three sufficient conditions. The matrix

  1. has positive diagonal entries,

  2. is symmetric,

  3. is positive (semi-)definite.

Short proofs will now be given for each of these three covariance conditions.

Lemma 4.7.

The matrix $\bm{P}_{0}$ has positive diagonal entries.

Proof 4.8.

$\bm{P}_{0}$ is by definition a scaled sum of permutation matrices, so all of its diagonal entries are either zero or positive. Since the sum contains $\bm{P}^{K}=\bm{I}$, every diagonal entry is at least $1/K$ and hence positive.

Lemma 4.9.

The matrix $\bm{P}_{0}$ is symmetric.

Proof 4.10.

The symmetry of $\bm{P}_{0}$ follows from

$$\begin{aligned}\bm{P}_{0}^{T}&=\frac{\bm{P}^{T}+(\bm{P}^{2})^{T}+\cdots+(\bm{P}^{K-1})^{T}+(\bm{P}^{K})^{T}}{K},\\ &=\frac{\bm{P}^{K-1}+\bm{P}^{K-2}+\cdots+\bm{P}+\bm{P}^{K}}{K},\\ &=\bm{P}_{0},\end{aligned}$$

where the second line follows from Lemma 4.5.

The positive semi-definiteness of $\bm{P}_{0}$ follows from its idempotency.

Lemma 4.11.

The matrix $\bm{P}_{0}$ is idempotent, that is $\bm{P}_{0}^{2}=\bm{P}_{0}$.

Proof 4.12.

Writing out $(K\bm{P}_{0})^{2}$ in terms of $\bm{P}$ and applying Lemma 4.4 results in

$$\begin{aligned}&(\bm{P}+\bm{P}^{2}+\cdots+\bm{P}^{K})^{2}\\ &=\bm{P}^{2}+2\,\bm{P}^{3}+\cdots+(K-1)\,\bm{P}^{K}+K\,\bm{P}^{K+1}+(K-1)\,\bm{P}^{K+2}+\cdots+2\,\bm{P}^{2K-1}+\bm{P}^{2K},\\ &=K\,\bm{P}+\underbrace{\bm{P}^{2}+(K-1)\,\bm{P}^{K+2}}_{K\,\bm{P}^{2}}+\cdots+\underbrace{2\,\bm{P}^{2K-1}+(K-2)\,\bm{P}^{K-1}}_{K\,\bm{P}^{K-1}}+\underbrace{(K-1)\,\bm{P}^{K}+\bm{P}^{2K}}_{K\,\bm{P}^{K}},\\ &=K\,(\bm{P}+\bm{P}^{2}+\bm{P}^{3}+\cdots+\bm{P}^{K}),\\ &=K^{2}\,\bm{P}_{0},\end{aligned}$$

which, after division by $K^{2}$, proves that $\bm{P}_{0}$ is idempotent.
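The properties established so far can be verified exactly for a concrete permutation. The sketch below forms $\bm{P}_{0}$ as the average of the powers $\bm{P},\ldots,\bm{P}^{K}$ using rational arithmetic and then tests positivity of the diagonal, symmetry and idempotency. The example permutation and all helper names are assumptions chosen for illustration.

```python
# Exact check of the covariance properties of P0 for one small permutation
# (helper names and the example permutation are hypothetical).
from fractions import Fraction

def perm_matrix(perm):
    """Permutation matrix with a one in position (i, perm[i])."""
    n = len(perm)
    return [[int(perm[i] == j) for j in range(n)] for i in range(n)]

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]

def averaged_powers(P):
    """Return (P0, K) with P0 = (P + P^2 + ... + P^K) / K and K the order of P."""
    n = len(P)
    identity = [[int(i == j) for j in range(n)] for i in range(n)]
    total = [[0] * n for _ in range(n)]
    power, K = identity, 0
    while True:
        power = matmul(power, P)
        K += 1
        total = [[t + p for t, p in zip(trow, prow)] for trow, prow in zip(total, power)]
        if power == identity:  # reached P^K = I
            break
    return [[Fraction(t, K) for t in row] for row in total], K

P = perm_matrix([1, 2, 0, 4, 3])  # a 3-cycle and a 2-cycle
P0, K = averaged_powers(P)

positive_diagonal = all(P0[i][i] > 0 for i in range(len(P0)))
symmetric = P0 == [list(col) for col in zip(*P0)]
idempotent = matmul(P0, P0) == P0
print(K, positive_diagonal, symmetric, idempotent)
```

For this permutation $\bm{P}_{0}$ is block diagonal, with a $3\times 3$ block of entries $1/3$ and a $2\times 2$ block of entries $1/2$, one block per cycle.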

The first consequence of $\bm{P}_{0}$ being idempotent is that it is positive semi-definite.

Lemma 4.13.

The matrix $\bm{P}_{0}$ is positive semi-definite.

Proof 4.14.

The two eigenvalue equations

$$\bm{P}_{0}\,\bm{v}=\lambda\,\bm{v},\qquad(\bm{P}_{0})^{2}\,\bm{v}=\lambda^{2}\,\bm{v}$$

have equal left-hand sides because $\bm{P}_{0}$ is idempotent. It therefore follows that $\lambda^{2}-\lambda=0$, which implies that the eigenvalues are either 0 or 1. Together with the symmetry of $\bm{P}_{0}$, this proves the positive semi-definiteness of $\bm{P}_{0}$.

Having proved that $\bm{P}_0$ is a covariance matrix, it remains to show that samples drawn from $\mathcal{N}(\bm{w}_0,\bm{P}_0)$ are $\bm{P}$-invariant. From its symmetry and idempotency it follows that $\bm{P}_0$ is its own matrix square root: $\bm{P}_0=\sqrt{\bm{P}_0}=\bm{P}_0^{T}=\sqrt{\bm{P}_0}^{T}$.

Lemma 4.15.

Every sample $\bm{w}$ drawn from $\mathcal{N}(\bm{w}_0,\bm{P}_0)$ is $\bm{P}$-invariant.

Proof 4.16.

A sample $\bm{w}$ from $\mathcal{N}(\bm{w}_0,\bm{P}_0)$ can be drawn by computing

\begin{align*}
\bm{w} = \bm{w}_{0} + \sqrt{\bm{P}_{0}}\,\bm{x},
\end{align*}

where $\bm{x}$ is drawn from a standard normal distribution $\mathcal{N}(\bm{0},\bm{I})$. The $\bm{P}$-invariance of $\bm{w}$ follows from

\begin{align*}
\bm{w} &= \bm{P}\,\bm{w},\\
\bm{w}_{0} + \sqrt{\bm{P}_{0}}\,\bm{x} &= \bm{P}\,\bm{w}_{0} + \bm{P}\,\sqrt{\bm{P}_{0}}\,\bm{x},\\
\sqrt{\bm{P}_{0}}\,\bm{x} &= \bm{P}\,\sqrt{\bm{P}_{0}}\,\bm{x},\\
\sqrt{\bm{P}_{0}}\,\bm{x} &= \bm{P}\,\left(\frac{\bm{P}+\bm{P}^{2}+\cdots+\bm{P}^{K-1}+\bm{P}^{K}}{K}\right)\bm{x},\\
\sqrt{\bm{P}_{0}}\,\bm{x} &= \left(\frac{\bm{P}^{2}+\bm{P}^{3}+\cdots+\bm{P}^{K}+\bm{P}}{K}\right)\bm{x},\\
\sqrt{\bm{P}_{0}}\,\bm{x} &= \sqrt{\bm{P}_{0}}\,\bm{x}.
\end{align*}

The terms that depend on $\bm{w}_0$ cancel due to the $\bm{P}$-invariance of $\bm{w}_0$. Lemma 4.4 is used to go from line 4 to line 5.

Lemmas 4.7 up to 4.15 constitute the proof of Theorem 4.6. Another consequence of the idempotency of $\bm{P}_0$ is that this matrix is its own pseudoinverse.

Lemma 4.17.

The pseudoinverse $\bm{P}_0^{\dagger}$ satisfies

\begin{align*}
\bm{P}_{0}^{\dagger} = \bm{P}_{0}.
\end{align*}

Proof 4.18.

The pseudoinverse $\bm{P}_0^{\dagger}$ needs to satisfy the following four properties:

  1. $\bm{P}_0\bm{P}_0^{\dagger}\bm{P}_0=\bm{P}_0$,

  2. $\bm{P}_0^{\dagger}\bm{P}_0\bm{P}_0^{\dagger}=\bm{P}_0^{\dagger}$,

  3. $(\bm{P}_0\bm{P}_0^{\dagger})^{T}=\bm{P}_0\bm{P}_0^{\dagger}$,

  4. $(\bm{P}_0^{\dagger}\bm{P}_0)^{T}=\bm{P}_0^{\dagger}\bm{P}_0$.

All these properties are satisfied when assuming $\bm{P}_0^{\dagger}=\bm{P}_0$ and they follow from the idempotency of $\bm{P}_0$. For example, Properties 1 and 2 follow from

\begin{align*}
\bm{P}_{0}\bm{P}_{0}^{\dagger}\bm{P}_{0} = \bm{P}_{0}^{\dagger}\bm{P}_{0}\bm{P}_{0}^{\dagger} = (\bm{P}_{0})^{3} = \bm{P}_{0} = \bm{P}_{0}^{\dagger}.
\end{align*}

Properties 3 and 4 follow from the symmetry of $\bm{P}_0$.

The fact that $\bm{P}_0=\sqrt{\bm{P}_0}=\bm{P}_0^{\dagger}=\sqrt{\bm{P}_0^{\dagger}}$ is convenient for several reasons. First, no explicit $\bm{P}_0^{-1}$ computation is required in equations (3) and (4). Second, sampling $\mathcal{N}(\bm{w}_0,\bm{P}_0)$ can be done without a matrix square-root computation and without any matrix-vector multiplications. Using Theorem 4.6 the product $\sqrt{\bm{P}_0}\,\bm{x}=\bm{P}_0\,\bm{x}$ can be implemented as a weighted sum of permuted versions of $\bm{x}$

\begin{align*}
\frac{\bm{P}\,\bm{x}+\bm{P}^{2}\,\bm{x}+\cdots+\bm{P}^{K}\,\bm{x}}{K}.
\end{align*}

All information of the permutation $\bm{P}$ is contained in a vector $\bm{p}$ of $I^{D}$ elements that specifies how each entry gets mapped to the next. Each term $\bm{P}^{k}\,\bm{x}$ of the weighted sum is then computed by successive permutations of $\bm{x}$ according to $\bm{p}$ with computational complexity $O(I^{D})$. The pseudocode for sampling the distribution is given in Algorithm 2.

Algorithm 2 Generate $\bm{P}$-invariant sample from $\mathcal{N}(\bm{w}_0,\bm{P}_0)$
Require: $\bm{w}_0$, index permutation vector $\bm{p}$
  $\bm{x}\leftarrow\textrm{randn}(I^{D})$ % sample standard normal $\mathcal{N}(\bm{0},\bm{I})$
  $\bm{w}\leftarrow K\,\bm{w}_0$
  for $k=1:K$ do
     $\bm{w}\leftarrow\bm{w}+\bm{x}$
     $\bm{x}\leftarrow\bm{x}[\bm{p}]$ % permute entries of $\bm{x}$ according to $\bm{p}$
  end for
  $\bm{w}\leftarrow\frac{\bm{w}}{K}$
  return $\bm{w}$
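As an illustration, the steps of Algorithm 2 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the notebook implementation: the function name `sample_invariant`, the list-based vectors, the explicit argument `K`, and the convention that `x[p]` means `[x[i] for i in p]` are all choices made here for illustration.

```python
import random

def sample_invariant(w0, p, K, rng=random):
    """Follow Algorithm 2: average K successive permutations of a
    standard-normal draw and add the (P-invariant) mean w0.

    w0 : list of floats, assumed P-invariant
    p  : index permutation, read as (P x)[i] = x[p[i]]  (assumed convention)
    K  : order of the permutation, i.e. applying p K times is the identity
    """
    n = len(w0)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]    # x ~ N(0, I)
    w = [K * w0_i for w0_i in w0]                  # w <- K * w0
    for _ in range(K):
        w = [w_i + x_i for w_i, x_i in zip(w, x)]  # w <- w + x
        x = [x[i] for i in p]                      # x <- x[p]
    return [w_i / K for w_i in w]                  # w <- w / K

# Hypothetical example: a 3-cycle on entries 0, 1, 2 and a fixed entry 3.
p = [1, 2, 0, 3]
w = sample_invariant([1.0, 1.0, 1.0, 5.0], p, K=3)
```

Each pass costs one vector addition and one index permutation, so no matrix is ever formed. Entries on a common cycle accumulate the same values in a different order, so they agree up to floating-point rounding rather than bit for bit.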

A result similar to Theorem 4.6 can be proven for $\bm{P}$-skew-invariant tensors.

Theorem 4.19.

For a permutation of even order $K$, the Gaussian distribution of $\bm{P}$-skew-invariant tensors $\mathcal{N}(\bm{w}_0,\bm{P}_0)$ is described by a mean vector $\bm{w}_0$ that is $\bm{P}$-skew-invariant and covariance matrix

\begin{equation}\tag{12}
\bm{P}_{0} := \frac{-\bm{P}+\bm{P}^{2}-\cdots+\bm{P}^{K}}{K} = \frac{\sum_{k=1}^{K}(-1)^{k}\,\bm{P}^{k}}{K}.
\end{equation}

Proof 4.20.

The proof is very similar to that of Theorem 4.6. The diagonal entries being nonnegative can be derived from the following argument. The permutation matrix $\bm{P}$ itself consists of cyclic permutations, with either even or odd order. If a cyclic permutation has an even order $k$, then $\bm{P}^{k}$ will have ones on the diagonal for elements of the cycle. This cycle will occur $K/k$ times in (12), always with a positive sign. If a cyclic permutation has odd order $k$, then the diagonal entries of $\bm{P}^{k}$ will come in equal amounts of $K/(2k)$ negative and $K/(2k)$ positive contributions, which results in a zero contribution to the diagonal. The total effect of all cyclic permutations then adds up to either zero or positive diagonal entries. Symmetry is proven by using Corollary 4.5 and the fact that $K$ is even: an even order $k$ gets mapped to another even order $K-k$ and an odd order $k$ gets mapped to an odd order $K-k$. Hence,

\begin{align*}
\bm{P}_{0}^{T} = \frac{\sum_{k=1}^{K}(-1)^{k}\,(\bm{P}^{k})^{T}}{K} = \frac{\sum_{k=1}^{K}(-1)^{k}\,\bm{P}^{K-k}}{K} = \bm{P}_{0}.
\end{align*}

The idempotency of $\bm{P}_0$ follows a similar proof as for the case of $\bm{P}$-invariance. Writing out $(K\bm{P}_0)^{2}$ in terms of $\bm{P}$ and applying Corollary 4.4 results in

\begin{align*}
&(-\bm{P}+\bm{P}^{2}-\cdots+\bm{P}^{K})^{2}\\
&= \bm{P}^{2} - 2\,\bm{P}^{3} + \cdots + (K-1)\,\bm{P}^{K} - K\,\bm{P}^{K+1} + (K-1)\,\bm{P}^{K+2} - \cdots - 2\,\bm{P}^{2K-1} + \bm{P}^{2K}\\
&= -K\,\bm{P} + \underbrace{\bm{P}^{2} + (K-1)\,\bm{P}^{K+2}}_{K\,\bm{P}^{2}} - \cdots \underbrace{{}-2\,\bm{P}^{2K-1} - (K-2)\,\bm{P}^{K-1}}_{-K\,\bm{P}^{K-1}} + \underbrace{(K-1)\,\bm{P}^{K} + \bm{P}^{2K}}_{K\,\bm{P}^{K}}\\
&= K\,(-\bm{P}+\bm{P}^{2}-\bm{P}^{3}+\cdots+\bm{P}^{K})\\
&= K^{2}\,\bm{P}_{0}
\end{align*}

which proves that $\bm{P}_0$ is idempotent.
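The properties claimed for the matrix in (12) can be checked on a small case. The sketch below uses plain Python with exact rational arithmetic and a hypothetical permutation made of one 2-cycle and one 3-cycle, so that $K=6$. The helper names `perm_matrix`, `matmul` and `skew_projector`, and the convention $(\bm{P}\bm{x})_i = x_{p_i}$, are assumptions made here for illustration only.

```python
from fractions import Fraction

def perm_matrix(p):
    """Dense matrix P with (P x)[i] = x[p[i]] (assumed convention)."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def skew_projector(p, K):
    """(1/K) * sum_{k=1..K} (-1)^k P^k, assuming K is the even order of p."""
    n = len(p)
    P = perm_matrix(p)
    Pk = [[int(i == j) for j in range(n)] for i in range(n)]  # P^0 = I
    S = [[Fraction(0)] * n for _ in range(n)]
    for k in range(1, K + 1):
        Pk = matmul(Pk, P)                 # Pk now holds P^k
        sign = -1 if k % 2 else 1
        S = [[s + sign * v for s, v in zip(rs, rv)] for rs, rv in zip(S, Pk)]
    return [[s / K for s in row] for row in S]

p = [1, 0, 3, 4, 2]     # 2-cycle on {0, 1}, 3-cycle on {2, 3, 4}; order 6
P0 = skew_projector(p, 6)

is_symmetric = all(P0[i][j] == P0[j][i] for i in range(5) for j in range(5))
is_idempotent = matmul(P0, P0) == P0
is_skew = matmul(perm_matrix(p), P0) == [[-v for v in row] for row in P0]
diagonal = [P0[i][i] for i in range(5)]   # 1/2, 1/2, 0, 0, 0
```

In this example the 3-cycle contributes a zero block, in line with the argument in the proof that cycles of odd order cancel on the diagonal, while the 2-cycle contributes the block $\tfrac{1}{2}\begin{pmatrix}1 & -1\\ -1 & 1\end{pmatrix}$.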

Theorems 4.6 and 4.19 are practical when the order $K$ of the permutation matrix $\bm{P}$ stays small compared to $J$ and $D$. For Hankel structures this is unfortunately not the case. Consider for example a $20\times 20$ Hankel matrix. Its corresponding permutation matrix has permutation cycles ranging from length 1 up to 20 and $K$ is therefore the least common multiple of $1,2,\ldots,20$, which equals $232{,}792{,}560$. Fortunately, it is possible to explicitly construct a sparse matrix $\bm{V}$ with orthogonal columns such that $\sqrt{\bm{P}_0}=\bm{V}$.
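The size of $K$ in the Hankel example is easy to reproduce. The following short Python sketch computes the order of a permutation as the least common multiple of its cycle lengths; the function name `permutation_order` is introduced here for illustration.

```python
import math
from functools import reduce

def permutation_order(cycle_lengths):
    """Order K of a permutation: least common multiple of its cycle lengths."""
    return reduce(lambda a, b: a * b // math.gcd(a, b), cycle_lengths, 1)

# Cycle lengths 1, 2, ..., 20 as in the 20 x 20 Hankel example.
K = permutation_order(range(1, 21))   # 232792560
```

The order grows quickly with the matrix size, which is why summing $K$ permuted copies becomes impractical for Hankel structures.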

5 Sparse square root covariance matrix construction for permutation-invariant tensors

Every permutation $\bm{P}$ can be decomposed in terms of $R$ cyclic permutations. These cyclic permutations partition the set of all tensor entries into $R$ disjoint sets and allow for an alternative construction of $\sqrt{\bm{P}_0}$, where the resulting matrix is sparse and consists of orthogonal columns.

Theorem 5.1.

Let $\bm{P}$ be a permutation matrix that consists of $R$ permutation cycles and let $C_r$ denote the $r$th cycle, where the number of tensor entries in $C_r$ is denoted $|C_r|$. Then the range of the matrix $\bm{V}\in\mathbb{R}^{J^{D}\times R}$ with entries

\begin{equation}\tag{13}
v_{\overline{j_1,j_2,\ldots,j_D},\,r} = \begin{cases} \dfrac{1}{\sqrt{|C_r|}} & \text{if } w_{j_1,j_2,\ldots,j_D}\in C_r,\\[1ex] 0 & \text{otherwise,} \end{cases}
\end{equation}

is the eigenspace of $\bm{P}$ corresponding to the eigenvalue $\lambda=1$. In other words, $\bm{V}=\sqrt{\bm{P}_0}$. Also, $\bm{V}^{T}\bm{V}=\bm{I}_R$.

Proof 5.2.

The equality $\bm{P}\bm{V}=\bm{V}$ follows from each column of $\bm{V}$ containing nonzero values at tensor entries of a particular permutation cycle of $\bm{P}$. The orthogonality follows directly from the permutation cycles being disjoint and each column of $\bm{V}$ being unit-norm due to the scaling with $\sqrt{|C_r|}$.

A basis for the $\bm{P}$-skew-invariant eigenspace can be built in a similar way by retaining the cycles of even order and alternating the sign of the entries $v_{\overline{j_1,j_2,\ldots,j_D},\,r}$ in each column.

Example 5.3.

Consider a $20\times 20$ Hankel matrix. Using Theorem 4.6 one would need to construct the $400\times 400$ Hankel permutation matrix $\bm{H}$ and construct $\bm{P}_0$ by adding $232{,}792{,}560$ terms together. Using Theorem 5.1 the sparse $400\times 39$ matrix $\bm{V}$ can be constructed directly containing 400 nonzero entries.
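The construction of Theorem 5.1 for this example can be sketched in a few lines of Python. The sketch is illustrative only: the helper names, the row-major linear index `i * n + j`, and the particular choice of Hankel permutation (each entry moves cyclically to the next entry on its anti-diagonal) are assumptions made here, and the columns of $\bm{V}$ are stored as dictionaries of nonzero entries rather than in a sparse-matrix type.

```python
import math

def cycles(p):
    """Decompose the index permutation p into its disjoint cycles."""
    seen, out = [False] * len(p), []
    for start in range(len(p)):
        if not seen[start]:
            cyc, i = [], start
            while not seen[i]:
                seen[i] = True
                cyc.append(i)
                i = p[i]
            out.append(cyc)
    return out

def sparse_sqrt_factor(p):
    """Columns of V as {row index: value}, one column per cycle, as in (13)."""
    return [{i: 1.0 / math.sqrt(len(c)) for i in c} for c in cycles(p)]

def hankel_permutation(n):
    """Assumed Hankel permutation of an n x n matrix: every entry is mapped
    to the next entry on the same anti-diagonal, wrapping around."""
    p = list(range(n * n))
    for s in range(2 * n - 1):                       # s = i + j
        diag = [i * n + (s - i)
                for i in range(max(0, s - n + 1), min(n, s + 1))]
        for a, b in zip(diag, diag[1:] + diag[:1]):
            p[a] = b
    return p

V = sparse_sqrt_factor(hankel_permutation(20))
n_columns = len(V)                          # 39 anti-diagonals
n_nonzeros = sum(len(col) for col in V)     # 400 nonzero entries
```

The cost is a single pass over the $J^{D}$ entries, independent of the order $K$ of the permutation.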

6 Solving the inverse problem

In this section three aspects of solving the inverse problem are discussed. First, we explain how the prior covariance matrices of $(\bm{A},\bm{b})$-constrained tensors can be parameterized. Second, we briefly discuss a change of variables, originally proposed in [8], to exploit fast implementations of the matrix-vector product $\bm{P}_0\bm{w}$. The third aspect relates to kernel methods, where $(\bm{A},\bm{b})$-constrained tensor priors are used to define new structured tensor kernel functions.

6.1 Parameterizing the prior covariance matrix

The covariance matrix $\bm{P}_0$ as described in Theorems 3.1, 4.6 and 5.1 encodes the structure of the $(\bm{A},\bm{b})$-constrained tensor without having any free parameters to quantify the importance of the prior $p(\bm{w})$ relative to the likelihood $p(\bm{y}|\bm{w})$. Such free parameters are often called hyperparameters. Suppose for example that through Theorem 3.1 an orthogonal basis for the nullspace $\bm{V}_2\in\mathbb{R}^{J^{D}\times R}$ of $\bm{A}$ is computed from its singular value decomposition (SVD)

\begin{align*}
\bm{A} = \begin{pmatrix}\bm{U}_{1} & \bm{U}_{2}\end{pmatrix}\,\begin{pmatrix}\bm{S} & \bm{0}\\ \bm{0} & \bm{0}\end{pmatrix}\,\begin{pmatrix}\bm{V}_{1}^{T}\\ \bm{V}_{2}^{T}\end{pmatrix}.
\end{align*}

A desired square-root covariance matrix $\sqrt{\bm{P}_0}$ is then $\bm{V}_2\,\bm{T}$, where $\bm{T}\in\mathbb{R}^{R\times R}$ is any invertible matrix. The nullity $R$ of $\bm{A}$ can be interpreted as the total number of distinct elements in the $(\bm{A},\bm{b})$-constrained tensor $\bm{\mathcal{W}}$. The $\bm{T}$ matrix can be interpreted as the square-root covariance matrix of those $R$ variables since

\begin{align*}
\bm{P}_{0} = \sqrt{\bm{P}_{0}}\,(\sqrt{\bm{P}_{0}})^{T} = \bm{V}_{2}\,\left(\bm{T}\,\bm{T}^{T}\right)\,\bm{V}_{2}^{T}.
\end{align*}

The matrix $\bm{V}_2$ is then to be understood as “projecting” the covariance matrix $\bm{T}\bm{T}^{T}$ of the $R$ underlying variables to the $J^{D}$ entries of the $(\bm{A},\bm{b})$-constrained tensor $\bm{\mathcal{W}}$. Parameterizing $\bm{T}$ in terms of a single hyperparameter $\sigma\in\mathbb{R}^{+}$ as $\bm{T}=\sigma\,\bm{I}$ implies that these $R$ variables are independent and have equal variance $\sigma^{2}$. Correlations between the $R$ variables can be modeled by, for example, parameterizing $\bm{T}$ as a lower triangular matrix. The values of these hyperparameters can be learned from data through cross-validation, marginal likelihood optimization or a hierarchical Bayesian approach [27, 32].
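As a short worked step for the single-hyperparameter case, and assuming that the columns of $\bm{V}_2$ are orthonormal (as they are when taken from the SVD), substituting $\bm{T}=\sigma\,\bm{I}_R$ gives a scaled orthogonal projector onto the nullspace of $\bm{A}$:

```latex
\begin{align*}
\bm{P}_{0}
  = \bm{V}_{2}\,(\sigma\,\bm{I}_{R})\,(\sigma\,\bm{I}_{R})^{T}\,\bm{V}_{2}^{T}
  = \sigma^{2}\,\bm{V}_{2}\,\bm{V}_{2}^{T},
\qquad
\bm{P}_{0}^{\dagger} = \sigma^{-2}\,\bm{V}_{2}\,\bm{V}_{2}^{T}.
\end{align*}
```

In this case $\sigma$ only rescales the prior relative to the likelihood and leaves the encoded structure unchanged.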

6.2 Change of variables

Squaring the condition number when solving the normal equation of (3) can be avoided by solving its square-root version

(𝚺1𝚽𝑷01)𝒘+matrixsuperscript𝚺1𝚽superscriptsubscript𝑷01subscript𝒘\displaystyle\begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}\bm{\Phi}\\ \sqrt{\bm{P}_{0}^{-1}}\end{pmatrix}\;\bm{w}_{+}( start_ARG start_ROW start_CELL square-root start_ARG bold_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG bold_Φ end_CELL end_ROW start_ROW start_CELL square-root start_ARG bold_italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG end_CELL end_ROW end_ARG ) bold_italic_w start_POSTSUBSCRIPT + end_POSTSUBSCRIPT =(𝚺1𝒚𝑷01𝒘0)absentmatrixsuperscript𝚺1𝒚superscriptsubscript𝑷01subscript𝒘0\displaystyle=\begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}{\bm{y}}\\ \sqrt{\bm{P}_{0}^{-1}}\,\bm{w}_{0}\end{pmatrix}= ( start_ARG start_ROW start_CELL square-root start_ARG bold_Σ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG bold_italic_y end_CELL end_ROW start_ROW start_CELL square-root start_ARG bold_italic_P start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT end_ARG bold_italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG )

instead. When constructing the square root of the inverse prior covariance matrix is difficult, a change of variables can be used to avoid its construction [8]. By defining $\bm{x} := \bm{P}_0^{-1}\,(\bm{w}_+ - \bm{w}_0)$ and $\bm{z} := \bm{y} - \bm{\Phi}\bm{w}_0$ the square-root linear system is transformed into

\begin{align*}
\begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}\,\bm{\Phi}\,\bm{P}_0\\ \bm{I}\end{pmatrix}\;\bm{x} &= \begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}\,\bm{z}\\ 0\end{pmatrix}.
\end{align*}

The desired posterior mean $\bm{w}_+$ can then be recovered from $\bm{w}_+ = \bm{P}_0\,\bm{x} + \bm{w}_0$. This formulation is especially beneficial when the matrix-vector product $\bm{P}_0\,\bm{x}$ can be implemented in a computationally efficient manner, for example using Algorithm 2.

6.3 Structured tensor kernel functions

When the tensor $\bm{\mathcal{W}}$ is much larger than the data size $N$, the $O(J^{3D})$ computational complexity of computing (3) is replaced with at least $O(N^2)$ by solving the corresponding dual problem

\begin{align*}
(\bm{\Phi}\,\bm{P}_0\,\bm{\Phi}^T + \bm{\Sigma})\;\bm{v} = \bm{y}.
\end{align*}

An additional benefit is that no matrix inverse of $\bm{P}_0$ is required, so that Theorems 3.1, 4.6 and 5.1 can be applied directly. The matrix $\bm{\Phi}\,\bm{P}_0\,\bm{\Phi}^T$ is called the kernel matrix $\bm{K}$ and each entry $k_{i,j}$ is by definition the evaluation of a kernel function

\begin{align*}
k_{i,j} = k(\bm{x}_i,\bm{x}_j) := \bm{\varphi}(\bm{x}_i)^T\,\bm{P}_0\;\bm{\varphi}(\bm{x}_j).
\end{align*}

Choosing $\bm{P}_0$ as a covariance matrix of an $(\bm{A},\bm{b})$-constrained tensor allows us to define new kernel functions. The kernel trick in machine learning refers to the fact that the kernel function can be evaluated without ever explicitly computing the possibly large feature vectors $\bm{\varphi}(\cdot)$. In the case of $\bm{P}$-invariant tensors one can exploit the particular structure of $\bm{P}_0$ as described in Theorem 4.6 or use Algorithm 2 to achieve this goal.

Example 6.1.

(Centrosymmetric polynomial kernel) Let $\sqrt{c}\in\mathbb{R}$ and $d\in\mathbb{N}$. The polynomial kernel function is defined as

\begin{align*}
k(\bm{x}_i,\bm{x}_j) &= \bm{\varphi}(\bm{x}_i)^T\;\bm{I}\;\bm{\varphi}(\bm{x}_j),\\
&= \underbrace{\begin{pmatrix}\sqrt{c} & \bm{x}_i^T\end{pmatrix}\otimes\cdots\otimes\begin{pmatrix}\sqrt{c} & \bm{x}_i^T\end{pmatrix}}_{d\textrm{ times}}\;\bm{I}\;\underbrace{\begin{pmatrix}\sqrt{c} & \bm{x}_j^T\end{pmatrix}^T\otimes\cdots\otimes\begin{pmatrix}\sqrt{c} & \bm{x}_j^T\end{pmatrix}^T}_{d\textrm{ times}}\\
&= (c+\bm{x}_i^T\,\bm{x}_j)^d.
\end{align*}

The expression $(c+\bm{x}_i^T\,\bm{x}_j)^d$ is obtained from writing the identity matrix $\bm{I}$ as a Kronecker product of smaller identity matrices and applying the mixed-product property. The polynomial kernel function can therefore be interpreted as using a unit covariance matrix $\bm{P}_0$. We can now define the centrosymmetric polynomial kernel function $k_2$ by using the polynomial feature vectors $\bm{\varphi}(\cdot)$ and replacing $\bm{I}$ with the covariance matrix of centrosymmetric tensors. From Theorem 4.6 it then follows that

\begin{align*}
k_2(\bm{x}_i,\bm{x}_j) &= \bm{\varphi}(\bm{x}_i)^T\;\bm{P}_0\;\bm{\varphi}(\bm{x}_j),\\
&= \frac{1}{2}\,\bm{\varphi}(\bm{x}_i)^T\;(\bm{I}+\bm{J})\;\bm{\varphi}(\bm{x}_j),\\
&= \frac{1}{2}\,\underbrace{\begin{pmatrix}\sqrt{c} & \bm{x}_i^T\end{pmatrix}\otimes\cdots\otimes\begin{pmatrix}\sqrt{c} & \bm{x}_i^T\end{pmatrix}}_{d\textrm{ times}}\;(\bm{I}+\bm{J})\;\underbrace{\begin{pmatrix}\sqrt{c} & \bm{x}_j^T\end{pmatrix}^T\otimes\cdots\otimes\begin{pmatrix}\sqrt{c} & \bm{x}_j^T\end{pmatrix}^T}_{d\textrm{ times}},\\
&= \frac{1}{2}(c+\bm{x}_i^T\bm{x}_j)^d + \frac{1}{2}\left(\begin{pmatrix}\sqrt{c} & \bm{x}_i^T\end{pmatrix}\bm{J}_d\begin{pmatrix}\sqrt{c} & \bm{x}_j^T\end{pmatrix}^T\right)^d.
\end{align*}

Also here the explicit construction of $\bm{\varphi}(\cdot)$ is avoided by writing the matrix $\bm{J}\in\mathbb{R}^{J^D\times J^D}$ as a Kronecker product of the smaller permutation matrix $\bm{J}_d\in\mathbb{R}^{J\times J}$ with itself $d$ times and using the mixed-product property.
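The closed form of Example 6.1 can be checked numerically. The following is a minimal sketch, assuming that $\bm{J}$ and $\bm{J}_d$ are exchange (index-reversing) permutation matrices, so that multiplying by them reverses the order of a vector's entries. The function names are hypothetical and not taken from the accompanying notebooks. The first routine builds the feature vectors explicitly, the second evaluates the kernel without them.

```python
import math

def kron(a, b):
    """Kronecker product of two vectors stored as Python lists."""
    return [ai * bj for ai in a for bj in b]

def feature_vector(x, c, d):
    """Explicit feature vector: d-fold Kronecker power of (sqrt(c), x)."""
    base = [math.sqrt(c)] + list(x)
    phi = [1.0]
    for _ in range(d):
        phi = kron(phi, base)
    return phi

def centrosymmetric_kernel_explicit(xi, xj, c, d):
    """Evaluate 0.5 * phi_i^T (I + J) phi_j with explicit feature vectors."""
    phi_i = feature_vector(xi, c, d)
    phi_j = feature_vector(xj, c, d)
    identity_term = sum(a * b for a, b in zip(phi_i, phi_j))
    # Assumption: J is the exchange matrix, so J * phi_j reverses phi_j.
    exchange_term = sum(a * b for a, b in zip(phi_i, reversed(phi_j)))
    return 0.5 * (identity_term + exchange_term)

def centrosymmetric_kernel(xi, xj, c, d):
    """Closed form of k_2; the feature vectors are never constructed."""
    ui = [math.sqrt(c)] + list(xi)
    uj = [math.sqrt(c)] + list(xj)
    inner = sum(a * b for a, b in zip(ui, uj))              # c + xi^T xj
    flipped = sum(a * b for a, b in zip(ui, reversed(uj)))  # ui^T J_d uj
    return 0.5 * (inner ** d + flipped ** d)
```

The explicit version touches $J^d$ entries per evaluation, whereas the closed form only needs two inner products of length $J$, which is the point of the kernel trick in this setting.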

7 Applications

In this section we demonstrate the use of Theorems 3.1, 4.6, and 5.1 in three different applications. Practical implementations of how to sample various $(\bm{A},\bm{b})$-constrained tensor priors are explained in Application 7.1. We consider lower triangular tensors, tensors for which the sum over the last index adds up to 1, symmetric tensors and Hankel tensors. Application 7.2 considers the problem of completing a Hankel matrix from noisy partial measurements by solving it as a Bayesian inverse problem. The estimate of the completed Hankel matrix when using a Hankel prior is compared to the estimate where no prior is used. In Application 7.3 learning a classifier for handwritten digits is solved as a Bayesian inverse problem. The classifier obtained with the commonly used Tikhonov prior is compared to several $(\bm{A},\bm{b})$-constrained tensor priors.

All applications have been implemented as reactive Pluto [28] notebooks in Julia [5] and are publicly available at https://github.com/TUDelft-DeTAIL/AbTensors. The notebook files can be freely downloaded and run on your local machine in Julia. An alternative way to use these notebooks that does not require the installation of Julia is to run them in the cloud via Binder [23]. This can be done by clicking on each of the links on the main Github page. Please note that it can take over 10 minutes for Binder to download and compile all required packages.

As discussed in section 6.1 we parameterized the prior covariance matrix $\bm{P}_0$ with a single hyperparameter $\sigma_P$ in both Applications 7.2 and 7.3.

7.1 Sampling structured tensor priors

In this first application we demonstrate how Theorems 3.1, 4.6 and 5.1 are used to sample the priors of different $(\bm{A},\bm{b})$-constrained tensors.

Example 7.1.

(Lower triangular tensors) The first example of an $(\bm{A},\bm{b})$-constrained tensor considered here is the lower triangular tensor. From Definition 2.1 we know that triangular tensors are described by

\begin{align*}
\bm{A} = \begin{pmatrix}\bm{A}_1\\ \bm{A}_2\\ \vdots\\ \bm{A}_{D-1}\end{pmatrix} = \begin{pmatrix}\bm{S}\otimes\bm{I}_J\otimes\cdots\otimes\bm{I}_J\\ \bm{I}_J\otimes\bm{S}\otimes\cdots\otimes\bm{I}_J\\ \vdots\\ \bm{I}_J\otimes\bm{I}_J\otimes\cdots\otimes\bm{S}\end{pmatrix}\in\mathbb{R}^{\frac{(D-1)(J-1)J^{D-1}}{2}\times J^D}
\end{align*}

and zero vector $\bm{b}$. The square root of the covariance matrix is built up by applying Algorithm 1, which considers only one block row of $\bm{A}$ at a time. The whole matrix $\bm{A}$ is therefore never explicitly formed. In the notebook it is possible to sample lower triangular tensors with orders ranging from 2 up to 5 and dimensions 2 up to 6 by moving the corresponding sliders.

Example 7.2.

(Tensors with known sum of entries) In this example we sample tensors $\bm{\mathcal{W}}$ for which the sum over the last index always adds up to a value of 1:

\begin{align*}
\forall j_1,j_2,\ldots,j_{D-1}: \sum_{j_D} w_{j_1,j_2,\ldots,j_D} = b_{j_1,j_2,\ldots,j_{D-1}} = 1.
\end{align*}

From Lemma 2.5 we know that in this case $\bm{A} = \bm{1}_J^T\otimes\bm{I}_J\otimes\cdots\otimes\bm{I}_J$. It is straightforward to verify that a basis for the right nullspace of $\bm{A}$ is

\begin{align*}
\begin{pmatrix}1 & 1 & \cdots & 1\\ -1 & 0 & \cdots & 0\\ 0 & -1 & \cdots & 0\\ 0 & 0 & \cdots & -1\end{pmatrix}\otimes\bm{I}_J\otimes\cdots\otimes\bm{I}_J.
\end{align*}

Sampling the prior can now be done without ever constructing a basis for the nullspace explicitly since

\begin{align*}
\sqrt{\bm{P}_0}\;\bm{x} &= \left(\begin{pmatrix}1 & 1 & \cdots & 1\\ -1 & 0 & \cdots & 0\\ 0 & -1 & \cdots & 0\\ 0 & 0 & \cdots & -1\end{pmatrix}\otimes\bm{I}_J\otimes\cdots\otimes\bm{I}_J\right)\;\bm{x}\\
&= \begin{pmatrix}\bm{I}_{J^{D-1}} & \bm{I}_{J^{D-1}} & \cdots & \bm{I}_{J^{D-1}}\\ -\bm{I}_{J^{D-1}} & 0 & \cdots & 0\\ 0 & -\bm{I}_{J^{D-1}} & \cdots & 0\\ 0 & 0 & \cdots & -\bm{I}_{J^{D-1}}\end{pmatrix}\;\begin{pmatrix}\bm{x}_1\\ \bm{x}_2\\ \vdots\\ \bm{x}_{J-1}\end{pmatrix} = \begin{pmatrix}\bm{x}_1+\bm{x}_2+\cdots+\bm{x}_{J-1}\\ -\bm{x}_1\\ -\bm{x}_2\\ \vdots\\ -\bm{x}_{J-1}\end{pmatrix}.
\end{align*}

It is therefore sufficient to sample $\bm{x}\in\mathbb{R}^{(J-1)\,J^{D-1}}$ from a standard normal distribution and do the operations on the $J-1$ partitions of $\bm{x}$ as described above to generate the desired sample. In the notebook one can change the order of the sampled tensor from 2 up to 5 and dimension from 5 up to 10 by using the corresponding sliders.
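The partition-based sampling described above can be sketched as follows. This is a minimal illustration, not the notebook implementation: the function name is hypothetical, the sample is returned as a list of its $J$ blocks of length $J^{D-1}$, and the constant mean $1/J$ is an assumption made here because it is one vector that satisfies the constraint, while the text above only specifies the zero-sum part $\sqrt{\bm{P}_0}\,\bm{x}$.

```python
import random

def sample_sum_constrained(J, D, rng=None):
    """Return J blocks of length J**(D-1) whose entrywise sum over blocks is 1."""
    rng = rng or random.Random()
    block = J ** (D - 1)
    # Draw the J-1 standard normal partitions x_1, ..., x_{J-1}.
    parts = [[rng.gauss(0.0, 1.0) for _ in range(block)] for _ in range(J - 1)]
    # First block is x_1 + ... + x_{J-1}; the remaining blocks are -x_k.
    first = [sum(p[i] for p in parts) for i in range(block)]
    blocks = [first] + [[-v for v in p] for p in parts]
    # Assumed mean: every entry equals 1/J, so that each constrained sum is 1.
    return [[v + 1.0 / J for v in b] for b in blocks]
```

Only additions and sign flips on the partitions are needed, so the cost is linear in the number of tensor entries and no nullspace basis is ever stored.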

Example 7.3.

(Symmetric tensors) Symmetric tensors $\bm{\mathcal{W}}$ are tensors for which entries are invariant under any index permutation. The permutation matrix $\bm{S}$ in the symmetric case consists of cyclic permutations where each cycle contains the entry $w_{j_1,\ldots,j_D}$ and all entries with corresponding index permutations $w_{\pi(j_1,\ldots,j_D)}$. For example, in the case $D=2$ and $J=2$ the permutation matrix $\bm{S}$ consists of $3$ cyclic permutations

\begin{align*}
w_{1,1}\mapsto w_{1,1},\; w_{2,1}\mapsto w_{1,2},\; w_{1,2}\mapsto w_{2,1},\; w_{2,2}\mapsto w_{2,2}.
\end{align*}

The order $K$ of $\bm{S}$ in this case is $2$ since $\bm{S}^2=\bm{I}$. According to Theorem 4.6 we then have that the square root of the covariance matrix is $\sqrt{\bm{P}_0} = \nicefrac{(\bm{S}+\bm{S}^2)}{2}$. When $D=3$, the order $K$ of the corresponding permutation matrix is $6$ and hence $\sqrt{\bm{P}_0} = \nicefrac{(\bm{S}+\bm{S}^2+\bm{S}^3+\bm{S}^4+\bm{S}^5+\bm{S}^6)}{6}$. Sampling from these priors is done via Algorithm 2 where a standard normal sample $\bm{x}\in\mathbb{R}^{J^D}$ is generated and permuted $K$ times. The notebook allows you to sample symmetric tensors of orders 2 and 3 and dimensions 3 up to 10.
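Averaging the powers $\bm{S},\ldots,\bm{S}^K$ applied to $\bm{x}$ replaces every entry by the mean of $\bm{x}$ over the cycle that contains it. The sketch below computes that cycle mean directly by averaging over all reorderings of each index tuple, which should coincide with the result described above. It is not Algorithm 2 itself, and the function name and the dictionary storage are choices made for this illustration.

```python
import itertools
import random

def sample_symmetric(J, D, rng=None):
    """Sample a symmetric order-D tensor as a dict from index tuples to values."""
    rng = rng or random.Random()
    indices = list(itertools.product(range(J), repeat=D))
    # Standard normal sample x with one value per tensor entry.
    x = {idx: rng.gauss(0.0, 1.0) for idx in indices}
    perms = list(itertools.permutations(range(D)))
    w = {}
    for idx in indices:
        # Average x over every reordering of the index tuple; each member of
        # the cycle is visited equally often, so this equals the cycle mean.
        total = sum(x[tuple(idx[p] for p in perm)] for perm in perms)
        w[idx] = total / len(perms)
    return w
```

For $D=2$ this reduces to averaging a random matrix with its transpose.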

Example 7.4.

(Hankel tensors) Hankel tensors $\bm{\mathcal{W}}$ are tensors for which entries with a constant index sum $j_1+\cdots+j_D$ have the same numerical value. The order $K$ of the corresponding permutation matrix $\bm{P}$ grows very quickly. For example, when $D=2$ and $J=20$ the order $K$ is the least common multiple of $1,2,\ldots,20$, which equals $232{,}792{,}560$. Theorem 5.1, however, allows us to construct a matrix $\sqrt{\bm{P}_0}\in\mathbb{R}^{J^D\times R}$, where $R$ is the number of permutation cycles. For Hankel tensors we have that $R=D(J-1)+1$. The notebook allows you to sample Hankel tensors of order 2 up to 4 and dimensions 3 up to 10.
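The cycle count and the order quoted in Example 7.4 can be reproduced by grouping multi-indices by their index sum. The sketch below also draws a Hankel sample by assigning one standard normal variable to each cycle. The unit scaling of each cycle is an assumption of this sketch, since the exact form of $\sqrt{\bm{P}_0}$ in Theorem 5.1 is not restated here, and all function names are hypothetical.

```python
import itertools
import random
from functools import reduce
from math import gcd

def hankel_cycles(J, D):
    """Group the 1-based multi-indices of a J^D tensor by their index sum."""
    cycles = {}
    for idx in itertools.product(range(1, J + 1), repeat=D):
        cycles.setdefault(sum(idx), []).append(idx)
    return cycles

def lcm_of(values):
    """Least common multiple of an iterable of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values, 1)

def sample_hankel(J, D, rng=None):
    """Sample a Hankel tensor as a dict from index tuples to values."""
    rng = rng or random.Random()
    cycles = hankel_cycles(J, D)
    # One standard normal variable per cycle (unit scaling is assumed here).
    h = {s: rng.gauss(0.0, 1.0) for s in cycles}
    return {idx: h[s] for s, members in cycles.items() for idx in members}

cycles = hankel_cycles(20, 2)
number_of_cycles = len(cycles)                                  # R = D(J-1)+1
permutation_order = lcm_of(len(m) for m in cycles.values())     # order K
```

Only $R$ random numbers are drawn, regardless of how large the order $K$ of the permutation is.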

7.2 Completion of a Hankel matrix from noisy measurements

Hankel matrices are very common in signal processing and control theory. In this application a Bayesian approach is used to complete a Hankel matrix from noisy, incomplete measurements. We use the forward model $\bm{y}=\bm{\Phi}\,\bm{w}+\bm{\epsilon}$, where $\bm{w}\in\mathbb{R}^{10^2}$ is the vectorization of the true underlying $10\times 10$ Hankel matrix. The $I\times 10^2$ matrix $\bm{\Phi}$ selects $I$ random entries of $\bm{w}$ with equal probability: each row of $\bm{\Phi}$ contains a single nonzero unit-valued entry at a random location. The number of measurements $I$ can be changed through a slider in the notebook. The vector $\bm{\epsilon}$ is zero-mean Gaussian noise. Given $\bm{y}$ and $\bm{\Phi}$, a Bayesian estimate of the underlying Hankel matrix $\bm{W}$ can be obtained from (3) as the posterior mean $\bm{w}_+$. Another commonly used estimate is the maximum likelihood estimate, which is the $\bm{w}$ that maximizes the likelihood $p(\bm{y}\mid\bm{w})$. We compare two posterior estimates with the maximum likelihood estimate under two different assumptions on the noise covariance. We fix the sampling rate at $50\%$ and choose $\sigma_\epsilon^2=1$. The prior covariance matrix is set to $\sigma_P^2\,\bm{P}_0=10^{-6}\,\bm{P}_0$, where $\bm{P}_0$ is the covariance matrix of the Hankel prior obtained via Theorem 5.1.
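The measurement setup can be sketched in a few lines. This is an illustrative sketch only: the column-major vectorization, sampling positions with replacement, and the Gaussian ground truth are assumptions, and the helper names (`hankel_vector`, `sample_measurements`, `selection_matrix`) are hypothetical rather than taken from the notebooks.

```python
import random


def hankel_vector(h, J):
    """Column-major vectorization of the J x J Hankel matrix W[i][j] = h[i + j]."""
    return [h[i + j] for j in range(J) for i in range(J)]


def sample_measurements(w, I, sigma_eps, rng):
    """Pick I entries of w uniformly at random and add white Gaussian noise.

    Returns the selected positions (the column of the single unit entry in
    each row of Phi) and the noisy measurements y = Phi w + eps.
    Assumption: positions are drawn with replacement.
    """
    positions = [rng.randrange(len(w)) for _ in range(I)]
    y = [w[p] + rng.gauss(0.0, sigma_eps) for p in positions]
    return positions, y


def selection_matrix(positions, n):
    """Dense I x n matrix Phi with exactly one unit entry per row."""
    return [[1.0 if c == p else 0.0 for c in range(n)] for p in positions]


rng = random.Random(0)
J = 10
h = [rng.gauss(0.0, 1.0) for _ in range(2 * J - 1)]  # 19 distinct values
w = hankel_vector(h, J)                              # ground truth, length 100
positions, y = sample_measurements(w, I=50, sigma_eps=1.0, rng=rng)
Phi = selection_matrix(positions, len(w))            # 50 x 100
```

Storing only `positions` is enough in practice, since multiplying by $\bm{\Phi}$ amounts to indexing into $\bm{w}$.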

Example 7.5.

(White noise) First we consider white noise, which implies that $\bm{\Sigma}=\sigma_\epsilon^2\,\bm{I}$. The singular values of the square-root precision matrices of the prior, $\sqrt{\bm{P}_0^{-1}}/\sigma_P$, the posterior, $\begin{pmatrix}\bm{\Phi}^T/\sigma_\epsilon & \sqrt{\bm{P}_0^{-1}}^T/\sigma_P\end{pmatrix}^T$, and the likelihood, $\bm{\Phi}/\sigma_\epsilon$, are shown in Figure 1(a). They provide insight into how the prior, posterior and likelihood relate to each other. The likelihood $p(\bm{y}\mid\bm{w})$ only has 50 measurements and gives all of them equal weight. The prior $p(\bm{w})$, on the other hand, only considers 19 nonzero values, as a $10\times 10$ Hankel matrix has 19 distinct entries. Given the relatively high noise variance compared to the prior variance, the posterior $p(\bm{w}\mid\bm{y})$ “follows” the prior for the first 19 singular values.
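The reason the stacked matrix acts as a square root of the posterior precision can be made explicit. The identity below is a short derivation that assumes the standard linear-Gaussian posterior precision and reads $\bm{P}_0^{-1}$ as a pseudoinverse, since the Hankel $\bm{P}_0$ is rank deficient; this reading is an assumption and is not stated in this section.

```latex
\begin{pmatrix} \bm{\Phi}^T/\sigma_\epsilon & \sqrt{\bm{P}_0^{-1}}^T/\sigma_P \end{pmatrix}
\begin{pmatrix} \bm{\Phi}/\sigma_\epsilon \\ \sqrt{\bm{P}_0^{-1}}/\sigma_P \end{pmatrix}
= \frac{\bm{\Phi}^T\bm{\Phi}}{\sigma_\epsilon^2}
+ \frac{\sqrt{\bm{P}_0^{-1}}^T\sqrt{\bm{P}_0^{-1}}}{\sigma_P^2}
= \frac{\bm{\Phi}^T\bm{\Phi}}{\sigma_\epsilon^2} + \frac{\bm{P}_0^{-1}}{\sigma_P^2}.
```

With $\sigma_\epsilon^2=1$ and $\sigma_P^2=10^{-6}$, the likelihood block is scaled by $1/\sigma_\epsilon=1$ and the prior block by $1/\sigma_P=10^{3}$, which is consistent with the posterior following the prior along the 19 Hankel directions.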

(a) White-noise case. Given the relatively high noise variance, the posterior follows the prior for the first 19 singular values.
(b) Hankel-noise case. Here too the posterior follows the prior for the first 19 singular values.
Figure 1: Singular values of the square-root precision matrices of the prior, likelihood and posterior distribution. Only $50\%$ of the Hankel matrix $\bm{W}$ was measured. The noise variance is 1 and the prior variance is $10^{-6}$.

A prior mean is obtained by averaging over the nonzero antidiagonals of the measurements and using those averages to construct a Hankel matrix. We now compute three different estimates and compare them to the ground truth. The first estimate is obtained from (3) with a backslash solve. The second estimate is computed by truncating the SVD of $\begin{pmatrix}\bm{\Phi}^T/\sigma_\epsilon & \sqrt{\bm{P}_0^{-1}}^T/\sigma_P\end{pmatrix}^T$ to rank 19 in equation (3). The third estimate is the maximum likelihood estimate. For each of these estimates we show the relative error in Table 1.
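The construction of the prior mean can be sketched as follows. This is one possible reading of the averaging step, assuming a column-major vectorization and assuming that antidiagonals without any measurement are set to zero; the function name `hankel_prior_mean` is hypothetical.

```python
def hankel_prior_mean(positions, y, J):
    """Average the measurements on each antidiagonal and fill a Hankel matrix.

    positions[k] is the column-major index of the entry measured by y[k].
    Antidiagonals without any measurement are set to zero (an assumption).
    Returns the column-major vectorization of the J x J Hankel prior mean.
    """
    sums = [0.0] * (2 * J - 1)
    hits = [0] * (2 * J - 1)
    for p, value in zip(positions, y):
        i, j = p % J, p // J  # row and column of the measured entry
        sums[i + j] += value
        hits[i + j] += 1
    means = [s / n if n > 0 else 0.0 for s, n in zip(sums, hits)]
    return [means[i + j] for j in range(J) for i in range(J)]


# Small 3 x 3 illustration: entries (1,0) and (0,1) share an antidiagonal.
w0 = hankel_prior_mean(positions=[0, 1, 3, 8], y=[1.0, 2.0, 4.0, 5.0], J=3)
print(w0)  # [1.0, 3.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 5.0]
```

Averaging over an antidiagonal reduces the effect of white noise on that Hankel entry, since several noisy copies of the same value are combined.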

Table 1: Relative errors for three different Hankel matrix completion estimates $\bm{\hat{w}}$. Smallest relative error is indicated in bold.
| | backslash | truncated SVD | max-likelihood |
|---|---|---|---|
| $\lVert\bm{w}-\bm{\hat{w}}\rVert_2/\lVert\bm{w}\rVert_2$ (white noise) | 0.160 | **0.137** | 0.614 |
| $\lVert\bm{w}-\bm{\hat{w}}\rVert_2/\lVert\bm{w}\rVert_2$ (Hankel noise) | 0.235 | **0.137** | 0.604 |
| $\lVert\bm{H}\bm{\hat{w}}-\bm{\hat{w}}\rVert_2/\lVert\bm{\hat{w}}\rVert_2$ | 0.12 | 6.3e-7 | 0.80 |

Adding the Hankel prior clearly improves the completed Hankel matrix: the relative error is roughly 4 times smaller than that of the maximum likelihood estimate. Since the first 19 singular values of the posterior are equal to the singular values of the prior, one could expect the estimated posterior mean $\bm{w}_+$ obtained from truncating the SVD to the first 19 singular values to be Hankel. To confirm this we also compute the relative Hankel error $\lVert\bm{H}\,\bm{\hat{w}}-\bm{\hat{w}}\rVert_2/\lVert\bm{\hat{w}}\rVert_2$ for the three estimates in Table 1, where $\bm{H}$ is the Hankel permutation matrix. Restricting the posterior mean to lie in the subspace spanned by the first 19 right singular vectors indeed enforces a Hankel structure.
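The relative Hankel error can be computed without forming $\bm{H}$ explicitly. The sketch below assumes that $\bm{H}$ cyclically shifts the entries along each antidiagonal, which is one permutation whose fixed points are exactly the Hankel matrices; the paper's definition of $\bm{H}$ may differ in detail, and the function names are hypothetical.

```python
from math import sqrt


def hankel_shift(w, J):
    """Apply a permutation that cyclically shifts each antidiagonal of W.

    Assumption: this is one valid choice of Hankel permutation H. A vector
    is invariant under it exactly when the J x J matrix is Hankel.
    w is the column-major vectorization of W.
    """
    diagonals = {}
    for j in range(J):
        for i in range(J):
            diagonals.setdefault(i + j, []).append(i + j * J)
    Hw = list(w)
    for members in diagonals.values():
        n = len(members)
        for k, p in enumerate(members):
            Hw[p] = w[members[(k + 1) % n]]
    return Hw


def relative_hankel_error(w, J):
    """Compute ||H w - w||_2 / ||w||_2."""
    Hw = hankel_shift(w, J)
    num = sqrt(sum((a - b) ** 2 for a, b in zip(Hw, w)))
    den = sqrt(sum(b ** 2 for b in w))
    return num / den


# A Hankel matrix is left unchanged, so its relative Hankel error is zero.
h = [1.0, 2.0, 3.0, 4.0, 5.0]
w_hankel = [h[i + j] for j in range(3) for i in range(3)]
print(relative_hankel_error(w_hankel, 3))  # 0.0
```

A value close to zero, such as the 6.3e-7 reported for the truncated SVD estimate, therefore indicates an estimate that is Hankel up to numerical error.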

Example 7.6.

(Hankel distributed noise) To investigate the effect of the noise covariance on the estimates we now consider noise $\bm{\epsilon}$ that also has a Hankel structure. In other words, the covariance matrix for $p(\bm{\epsilon})$ is $\sigma_\epsilon^2\,\bm{P}_0$, whereas the prior covariance is $\sigma_P^2\,\bm{P}_0$. With the noise being Hankel, the perturbation $\bm{\epsilon}$ of $\bm{w}$ has a Hankel structure as well. This can be modeled via the forward model $\bm{y}=\bm{\Phi}(\bm{w}+\bm{\epsilon})$, where now $p(\bm{\Phi}\bm{\epsilon})=\mathcal{N}(\bm{0},\sigma_\epsilon^2\,\bm{\Phi}\,\bm{P}_0\,\bm{\Phi}^T)$. Figure 1(b) shows the singular values of the square-root precision matrices. The nonzero singular values of the likelihood now form two plateaus. Again, the posterior follows the prior for the first 19 singular values. Since measurements of entries along the same antidiagonal are now identical, less information can be extracted from the measurements. This explains the first drop in Figure 1(b) at the 19th singular value for both the likelihood and the posterior. Less information also means that we can expect our estimate to be worse than in the white-noise case. The relative errors are indeed higher, as seen in Table 1. Note, however, that the estimate obtained by truncating the SVD remains the same.
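Sampling Hankel-structured noise can be sketched as follows. The sketch assumes that $\bm{P}_0$ is the orthogonal projector onto Hankel matrices, so that $\sqrt{\bm{P}_0}$ has one column per antidiagonal with entries $1/\sqrt{n}$ for an antidiagonal of length $n$; this scaling is an assumption about Theorem 5.1, and the function name `sample_hankel_noise` is hypothetical.

```python
import random
from math import sqrt


def sample_hankel_noise(J, sigma_eps, rng):
    """Draw a J x J Hankel noise matrix, returned in column-major order.

    Assumption: P_0 is the orthogonal projector onto Hankel matrices, so
    sqrt(P_0) has one column per antidiagonal with entries 1/sqrt(length).
    The sample is eps = sigma_eps * sqrt(P_0) z with z ~ N(0, I).
    """
    lengths = [min(s, 2 * J - 2 - s) + 1 for s in range(2 * J - 1)]
    z = [rng.gauss(0.0, 1.0) for _ in lengths]  # one draw per antidiagonal
    values = [sigma_eps * zr / sqrt(n) for zr, n in zip(z, lengths)]
    return [values[i + j] for j in range(J) for i in range(J)]


rng = random.Random(1)
J = 10
eps = sample_hankel_noise(J, sigma_eps=1.0, rng=rng)
# Entries (1, 0) and (0, 1) lie on the same antidiagonal and share their noise,
# so measuring both yields the same perturbation twice.
same_noise = eps[1 + 0 * J] == eps[0 + 1 * J]
```

Because the noise is constant along each antidiagonal, repeated measurements of the same Hankel entry no longer average the noise out, in line with the higher relative errors in Table 1.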

7.3 Bayesian learning of MNIST classifier

In this application we learn a classifier for images of the 10 handwritten digits. The classifier is trained on the MNIST data [17], which consists of $60{,}000$ pictures for training and $10{,}000$ pictures for testing. Each picture $\bm{x}_n$ is of size $28\times 28$. We pick $10{,}000$ random samples from the training set and convert each picture $\bm{x}_n$ into $25^2=625$ Random Fourier Features $\bm{\varphi}(\bm{x}_n)_j=\text{Re}(e^{-i\,\bm{v}_j^T\bm{x}_n})$ [24]. The 625 frequency vectors $\bm{v}_j$ are sampled from a zero-mean Gaussian with covariance matrix $\frac{1}{5^2}\,\bm{I}$. We use a one-vs-all strategy by learning $10$ classifiers at once. Each classifier is trained to distinguish $1$ particular class from all others. The forward model for our $10$ classifiers is then $\bm{y}=\bm{\varphi}(\bm{x})\,\bm{W}+\bm{e}$. Each column of $\bm{W}\in\mathbb{R}^{625\times 10}$ contains the model parameters of $1$ specific classifier. In order to predict the class of a sample $\bm{x}^*$ we compute $\bm{y}^*=\bm{\varphi}(\bm{x}^*)\,\bm{W}$ and apply the softmax function

$$\bm{\sigma}(\bm{y}^*)_k=\frac{e^{y^*_k}}{\sum_{l=1}^{10} e^{y^*_l}},\quad k=1,\ldots,10,\qquad \bm{\sigma}(\bm{y}^*)\in\mathbb{R}^{10}.$$

The prediction is then the class with the largest entry of $\bm{\sigma}(\bm{y}^*)$. The 10 classifiers are trained on a training data set of pictures $\bm{X}\in\mathbb{R}^{10{,}000\times 784}$ and corresponding class labels $\bm{Y}\in\mathbb{R}^{10{,}000\times 10}$. Our estimate for $\bm{W}$ is the mean of the posterior $p(\bm{W}\mid\bm{Y},\bm{X})$. The residual $\bm{e}$ is most commonly assumed to be zero-mean white Gaussian noise $p(\bm{e})=\mathcal{N}(\bm{0},\sigma_\epsilon^2\,\bm{I})$. Likewise, the prior $p(\bm{W})$ is usually assumed to be a zero-mean normal distribution with a scaled identity covariance matrix $\bm{P}_0=\sigma_P^2\,\bm{I}$. Such a prior corresponds to Tikhonov regularization. We compare the performance of the Tikhonov prior to other zero-mean $(\bm{A},\bm{b})$-constrained tensor priors (symmetric, Hankel and circulant), constructed using either Theorem 4.5 or Theorem 5.1. The noise variance $\sigma_\epsilon^2$ is set to a fixed value of 1.
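The feature map and the prediction rule can be sketched as follows. This is a minimal illustration, not the notebook implementation: the helper names are hypothetical, and any weight matrix `W` passed to `predict` stands in for the posterior mean, which is not computed here.

```python
import math
import random


def sample_frequencies(n_features, dim, std, rng):
    """Draw n_features frequency vectors from a zero-mean Gaussian N(0, std^2 I)."""
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]


def fourier_features(x, V):
    """phi(x)_j = Re(exp(-i v_j^T x)) = cos(v_j^T x)."""
    return [math.cos(sum(vk * xk for vk, xk in zip(v, x))) for v in V]


def softmax(scores):
    """Numerically stable softmax; subtracting the maximum leaves the result unchanged."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, V, W):
    """Return the index of the class with the largest softmax output of phi(x) W."""
    phi = fourier_features(x, V)
    n_classes = len(W[0])
    scores = [sum(phi[j] * W[j][c] for j in range(len(phi))) for c in range(n_classes)]
    probs = softmax(scores)
    return max(range(n_classes), key=probs.__getitem__)
```

For the setting described above one would call `sample_frequencies(625, 784, 1 / 5, rng)` with each picture flattened to a vector of length 784, since a variance of $1/5^2$ corresponds to a standard deviation of $1/5$. Because the softmax is monotone, the predicted class coincides with the largest entry of $\bm{y}^*$ itself.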

(a) When $\sigma_P^2=10^{-6}$, large differences between the different posteriors are observed. The corresponding classifiers are therefore expected to also behave differently.
(b) When $\sigma_P^2=10^{-3}$, all differences between the different posteriors have almost vanished. The corresponding classifiers are expected to also behave similarly.
Figure 2: Singular values of the square-root precision matrices of the posterior distribution for 4 different priors. The noise variance is fixed to 1.

The difference between these priors can be investigated by looking at the singular value profiles of the square-root precision matrices of the corresponding posteriors. These are shown in Figure 2(a) for $\sigma_P^2=10^{-6}$ and in Figure 2(b) for $\sigma_P^2=10^{-3}$. Being confident in the prior ($\sigma_P^2=10^{-6}$) has a strong effect on the corresponding posterior, which explains the large differences in singular value profiles. The corresponding classifiers can then be expected to also differ considerably on unseen test data. Indeed, applying the obtained classifiers to $10{,}000$ test images results in the fractions of correctly classified images shown in Table 2.

Table 2: Comparison of relative number of correctly classified images for classifiers learned with different priors. Best classifier indicated in bold.
| | Tikhonov | symmetric | Hankel | circulant |
|---|---|---|---|---|
| $\sigma_P^2=10^{-6}$ | 0.650 | 0.880 | **0.917** | 0.915 |
| $\sigma_P^2=10^{-3}$ | 0.917 | 0.918 | **0.920** | 0.919 |

All $(\bm{A},\bm{b})$-constrained priors outperform the conventional Tikhonov prior, with the Hankel and circulant tensor priors performing best. When the prior variance is increased to $\sigma_P^2=10^{-3}$, all singular value profiles become very similar and the corresponding classifiers have similar performance, as seen in Table 2. In this regime the Hankel and circulant priors no longer yield a significant classification improvement.

8 Conclusions

A whole new class of Bayesian priors has been worked out, which could potentially be applied to a variety of different inverse problems. The main focus of this article was on the theoretical foundation; where possible we discussed practical implementations without going into much detail. Although the curse of dimensionality that arises for tensors of large order and dimension can be completely resolved via the corresponding dual problem, the computational complexity can still become prohibitively large with increasing sample size. To tackle this complexity, the possibility of representing the prior mean vector and covariance matrix of these priors as exact low-rank tensor decompositions could be investigated.

Acknowledgments

Many thanks to Frederiek Wesel for valuable discussions and feedback.

References

  • [1] J. M. Bardsley, Computational Uncertainty Quantification for Inverse Problems, SIAM, 2018.
  • [2] K. Batselier, Low-rank tensor decompositions for nonlinear system identification: A tutorial with examples, IEEE Control Systems Magazine, 42 (2022), pp. 54–74.
  • [3] K. Batselier, Z. Chen, and N. Wong, Tensor Network alternating linear scheme for MIMO Volterra system identification, Automatica, 84 (2017), pp. 26–35.
  • [4] K. Batselier and N. Wong, A constructive arbitrary-degree Kronecker product decomposition of tensors, Numerical Linear Algebra with Applications, 24 (2017), p. e2097.
  • [5] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical computing, SIAM review, 59 (2017), pp. 65–98.
  • [6] M. Blondel, M. Ishihata, A. Fujino, and N. Ueda, Polynomial networks and factorization machines: New insights and efficient training algorithms, in International Conference on Machine Learning, PMLR, 2016, pp. 850–858.
  • [7] J. Chung and S. Gazzola, Computational Methods for Large-Scale Inverse Problems: A Survey on Hybrid Projection Methods, SIAM Review, 66 (2024), pp. 205–284.
  • [8] J. Chung and A. K. Saibaba, Generalized Hybrid Iterative Methods for Large-Scale Bayesian Inverse Problems, SIAM Journal on Scientific Computing, 39 (2017), pp. S24–S46.
  • [9] H. F. de Groote, On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping, Theoretical Computer Science, 7 (1978), pp. 1–24.
  • [10] C. L. Epstein, Introduction to the mathematics of medical imaging, SIAM, 2007.
  • [11] D. F. Gleich, L.-H. Lim, and Y. Yu, Multilinear pagerank, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 1507–1541.
  • [12] G. H. Golub and C. F. Van Loan, Matrix computations, JHU press, 2013.
  • [13] P. C. Hansen, J. G. Nagy, and D. P. O'Leary, Deblurring images: matrices, spectra, and filtering, SIAM, 2006.
  • [14] N. Kargas and N. D. Sidiropoulos, Supervised learning and canonical decomposition of multivariate functions, IEEE Transactions on Signal Processing, 69 (2021), pp. 1097–1107.
  • [15] C.-Y. Ko, K. Batselier, L. Daniel, W. Yu, and N. Wong, Fast and accurate tensor completion with total variation regularized tensor trains, IEEE Transactions on Image Processing, 29 (2020), pp. 6918–6931.
  • [16] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM review, 51 (2009), pp. 455–500.
  • [17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
  • [18] W. Li and M. K. Ng, On the limiting probability distribution of a transition probability tensor, Linear and Multilinear Algebra, 62 (2014), pp. 362–385.
  • [19] J. Liu, P. Musialski, P. Wonka, and J. Ye, Tensor completion for estimating missing values in visual data, IEEE transactions on pattern analysis and machine intelligence, 35 (2012), pp. 208–220.
  • [20] N. Mastronardi, P. Lemmerling, and S. Van Huffel, Fast structured total least squares algorithm for solving the basic deconvolution problem, SIAM Journal on Matrix Analysis and Applications, 22 (2000), pp. 533–553.
  • [21] A. Novikov, I. Oseledets, and M. Trofimov, Exponential machines, Bulletin of the Polish Academy of Sciences: Technical Sciences, 66 (2018), pp. 789–797.
  • [22] G. Pillonetto and G. De Nicolao, A new kernel-based approach for linear system identification, Automatica, 46 (2010), pp. 81–93.
  • [23] Project Jupyter, Matthias Bussonnier, Jessica Forde, Jeremy Freeman, Brian Granger, Tim Head, Chris Holdgraf, Kyle Kelley, Gladys Nalvarte, Andrew Osheroff, M. Pacer, Yuvi Panda, Fernando Perez, Benjamin Ragan Kelley, and Carol Willing, Binder 2.0 - Reproducible, interactive, sharable environments for science at scale, in Proceedings of the 17th Python in Science Conference, Fatih Akici, David Lippa, Dillon Niederhut, and M. Pacer, eds., 2018, pp. 113 – 120.
  • [24] A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in neural information processing systems, 20 (2007).
  • [25] S. Särkkä and L. Svensson, Bayesian filtering and smoothing, vol. 17, Cambridge university press, 2023.
  • [26] E. Stoudenmire and D. J. Schwab, Supervised learning with tensor networks, Advances in neural information processing systems, 29 (2016).
  • [27] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
  • [28] F. van der Plas and M. Bocheński, fonsp/pluto.jl: v0.19.42, May 2024.
  • [29] C. F. Van Loan, The ubiquitous Kronecker product, Journal of computational and applied mathematics, 123 (2000), pp. 85–100.
  • [30] S. Wahls, V. Koivunen, H. V. Poor, and M. Verhaegen, Learning multidimensional Fourier series with tensor trains, in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2014, pp. 394–398.
  • [31] F. Wesel and K. Batselier, Large-Scale Learning with Fourier Features and Tensor Decompositions, Advances in Neural Information Processing Systems, 34 (2021), pp. 17543–17554.
  • [32] C. K. Williams and C. E. Rasmussen, Gaussian processes for machine learning, vol. 2, MIT press Cambridge, MA, 2006.