
Constructing structured tensor priors for Bayesian inverse problems

Funding: This publication is part of the project Sustainable learning for Artificial Intelligence from noisy large-scale data (with project number VI.Vidi.213.017), which is financed by the Dutch Research Council (NWO).

Kim Batselier, Delft Center for Systems and Control, Delft University of Technology, The Netherlands ([email protected])

Abstract

Specifying a prior distribution is an essential part of solving Bayesian inverse problems. The prior encodes a belief on the nature of the solution and this regularizes the problem. In this article we completely characterize a Gaussian prior that encodes the belief that the solution is a structured tensor. We first define the notion of $(\bm{A},\bm{b})$ -constrained tensors and show that they describe a large variety of different structures such as Hankel, circulant, triangular, symmetric, and so on. Then we completely characterize the Gaussian probability distribution of such tensors by specifying its mean vector and covariance matrix. Furthermore, explicit expressions are proved for the covariance matrix of tensors whose entries are invariant under a permutation. These results unlock a whole new class of priors for Bayesian inverse problems. We illustrate how new kernel functions can be designed and efficiently computed and apply our results on two particular Bayesian inverse problems: completing a Hankel matrix from a few noisy measurements and learning an image classifier of handwritten digits. The effectiveness of the proposed priors is demonstrated for both problems. All applications have been implemented as reactive Pluto notebooks in Julia.

keywords:

Bayesian inverse problems, structured tensors, tensors, kernel methods

MSC codes:

15A29, 15A69, 62F15

1 Introduction

We consider a set of data samples $\{(\bm{x}_{n},y_{n})\,|\,\bm{x}_{n}\in\mathbb{R}^{D},\,y_{n}\in\mathbb{R}\}_{n=1}^{N}$ and the following linear forward model

(1)

\displaystyle{y}_{n}=\langle\bm{\mathcal{P}}(\bm{x}_{n}),\bm{\mathcal{W}}\rangle+\epsilon_{n}.

Each scalar measurement $y_{n}$ is obtained from an inner product of a data-dependent tensor $\bm{\mathcal{P}}(\bm{x}_{n})\in\mathbb{R}^{J_{1}\times\cdots\times J_{D}}$ with a tensor of unknown latent variables $\bm{\mathcal{W}}\in\mathbb{R}^{J_{1}\times\cdots\times J_{D}}$ , corrupted by measurement noise $\epsilon_{n}$ . Tensors in this context are $D$ -dimensional arrays, with vectors $(D=1)$ and matrices $(D=2)$ being the most common cases. Vectorizing all tensors and collecting the measurements $y_{1},\ldots,y_{N}$ into a vector $\bm{y}\in\mathbb{R}^{N}$ allows (1) to be rewritten into the linear system of equations

(2)

\displaystyle\bm{{y}}=\bm{\Phi}(\bm{x})\,\bm{w}+\bm{\epsilon}.

Row $n$ of the matrix $\bm{\Phi}(\bm{x})\in\mathbb{R}^{N\times J_{1}\cdots J_{D}}$ contains the vectorization of the tensor $\bm{\mathcal{P}}(\bm{x}_{n})$ . For notational convenience the dependence of $\bm{\Phi}$ on $\bm{x}$ is dropped from here on. The inverse problem consists of inferring the latent variables $\bm{w}$ from the noisy measurements $\bm{y}$ . Inverse problems of this kind appear in many application fields such as machine learning [6, 26, 27, 31, 32], control [2, 3, 22, 25], and signal processing [10, 13, 14, 15, 19, 20, 30]. In this article a Bayesian approach [1] is considered by assuming that $\bm{w}$ and $\bm{\epsilon}$ are random variables. The goal is then to infer the posterior distribution $p(\bm{w}|\bm{y})$ of $\bm{w}$ conditioned on the measurements $\bm{y}$ using Bayes’ theorem

\displaystyle p(\bm{w}|\bm{y})=\frac{p(\bm{y}|\bm{w})\;p(\bm{w})}{p(\bm{y})}.

The distribution $p(\bm{w})$ is called the prior and encodes a belief on what $\bm{w}$ is before the measurements are known. The main contribution of this article is the complete characterization of a prior $p(\bm{w})$ that encodes the belief that the corresponding tensor $\bm{\mathcal{W}}$ is structured. A Gaussian distribution is assumed for the noise distribution $p(\bm{\epsilon})=\mathcal{N}(\bm{0},\bm{\Sigma})$ with mean vector $\bm{0}$ and covariance matrix $\bm{\Sigma}$ and likewise for the prior $p(\bm{w})=\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ . The linear forward model (2) combined with the Gaussian assumptions results in a Gaussian posterior $p(\bm{w}|\bm{y})=\mathcal{N}(\bm{w}_{+},\bm{P}_{+})$ with mean vector $\bm{w}_{+}$ and covariance matrix $\bm{P}_{+}$

(3)		$\displaystyle\bm{w}_{+}$	$\displaystyle=(\bm{P}_{0}^{-1}+\bm{\Phi}^{T}\bm{\Sigma}^{-1}\bm{\Phi})^{-1}\,(\bm{\Phi}^{T}\bm{\Sigma}^{-1}{\bm{y}}+\bm{P}_{0}^{-1}\bm{w}_{0}),$
(4)		$\displaystyle\bm{P}_{+}$	$\displaystyle=(\bm{P}_{0}^{-1}+\bm{\Phi}^{T}\bm{\Sigma}^{-1}\bm{\Phi})^{-1}.$
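To make (3) and (4) concrete, the following minimal sketch evaluates them for a single scalar latent variable ( $J_{1}\cdots J_{D}=1$ ) under the additional assumption of independent noise with common variance, $\bm{\Sigma}=\sigma^{2}\bm{I}$ . The function name `scalar_posterior` and its arguments are illustrative and do not come from the article.

```python
def scalar_posterior(phi, y, noise_var, w0, p0):
    """Posterior mean and variance of a scalar weight w (the case J = 1).

    phi: list of N regressors, y: list of N measurements,
    noise_var: variance of each noise term (assumes Sigma = noise_var * I),
    w0, p0: prior mean and prior variance.
    """
    # Equation (4): prior precision plus data precision, then invert.
    precision = 1.0 / p0 + sum(f * f for f in phi) / noise_var
    p_plus = 1.0 / precision
    # Equation (3): P_+ times (Phi^T Sigma^{-1} y + P_0^{-1} w_0).
    rhs = sum(f * t for f, t in zip(phi, y)) / noise_var + w0 / p0
    return p_plus * rhs, p_plus


# Without any data the posterior coincides with the prior.
mean_no_data, var_no_data = scalar_posterior([], [], 0.1, w0=2.0, p0=4.0)

# Two measurements of w itself (phi_n = 1) with unit noise variance.
mean_data, var_data = scalar_posterior([1.0, 1.0], [1.0, 3.0], 1.0, w0=0.0, p0=1.0)
```

In the second call the data precision is 2 and the prior precision is 1, so the posterior variance shrinks to 1/3 and the mean is pulled from the prior mean 0 towards the sample average 2.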

The role of the prior $p(\bm{w})$ can now be understood from (3) and (4). In the absence of data ( $\bm{\Phi}=\bm{0}$ and $\bm{y}=\bm{0}$ ) the posterior equals the prior. In other words, the prior encodes a belief on what the solution $\bm{w}$ of (2) should be before any data is known. A natural question to ask is then what kind of prior to use. In this article we consider a prior encoding the belief that the tensor $\bm{\mathcal{W}}$ has a structure that is completely determined by a matrix $\bm{A}\in\mathbb{R}^{I\times J_{1}\cdots J_{D}}$ and vector $\bm{b}\in\mathbb{R}^{I}$ such that

\displaystyle\bm{A}\;\operatorname{vec}{(\bm{\mathcal{W}})}=\bm{b},

which we will refer to as $(\bm{A},\bm{b})$ -constrained tensors. The contributions of this article are threefold.

1. We show how the definition of $(\bm{A},\bm{b})$ -constrained tensors is well-motivated since it encompasses a wide variety of relevant structured tensors. Examples are given for tensors with fixed entries, tensors with known sums of entries and symmetric, Hankel, Toeplitz, circulant, and triangular tensors.

2. In Theorem 3.1 we completely characterize the mean vector $\bm{w}_{0}$ and covariance matrix $\bm{P}_{0}$ of the prior $p(\bm{w})$ for $(\bm{A},\bm{b})$ -constrained tensors.

3. In Theorems 4.6 and 5.1 we provide explicit expressions for $\bm{P}_{0}$ for $(\bm{A},\bm{b})$ -constrained tensors whose entries remain invariant under a permutation $\bm{P}$ . Such tensors will be called $\bm{P}$ -invariant or skew- $\bm{P}$ -invariant.

These three contributions are important because the prior mean $\bm{w}_{0}$ and covariance matrix $\bm{P}_{0}$ are necessary to solve the Bayesian inverse problem via equations (3) and (4). Contrary to most solution strategies for linear least squares problems, the matrix inverse of $\bm{P}_{0}^{-1}+\bm{\Phi}^{T}\bm{\Sigma}^{-1}\bm{\Phi}$ is explicitly required as it forms the posterior covariance. Also note that the dimension of the matrix to invert is $J_{1}J_{2}\ldots J_{D}\times J_{1}J_{2}\ldots J_{D}$ , which limits the use of direct solvers to cases of small $J$ and $D$ . Hybrid projection methods [7, 8] are a viable alternative for cases where $J$ and $D$ are prohibitively large. Another alternative is to solve the corresponding dual problem, which is described in terms of the so-called kernel matrix $\bm{\Phi}\,\bm{P}_{0}\bm{\Phi}^{T}\in\mathbb{R}^{N\times N}$ . This approach is commonly used in least-squares support vector machines [27] and Gaussian Processes [32] and has a computational complexity of at least $O(N^{2})$ . When the tensor $\bm{\mathcal{P}}(\bm{x}_{n})$ exhibits a low-rank structure, another way to obtain a low computational complexity for solving (3) is to impose a low-rank tensor structure on $\bm{w}_{+}$ and $\bm{P}_{+}$ [3, 21, 26]. Developing dedicated solution strategies for equations (3) and (4), however, lies outside the scope of this article.

1.1 Notation

Tensors in this article are multi-dimensional arrays with real entries. We denote scalars by italic letters $a,b,\ldots$ , vectors by boldface italic letters $\bm{a},\bm{b},\ldots$ , matrices by boldface capitalized italic letters $\bm{A},\bm{B},\ldots$ and higher-order tensors by boldface calligraphic italic letters $\bm{\mathcal{A}},\bm{\mathcal{B}},\ldots$ . The vector $\bm{e}_{j_{d}}\in\mathbb{R}^{J_{d}}$ denotes a canonical basis vector that has a single nonzero unit entry at position $j_{d}$ . The vector $\bm{1}_{J_{d}}\in\mathbb{R}^{J_{d}}$ denotes a vector of ones and $\bm{I}_{J_{d}}\in\mathbb{R}^{J_{d}\times J_{d}}$ is the unit matrix. The number of indices required to determine an entry of a tensor is called the order of the tensor. A $D$ th order or $D$ -way tensor is hence denoted $\bm{\mathcal{A}}\in\mathbb{R}^{J_{1}\times J_{2}\times\cdots\times J_{D}}$ . An index $j_{d}$ always satisfies $1\leq j_{d}\leq J_{d}$ , where $J_{d}$ is called the dimension of that particular mode. Tensor entries are denoted $w_{j_{1},j_{2},\cdots,j_{D}}$ . The merger of a set of $d$ separate indices $j_{1},j_{2},\ldots,j_{d}$ is denoted by the single index

\overline{j_{1}j_{2}\ldots j_{d}}=j_{1}+(j_{2}-1)\,J_{1}+\cdots+(j_{d}-1)J_{1}% \cdots J_{d-1}.

For a tensor $\bm{\mathcal{W}}$ we always denote its vectorization by $\bm{w}=\textrm{vec}(\bm{\mathcal{W}})$ . The square root matrix $\sqrt{\bm{P}}$ of $\bm{P}$ satisfies by definition $\bm{P}=\sqrt{\bm{P}}\,(\sqrt{\bm{P}})^{T}.$
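The merged index above is the usual column-major linear index with 1-based indices. A minimal sketch, with the hypothetical helper name `merge_index`, reads:

```python
def merge_index(js, dims):
    """Merged 1-based index of the 1-based indices js for mode sizes dims.

    Implements j_1 + (j_2 - 1) J_1 + ... + (j_d - 1) J_1 ... J_{d-1}.
    """
    index, stride = 1, 1
    for j, J in zip(js, dims):
        if not 1 <= j <= J:
            raise ValueError("index out of range")
        index += (j - 1) * stride
        stride *= J  # product J_1 ... J_d of the modes processed so far
    return index


# Merged indices of the index pairs (1,2), (1,3) and (2,3) for J_1 = J_2 = 3.
pairs = [merge_index(p, (3, 3)) for p in [(1, 2), (1, 3), (2, 3)]]
```

The first index varies fastest, so vectorizing a matrix with this convention stacks its columns.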

2 $(\boldsymbol{A},\boldsymbol{b})$ -constrained tensors

Before characterizing the prior $p(\bm{w})$ we first demonstrate the breadth of $(\bm{A},\bm{b})$ -constrained tensors through eight examples. These examples demonstrate that the definition of $(\bm{A},\bm{b})$ -constrained tensors is well-motivated in that it captures a wide variety of structured tensors.

2.1 Tensors with fixed entries

A tensor $\bm{\mathcal{W}}\in\mathbb{R}^{J_{1}\times J_{2}\times\cdots\times J_{D}}$ with $I$ fixed entries can be described as $\bm{A}\,\bm{w}=\bm{b}$ where row $i$ of the matrix $\bm{A}\in\mathbb{R}^{I\times J_{1}\cdots J_{D}}$ is a canonical basis vector $\bm{e}_{\overline{j_{1}\cdots j_{D}}}$ that selects entry $w_{j_{1},\ldots,j_{D}}$ . The corresponding fixed numerical value of $w_{j_{1},\ldots,j_{D}}$ is then given by $b_{i}$ . Such fixed values are in practice usually zero, for example in triangular or banded matrices. Such structures can also be generalized to the tensor case.

Definition 2.1.

A tensor $\bm{\mathcal{W}}\in\mathbb{R}^{J_{1}\times J_{2}\times\cdots\times J_{D}}$ is lower (upper) triangular when $w_{j_{1},j_{2},\cdots,j_{D}}=0$ holds for each consecutive index pair $j_{d},j_{d+1}$ such that $j_{d}-j_{d+1}<(>)\,0$ .

The characterization of a lower (upper) triangular tensor as an $(\bm{A},\bm{b})$ -constrained tensor is given in the following lemma.

Lemma 2.2.

Let $\bm{L}$ be the $J(J-1)/2\times J^{2}$ matrix that has on each row a single unit entry for each particular occurrence of $j_{1}-j_{2}<(>)\,0$ . Lower (upper) triangular tensors are then described by

\displaystyle\bm{A}=\begin{pmatrix}\bm{L}\otimes\bm{I}_{J}\otimes\cdots\otimes\bm{I}_{J}\\ \bm{I}_{J}\otimes\bm{L}\otimes\cdots\otimes\bm{I}_{J}\\ \vdots\\ \bm{I}_{J}\otimes\bm{I}_{J}\otimes\cdots\otimes\bm{L}\end{pmatrix}\in\mathbb{R}^{\frac{(D-1)(J-1)J^{D-1}}{2}\times J^{D}},

and a vector $\bm{b}\in\mathbb{R}^{\frac{(D-1)(J-1)J^{D-1}}{2}}$ of zeros.

Proof 2.3.

The known fixed values of lower (upper) triangular tensors are zero and hence $\bm{b}$ is a vector of zeros. Each row of the matrix $\bm{A}$ has a single unit entry to select a particular tensor entry for which some consecutive indices $j_{d},j_{d+1}$ satisfy $j_{d}-j_{d+1}<(>)\,0$ . A tensor with $D$ indices has $D-1$ consecutive index pairs and therefore $\bm{A}$ is partitioned into $D-1$ block rows. Each block row is a Kronecker product of $D-2$ identity matrices with $\bm{L}$ . The Kronecker product of identity matrices generates all possible index combinations of $D-2$ index values. The $\bm{L}$ matrix factor in the Kronecker product adds the remaining 2 indices but only considers index pairs for which $j_{d}-j_{d+1}<(>)\,0$ .

The $\bm{A}$ matrix that describes tensors with known fixed entries in Lemma 2.2 is sparse and highly structured as demonstrated by the following example.

Example 2.4.

Consider a lower triangular tensor $\bm{\mathcal{W}}\in\mathbb{R}^{3\times 3\times 3}$ . The condition $j_{d}-j_{d+1}<0$ occurs in 3 cases $(j_{d},j_{d+1})\in\{(1,2),(1,3),(2,3)\}$ . Defining the matrix $\bm{L}\in\mathbb{R}^{3\times 9}$ with 3 nonzero entries

\displaystyle l_{1,\overline{12}}=l_{2,\overline{13}}=l_{3,\overline{23}}=1

allows us to describe the desired $\bm{A}$ matrix as

(5)

\displaystyle\bm{A}=\begin{pmatrix}\bm{I}_{3}\otimes\bm{L}\\ \bm{L}\otimes\bm{I}_{3}\end{pmatrix}\in\mathbb{R}^{18\times 27}.

This particular sparse structure is exploited in Section 3 when a basis for the nullspace of $\bm{A}$ needs to be computed. Note that there are actually only 17 zero entries for which $j_{d}-j_{d+1}<0$ , which implies that the $\bm{A}$ matrix from equation (5) counts the case $j_{1}=1,j_{2}=2,j_{3}=3$ twice. This, however, does not negatively affect the resulting prior.
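The row count of Example 2.4 can be checked by brute-force enumeration. The sketch below lists, for each consecutive index pair, the tensor entries that a row of $\bm{A}$ would select. It does not reproduce the row ordering of (5), only the set of selected entries.

```python
from itertools import product

J, D = 3, 3
rows = []  # one item per row of A: the tensor index selected by that row
for d in range(D - 1):  # consecutive index pairs (j_d, j_{d+1})
    for idx in product(range(1, J + 1), repeat=D):
        if idx[d] - idx[d + 1] < 0:  # lower triangular condition
            rows.append(idx)

num_rows = len(rows)               # number of rows of A in (5)
num_zero_entries = len(set(rows))  # distinct entries that are forced to zero
duplicates = sorted({i for i in rows if rows.count(i) > 1})
```

The only entry selected twice is the one with $j_{1}<j_{2}<j_{3}$ , which is the redundancy noted above.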

2.2 Known sum of entries

Tensors for which the sum over all or only particular entries adds up to a known value are also quite common in applications. Stochastic tensors are a particular example [11, 18]. Knowing a particular sum of entries can be described as follows.

Lemma 2.5.

Tensors $\bm{\mathcal{W}}\in\mathbb{R}^{J_{1}\times\cdots\times J_{D}}$ for which the sum over the entries of an index set $\mathcal{J}$ is a tensor $\bm{\mathcal{B}}$ are described by

(6)

\displaystyle\bm{A}\;\bm{w}=\operatorname{vec}{(\bm{\mathcal{B}})}\;\textrm{ with }\;\bm{A}=\bm{A}_{D}\otimes\cdots\otimes\bm{A}_{1},

where each matrix $\bm{A}_{d}\;(d=1,\ldots,D)$ in the Kronecker product is by definition

(7)

\bm{A}_{d}=\begin{cases}\bm{1}_{J_{d}}^{T}&\text{if }j_{d}\in\mathcal{J},\\ \bm{I}_{J_{d}}&\text{if }j_{d}\notin\mathcal{J}.\end{cases}

The Kronecker product in (6) has as its leftmost factor $d=D$ and runs towards $d=1$ due to the opposite ordering of indices in the Kronecker product.

Proof 2.6.

With the definitions of the $\bm{A}_{d}$ matrices the sum over the relevant entries of $\bm{\mathcal{W}}$ is written in terms of n-mode products [16, p. 460]

\displaystyle\bm{\mathcal{W}}\times_{1}\bm{A}_{1}\times_{2}\cdots\times_{D}\bm{A}_{D}=\bm{\mathcal{B}}.

Using the vectorization operation this can be rewritten as

\displaystyle\left(\bm{A}_{D}\otimes\cdots\otimes\bm{A}_{1}\right)\;\bm{w}=\operatorname{vec}{(\bm{\mathcal{B}})}=\bm{b},

which finalizes the proof.

Example 2.7.

Let $\bm{W}\in\mathbb{R}^{2\times 3}$ be a matrix for which each row sum equals 1. Lemma 2.5 then implies that

\displaystyle\bm{A}=\begin{pmatrix}1&1&1\end{pmatrix}\otimes\begin{pmatrix}1&0\\ 0&1\end{pmatrix},\;\bm{b}=\bm{1}_{2}.
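Example 2.7 can be verified numerically with a small Kronecker product routine. The helper `kron` and the numerical values of the matrix are illustrative choices, and the vectorization stacks columns in agreement with the merged index of Section 1.1.

```python
def kron(A, B):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


# A = 1_3^T kron I_2, a 2 x 6 matrix.
A = kron([[1, 1, 1]], [[1, 0], [0, 1]])

# A 2 x 3 matrix whose rows each sum to 1.
W = [[0.2, 0.3, 0.5],
     [0.6, 0.1, 0.3]]

# Column-major vectorization: w = (w_11, w_21, w_12, w_22, w_13, w_23).
w = [W[i][j] for j in range(3) for i in range(2)]

# A w returns the two row sums, that is, the vector b = 1_2.
b = [sum(a * x for a, x in zip(row, w)) for row in A]
```

Each row of $\bm{A}$ picks every second entry of $\bm{w}$ , which corresponds to one row of $\bm{W}$ .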

2.3 Eigenvector structure

Tensors whose vectorization is an eigenvector of a matrix $\bm{P}$ with eigenvalue $\lambda$ are described by the constraint $\bm{A}=\lambda\,\bm{I}-\bm{P}$ and $\bm{b}=\bm{0}$ . An important structure in this article is obtained when $\bm{P}$ is a permutation matrix. Indeed, $\bm{P}\,\bm{w}=\bm{w}$ then implies that the entries of $\bm{\mathcal{W}}$ remain invariant under the permutation $\bm{P}$ . The distinction between $\lambda=1$ and $\lambda=-1$ is made explicit through the following two definitions.

Definition 2.8.

Let $\bm{P}\in\mathbb{R}^{J^{D}\times J^{D}}$ be a permutation matrix. A $\bm{P}$ -invariant tensor $\bm{\mathcal{W}}$ is defined by

\displaystyle\left(\bm{I}-\bm{P}\right)\bm{w}=\bm{0}\Leftrightarrow\bm{P}\,\bm{w}=\bm{w}.

Likewise, a skew- $\bm{P}$ -invariant tensor $\bm{\mathcal{W}}$ satisfies by definition

\displaystyle\left(-\bm{I}-\bm{P}\right)\bm{w}=\bm{0}\Leftrightarrow\bm{P}\,\bm{w}=-\bm{w}.

In this way any permutation matrix $\bm{P}$ defines a corresponding structured tensor. Next we discuss some prominent examples of $\bm{P}$ -invariant tensor structures.

Definition 2.9.

(Cyclic Symmetric tensor [4]) The cyclic index shift permutation matrix $\bm{C}$ of a $D$ -way tensor $\bm{\mathcal{W}}$ is the $J^{D}\times J^{D}$ permutation matrix

\displaystyle\bm{C}\;=\;\begin{pmatrix}\bm{I}(1:J^{D-1}:J^{D},:)\\ \bm{I}(2:J^{D-1}:J^{D},:)\\ \vdots\\ \bm{I}(J^{D-1}:J^{D-1}:J^{D},:)\\ \end{pmatrix},

where $\bm{I}$ is the $J^{D}\times J^{D}$ identity matrix and Matlab colon notation is used to denote submatrices. A $\bm{C}$ -invariant tensor $\bm{\mathcal{W}}$ is then called a cyclic symmetric tensor.

Defining the vector $\tilde{\bm{w}}:=\bm{C}\;\textrm{vec}(\bm{\mathcal{W}})$ it can be verified that

\displaystyle\tilde{w}_{j_{D},j_{1},\ldots j_{D-1}}=w_{j_{1},\ldots j_{D-1},j_{D}}.

In other words, $\bm{C}$ performs a cyclic shift of the indices to the right. When $D=2$ , then $\bm{C}$ uniquely defines $J\times J$ symmetric matrices $\bm{W}$ since the cyclic index shift property implies that $\tilde{w}_{j_{2},j_{1}}=w_{j_{1},j_{2}}$ [29]. The case $D>2$ does not result in a fully symmetric tensor, as for example the required index permutation $j_{1},j_{2},j_{3}\rightarrow j_{1},j_{3},j_{2}$ would not be enforced by $\bm{C}$ . $\bm{C}$ -invariance is therefore a weaker constraint than full symmetry.
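The index shift property of $\bm{C}$ can be checked numerically. In the sketch below the permutation is stored as a 0-based index vector `c` with $(\bm{C}\,\bm{w})_{i}=w_{c_{i}}$ , which is one possible encoding of the row selection in Definition 2.9. The helper names are illustrative.

```python
from itertools import product


def cyclic_shift_perm(J, D):
    """0-based index vector c with (C w)[i] = w[c[i]] for C of Definition 2.9."""
    stride = J ** (D - 1)
    # Block k stacks rows k, k + J^(D-1), k + 2 J^(D-1), ... of the identity.
    return [k + m * stride for k in range(stride) for m in range(J)]


def lin(idx, J):
    """0-based column-major linear index of a 0-based multi-index."""
    return sum(j * J ** d for d, j in enumerate(idx))


J, D = 3, 3
c = cyclic_shift_perm(J, D)

# Entry (j_D, j_1, ..., j_{D-1}) of the shifted tensor equals entry
# (j_1, ..., j_D) of the original tensor.
shift_ok = all(
    c[lin((idx[-1],) + idx[:-1], J)] == lin(idx, J)
    for idx in product(range(J), repeat=D)
)

# For D = 2 the permutation is the vectorized matrix transpose.
transpose_perm = cyclic_shift_perm(2, 2)
```

For a $2\times 2$ matrix the index vector swaps the two off-diagonal positions, so invariance under $\bm{C}$ is exactly matrix symmetry.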

Definition 2.10.

(Symmetric tensor) Let $\bm{S}$ be the permutation matrix such that all entries of $\tilde{\bm{w}}:=\bm{S}\;\textrm{vec}(\bm{\mathcal{W}})$ satisfy $\tilde{w}_{j_{1},\ldots,j_{D}}=w_{\pi(j_{1},\ldots,j_{D})}$ , where $\pi(j_{1},\ldots,j_{D})$ is any permutation of the indices. An $\bm{S}$ -invariant tensor $\bm{\mathcal{W}}$ is by definition a symmetric tensor.

Definition 2.11.

(Centrosymmetric tensor [4]) A $\bm{J}$ -invariant tensor $\bm{\mathcal{W}}$ , where $\bm{J}$ is the column-reversed identity matrix, is called a centrosymmetric tensor.

A centrosymmetric tensor $\bm{\mathcal{W}}$ satisfies

\displaystyle w_{j_{1},\ldots,j_{D}}=w_{J_{1}-j_{1}+1,\ldots,J_{D}-j_{D}+1}.

Probably the most famous tensor that exhibits centrosymmetry is the matrix-matrix multiplication tensor [9].

Definition 2.12.

(Hankel Tensor) Let $\bm{H}\in\mathbb{R}^{J^{D}\times J^{D}}$ be the permutation matrix that cyclically permutes all $D$ indices $j_{1},\ldots,j_{D}$ with constant index sum $j_{1}+\cdots+j_{D}$ . A $\bm{H}$ -invariant tensor $\bm{\mathcal{W}}$ is called a Hankel tensor.

The minimal index sum is $D=1+1+1+\cdots+1$ and the maximal index sum is $JD=J+J+\cdots+J$ . This implies that $\bm{H}$ consists of $JD-D+1$ permutation cycles, so that the nullspace of $\bm{I}-\bm{H}$ has dimension $JD-D+1$ .

Definition 2.13.

(Toeplitz Tensor) Let $\bm{T}\in\mathbb{R}^{J^{D}\times J^{D}}$ be the permutation matrix that cyclically permutes all indices $j_{d}\mapsto j_{d}+1$ , where $J_{d}+1\mapsto 1\;(d=1,\ldots,D)$ . A $\bm{T}$ -invariant tensor $\bm{\mathcal{W}}$ is called a Toeplitz tensor.

A special case of a Toeplitz tensor is a circulant tensor.

Definition 2.14.

(Circulant Tensor) Let $\bm{T}\in\mathbb{R}^{J^{D}\times J^{D}}$ be the permutation matrix that cyclically permutes all indices $j_{d}\mapsto\bmod(j_{d}+1,J_{d})\neq 0$ . If $\bmod(j_{d}+1,J_{d})=0$ then $j_{d}\mapsto J_{d}\;(d=1,\ldots,D)$ . A $\bm{T}$ -invariant tensor $\bm{\mathcal{W}}$ is called a circulant tensor.

3 Full characterization of the prior distribution

In this section the Gaussian prior $p(\bm{w})$ for $(\bm{A},\bm{b})$ -constrained tensors is fully characterized. We also discuss how the square root covariance matrix $\sqrt{\bm{P}_{0}}$ can be computed without explicitly constructing the matrix $\bm{A}$ through a block-row partitioning of $\bm{A}$ .

Theorem 3.1.

The Gaussian distribution of $(\bm{A},\bm{b})$ -constrained tensors $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ is described by a mean vector $\bm{w}_{0}$ such that $\bm{A}\,\bm{w}_{0}=\bm{b}$ and by a covariance matrix $\bm{P}_{0}$ such that the columns of $\sqrt{\bm{P}_{0}}$ span the right nullspace of $\bm{A}$ .

Proof 3.2.

Let $\bm{x}\in\mathbb{R}^{J_{1}\ldots J_{D}}$ be a sample of the standard normal distribution $\mathcal{N}(\bm{0},\bm{I})$ . A sample $\bm{w}$ of the desired Gaussian distribution is then

\displaystyle\bm{w}=\bm{w}_{0}+\sqrt{\bm{P}_{0}}\;\bm{x},

where $\sqrt{\bm{P}_{0}}$ is the matrix square root of the covariance matrix $\bm{P}_{0}$ . Any sample $\bm{w}$ being an $(\bm{A},\bm{b})$ -constrained tensor implies

(8)

\displaystyle\bm{A}\;\bm{w}=\bm{A}\;\bm{w}_{0}+\bm{A}\;\sqrt{\bm{P}_{0}}\;\bm{x}=\bm{b}.

Equation (8) holds for all random samples $\bm{x}$ if and only if

	$\displaystyle\bm{A}\;\bm{w}_{0}$	$\displaystyle=\bm{b},$
	$\displaystyle\bm{A}\;\sqrt{\bm{P}_{0}}$	$\displaystyle=\bm{0}.$

In other words, the mean $\bm{w}_{0}$ of the prior also has to satisfy the linear constraint and the columns of $\sqrt{\bm{P}_{0}}$ span the right nullspace of $\bm{A}$ .

3.1 Recursive nullspace computation

When the matrix $\bm{A}$ is too large to construct explicitly, it is beneficial to compute a basis for its right nullspace recursively. This is possible when considering a partitioning into $S$ block-rows $\bm{A}=\begin{pmatrix}\bm{A}_{1}^{T}&\bm{A}_{2}^{T}&\ldots&\bm{A}_{S}^{T}\end{pmatrix}^{T}.$ Algorithm 1 recursively computes a basis for this nullspace without ever explicitly constructing $\bm{A}$ using Theorem 6.4.1 from [12, p. 329].

Algorithm 1 Compute basis $\bm{V}_{2}$ for nullspace for block-row partitioned $\bm{A}$ matrix

Require: $\bm{A}_{1},\bm{A}_{2},\ldots,\bm{A}_{S}$

$\bm{V}_{2}\leftarrow\textrm{null}(\bm{A}_{1})$

for $s=2:S$ do

    $\bm{Z}_{s}\leftarrow\textrm{null}(\bm{A}_{s}\,\bm{V}_{2})$

    $\bm{V}_{2}\leftarrow\bm{V}_{2}\,\bm{Z}_{s}$

end for

return $\bm{V}_{2}$
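A minimal sketch of Algorithm 1 is given below. It is an assumption of this sketch that the nullspace routine uses exact rational Gaussian elimination, which yields a basis that is not orthonormal; a numerical implementation would typically use an orthogonal basis instead. The function names `null`, `matmul` and `recursive_null` are illustrative.

```python
from fractions import Fraction


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def null(A, n):
    """Basis (columns of an n x r matrix) of the nullspace of A with n columns."""
    M = [[Fraction(v) for v in row] for row in A]
    pivots, r = [], 0
    for c in range(n):  # reduce M to reduced row echelon form
        p = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if p is None:
            continue
        M[r], M[p] = M[p], M[r]
        M[r] = [v / M[r][c] for v in M[r]]
        for i in range(len(M)):
            if i != r and M[i][c] != 0:
                M[i] = [vi - M[i][c] * vr for vi, vr in zip(M[i], M[r])]
        pivots.append(c)
        r += 1
    basis = []
    for f in (c for c in range(n) if c not in pivots):  # one vector per free column
        v = [Fraction(0)] * n
        v[f] = Fraction(1)
        for row, pc in zip(M, pivots):
            v[pc] = -row[f]
        basis.append(v)
    return [list(col) for col in zip(*basis)] if basis else [[] for _ in range(n)]


def recursive_null(blocks, n):
    """Algorithm 1: nullspace basis of the stacked block rows, never forming A."""
    V = null(blocks[0], n)
    for A_s in blocks[1:]:
        Z = null(matmul(A_s, V), len(V[0]) if V else 0)
        V = matmul(V, Z)
    return V


# 2 x 2 matrices whose entries sum to zero (block 1) and that are symmetric (block 2).
blocks = [[[1, 1, 1, 1]], [[0, 1, -1, 0]]]
V = recursive_null(blocks, 4)
residual = [matmul(A_s, V) for A_s in blocks]  # every entry should vanish
```

Each pass only needs one block row and the current basis, so the full matrix $\bm{A}$ is never stored.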

4 Explicit covariance matrix construction for permutation-invariant tensors

Computing the covariance matrix $\bm{P}_{0}$ via Theorem 3.1 requires a basis for the nullspace of $\bm{A}$ . For $\bm{P}$ -invariant tensors it is possible to derive an explicit formula for $\bm{P}_{0}$ as a function of the permutation matrix $\bm{P}$ , which enables efficient sampling of the prior. Before we can state the main result in Theorem 4.6, we first need to discuss some facts about permutation matrices. An important concept tied to a permutation matrix is its order. Any permutation can be written as a product of disjoint cycles. Each cycle has a particular length, also called the order of the cycle. In this article $K$ will denote the least common multiple of the orders of all disjoint cycles of a given permutation.

Definition 4.1.

The order $K\in\mathbb{N}$ of a permutation matrix $\bm{P}$ is defined as the smallest natural number such that $\bm{P}^{K}=\bm{I}$ .
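The order of a permutation follows directly from its cycle decomposition. The sketch below assumes that the permutation is stored as a 0-based index vector `p` that maps position `i` to position `p[i]`. The function names are illustrative.

```python
from math import gcd


def cycle_lengths(p):
    """Lengths of the disjoint cycles of the permutation p (0-based index vector)."""
    seen, lengths = [False] * len(p), []
    for start in range(len(p)):
        if not seen[start]:
            length, i = 0, start
            while not seen[i]:  # walk along the cycle that contains start
                seen[i] = True
                i = p[i]
                length += 1
            lengths.append(length)
    return lengths


def permutation_order(p):
    """Smallest K with P^K = I: the least common multiple of the cycle lengths."""
    K = 1
    for n in cycle_lengths(p):
        K = K * n // gcd(K, n)
    return K


# A 3-cycle on positions 0, 1, 2 and a 2-cycle on positions 3, 4.
lengths = cycle_lengths([1, 2, 0, 4, 3])
K = permutation_order([1, 2, 0, 4, 3])
```

Fixed points count as cycles of length 1 and therefore do not influence $K$ .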

The following lemma shows that skew- $\bm{P}$ -invariant structures always have an even order $K$ .

Lemma 4.2.

A skew- $\bm{P}$ -invariant structure has an even order $K$ .

Proof 4.3.

Skew- $\bm{P}$ -invariance requires by definition that $\lambda=-1$ . For a nonzero $\bm{w}$ it follows from $\bm{P}^{K}\,\bm{w}=\bm{I}\,\bm{w}=(-1)^{K}\,\bm{w}$ that $(-1)^{K}=1$ , which proves the claim.

Theorem 4.6 will express the desired covariance matrix $\bm{P}_{0}$ as a function of powers of the permutation matrix $\bm{P}$ . The following two lemmas relating powers of permutation matrices are easily proved.

Lemma 4.4.

Let $\bm{P}$ be a permutation matrix of order $K$ , then for any $1\leq k\leq K$ :

(9)

\displaystyle\bm{P}^{k}=\bm{P}^{K+k}.

Lemma 4.5.

Let $\bm{P}$ be a permutation matrix of order $K$ , then for any $1\leq k\leq K$ :

(10)

\displaystyle\bm{P}^{K-k}=\left(\bm{P}^{k}\right)^{T}.

Lemma 4.4 follows from $\bm{P}^{K}=\bm{I}$ . Lemma 4.5 follows from the orthogonality of permutation matrices and from the fact that powers of permutation matrices are still permutation matrices. We now have all ingredients to describe the main result that provides an analytic solution for the covariance matrix $\bm{P}_{0}$ as an average over powers of the permutation matrix $\bm{P}$ .

Theorem 4.6.

Let $\bm{P}$ be a permutation matrix of order $K$ . The Gaussian distribution of $\bm{P}$ -invariant tensors $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ is described by a mean vector $\bm{w}_{0}$ that is $\bm{P}$ -invariant and covariance matrix

(11)

\displaystyle\bm{P}_{0}=\frac{\bm{P}+\bm{P}^{2}+\cdots+\bm{P}^{K}}{K}.

The $\bm{P}$ -invariance of the mean $\bm{w}_{0}$ follows directly from Theorem 3.1. The proof of Theorem 4.6 therefore requires showing that $\bm{P}_{0}$ in (11) is the desired covariance matrix. A matrix $\bm{P}_{0}$ is a covariance matrix if it satisfies the following three sufficient conditions:

1. has positive diagonal entries,

2. is symmetric,

3. is positive (semi-)definite.

Short proofs will now be given for each of these three covariance conditions.

Lemma 4.7.

The matrix $\bm{P}_{0}$ has positive diagonal entries.

Proof 4.8.

$\bm{P}_{0}$ is by definition a scaled sum of permutation matrices, so all diagonal entries of $\bm{P}_{0}$ are either zero or positive. Since $\bm{P}^{K}=\bm{I}$ is one of the terms of this sum, the diagonal entries are guaranteed to be positive.

Lemma 4.9.

The matrix $\bm{P}_{0}$ is symmetric.

Proof 4.10.

The symmetry of $\bm{P}_{0}$ follows from

	$\displaystyle\bm{P}_{0}^{T}$	$\displaystyle=\frac{\bm{P}^{T}+(\bm{P}^{2})^{T}+\cdots+(\bm{P}^{K-1})^{T}+(\bm{P}^{K})^{T}}{K},$
		$\displaystyle=\frac{\bm{P}^{K-1}+\bm{P}^{K-2}+\cdots+\bm{P}+\bm{P}^{K}}{K},$
		$\displaystyle=\bm{P}_{0},$

where the second line follows from Lemma 4.5.

The positive semi-definiteness of $\bm{P}_{0}$ follows from its idempotency.

Lemma 4.11.

The matrix $\bm{P}_{0}$ is idempotent, that is $\bm{P}_{0}^{2}=\bm{P}_{0}$ .

Proof 4.12.

Writing out $(K\bm{P}_{0})^{2}$ in terms of $\bm{P}$ and applying Lemma 4.4 results in

	$\displaystyle(\bm{P}+\bm{P}^{2}+\cdots+\bm{P}^{K})^{2}$
	$\displaystyle=\bm{P}^{2}+2\;\bm{P}^{3}+\cdots+(K-1)\;\bm{P}^{K}+K\;\bm{P}^{K+1}+(K-1)\;\bm{P}^{K+2}+\cdots+2\;\bm{P}^{2K-1}+\bm{P}^{2K},$
	$\displaystyle=K\;\bm{P}+\underbrace{\bm{P}^{2}+(K-1)\;\bm{P}^{K+2}}_{K\;\bm{P}^{2}}+\cdots+\underbrace{2\;\bm{P}^{2K-1}+(K-2)\;\bm{P}^{K-1}}_{K\;\bm{P}^{K-1}}+\underbrace{(K-1)\;\bm{P}^{K}+\;\bm{P}^{2K}}_{K\bm{P}^{K}},$
	$\displaystyle=K\;(\bm{P}+\bm{P}^{2}+\bm{P}^{3}+\cdots+\bm{P}^{K}),$
	$\displaystyle=K^{2}\;\bm{P}_{0},$

which proves that $\bm{P}_{0}$ is idempotent.
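The symmetry and idempotency of the matrix in (11) can be checked on a small example. The sketch below builds dense matrices in exact rational arithmetic, which is only sensible for tiny permutations. It assumes that the permutation is given as a 0-based index vector `p` with $(\bm{P}\,\bm{x})_{i}=x_{p_{i}}$ , and the helper names are illustrative.

```python
from fractions import Fraction


def perm_matrix(p):
    """Matrix P with (P x)[i] = x[p[i]] for a 0-based index vector p."""
    n = len(p)
    return [[1 if j == p[i] else 0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def average_of_powers(p, K):
    """P_0 = (P + P^2 + ... + P^K) / K as in (11)."""
    n = len(p)
    P = perm_matrix(p)
    power = P
    total = [[0] * n for _ in range(n)]
    for _ in range(K):
        total = [[t + q for t, q in zip(tr, qr)] for tr, qr in zip(total, power)]
        power = matmul(power, P)
    return [[Fraction(t, K) for t in row] for row in total]


p = [1, 2, 0, 4, 3]  # a 3-cycle and a 2-cycle, hence K = 6
P0 = average_of_powers(p, 6)
is_symmetric = all(P0[i][j] == P0[j][i] for i in range(5) for j in range(5))
is_idempotent = matmul(P0, P0) == P0
```

For this permutation $\bm{P}_{0}$ is block diagonal: a $3\times 3$ block with all entries $1/3$ and a $2\times 2$ block with all entries $1/2$ , so it averages the entries within each cycle.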

The first consequence of $\bm{P}_{0}$ being idempotent is that it is positive semi-definite.

Lemma 4.13.

The matrix $\bm{P}_{0}$ is positive semi-definite.

Proof 4.14.

The two eigenvalue equations

\displaystyle\bm{P}_{0}\,\bm{v}=\lambda\,\bm{v}\quad,\quad(\bm{P}_{0})^{2}\,\bm{v}=\lambda^{2}\,\bm{v}

are actually equal due to $\bm{P}_{0}$ being idempotent. It therefore follows that $\lambda^{2}-\lambda=0$ , which implies that the eigenvalues are either 0 or 1. This proves the positive semi-definiteness of $\bm{P}_{0}$ .

Having proved that $\bm{P}_{0}$ is a covariance matrix it remains to show that samples drawn from $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ are $\bm{P}$ -invariant. From its symmetry and idempotency it follows that $\bm{P}_{0}$ is its own matrix square root $\bm{P}_{0}=\sqrt{\bm{P}_{0}}=\bm{P}_{0}^{T}=\sqrt{\bm{P}_{0}}^{T}$ .

Lemma 4.15.

Every sample $\bm{w}$ drawn from $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ is $\bm{P}$ -invariant.

Proof 4.16.

A sample $\bm{w}$ from $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ can be drawn by computing

\displaystyle\bm{w}=\bm{w}_{0}+\sqrt{\bm{P}_{0}}\;\bm{x},

where $\bm{x}$ is drawn from a standard normal distribution $\mathcal{N}(\bm{0},\bm{I})$ . The $\bm{P}$ -invariance of $\bm{w}$ follows from

	$\displaystyle\bm{w}$	$\displaystyle=\bm{P}\;\bm{w},$
	$\displaystyle\bm{w}_{0}+\sqrt{\bm{P}_{0}}\;\bm{x}$	$\displaystyle=\bm{P}\;\bm{w}_{0}+\bm{P}\;\sqrt{\bm{P}_{0}}\;\bm{x},$
	$\displaystyle\sqrt{\bm{P}_{0}}\;\bm{x}$	$\displaystyle=\bm{P}\;\sqrt{\bm{P}_{0}}\;\bm{x},$
	$\displaystyle\sqrt{\bm{P}_{0}}\;\bm{x}$	$\displaystyle=\bm{P}\;\left(\frac{\bm{P}+\bm{P}^{2}+\cdots+\bm{P}^{K-1}+\bm{P}^{K}}{K}\right)\;\bm{x},$
	$\displaystyle\sqrt{\bm{P}_{0}}\;\bm{x}$	$\displaystyle=\left(\frac{\bm{P}^{2}+\bm{P}^{3}+\cdots+\bm{P}^{K}+\bm{P}}{K}\right)\;\bm{x},$
	$\displaystyle\sqrt{\bm{P}_{0}}\;\bm{x}$	$\displaystyle=\sqrt{\bm{P}_{0}}\;\bm{x}.$

The terms that depend on $\bm{w}_{0}$ cancel due to the $\bm{P}$ -invariance of $\bm{w}_{0}$ . Lemma 4.4 is used to go from line 4 to line 5.

Lemmas 4.7 through 4.15 constitute the proof of Theorem 4.6. Another consequence of the idempotency of $\bm{P}_{0}$ is that this matrix is its own pseudoinverse.

Lemma 4.17.

The pseudoinverse $\bm{P}_{0}^{\dagger}$ satisfies

\displaystyle\bm{P}_{0}^{\dagger}=\bm{P}_{0}.

Proof 4.18.

The pseudoinverse $\bm{P}_{0}^{\dagger}$ needs to satisfy the following four properties:

1. $\bm{P}_{0}\bm{P}_{0}^{\dagger}\bm{P}_{0}=\bm{P}_{0}$ ,

2. $\bm{P}_{0}^{\dagger}\bm{P}_{0}\bm{P}_{0}^{\dagger}=\bm{P}_{0}^{\dagger}$ ,

3. $(\bm{P}_{0}\bm{P}_{0}^{\dagger})^{T}=\bm{P}_{0}\bm{P}_{0}^{\dagger}$ ,

4. $(\bm{P}_{0}^{\dagger}\bm{P}_{0})^{T}=\bm{P}_{0}^{\dagger}\bm{P}_{0}$ .

All these properties are satisfied when assuming $\bm{P}_{0}^{\dagger}=\bm{P}_{0}$ and they follow from the idempotency of $\bm{P}_{0}$ . For example, Properties 1 and 2 follow from

\displaystyle\bm{P}_{0}\bm{P}_{0}^{\dagger}\bm{P}_{0}=\bm{P}_{0}^{\dagger}\bm{P}_{0}\bm{P}_{0}^{\dagger}=(\bm{P}_{0})^{3}=\bm{P}_{0}=\bm{P}_{0}^{\dagger}.

Properties 3 and 4 follow from the symmetry of $\bm{P}_{0}$ .

The fact that $\bm{P}_{0}=\sqrt{\bm{P}_{0}}=\bm{P}_{0}^{\dagger}=\sqrt{\bm{P}_{0}^{\dagger}}$ is convenient for several reasons. First, no explicit $\bm{P}_{0}^{-1}$ computation is required in equations (3) and (4). Second, sampling $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ can be done without a matrix square-root computation and without any matrix-vector multiplications. Using Theorem 4.6 the product $\sqrt{\bm{P}_{0}}\,\bm{x}=\bm{P}_{0}\,\bm{x}$ can be implemented as a weighted sum of permuted versions of $\bm{x}$

\displaystyle\frac{\bm{P}\,\bm{x}+\bm{P}^{2}\,\bm{x}+\cdots+\bm{P}^{K}\,\bm{x}}{K}.

All information of the permutation $\bm{P}$ is contained in a vector $\bm{p}$ of $J^{D}$ elements that specifies how each entry gets mapped to the next. Each term $\bm{P}^{k}\,\bm{x}$ of the weighted sum is then computed by successive permutations of $\bm{x}$ according to $\bm{p}$ with computational complexity $O(J^{D})$ . The pseudocode for sampling the distribution is given in Algorithm 2.

Algorithm 2 Generate $\bm{P}$ -invariant sample from $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$

Require: $\bm{w}_{0}$ , index permutation vector $\bm{p}$

$\bm{x}\leftarrow\textrm{randn}(J^{D})$ % sample standard normal $\mathcal{N}(\bm{0},\bm{I})$

$\bm{w}\leftarrow K\,\bm{w}_{0}$

for $k=1:K$ do

    $\bm{w}\leftarrow\bm{w}+\bm{x}$

    $\bm{x}\leftarrow\bm{x}[\bm{p}]$ % permute entries of $\bm{x}$ according to $\bm{p}$

end for

$\bm{w}\leftarrow\frac{\bm{w}}{K}$

return $\bm{w}$
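A direct transcription of Algorithm 2 into Python could look as follows. This is a sketch under the assumption that $\bm{p}$ is a 0-based index vector and that permuting means $x_{i}\leftarrow x_{p_{i}}$ . The function name `sample_invariant` and the example values are illustrative.

```python
import random


def sample_invariant(w0, p, K, rng=random):
    """Draw w = w0 + P_0 x with x ~ N(0, I) using only index permutations."""
    n = len(w0)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]  # standard normal sample
    w = [K * v for v in w0]
    for _ in range(K):
        w = [wi + xi for wi, xi in zip(w, x)]    # accumulate the current P^k x
        x = [x[i] for i in p]                    # permute x according to p
    return [wi / K for wi in w]


random.seed(0)
p = [0, 2, 1, 3]            # vectorized transpose of a 2 x 2 matrix, K = 2
w0 = [1.0, 0.5, 0.5, -1.0]  # P-invariant mean, a symmetric matrix
w = sample_invariant(w0, p, K=2)
```

The loop accumulates $\bm{x},\bm{P}\bm{x},\ldots,\bm{P}^{K-1}\bm{x}$ , which is the same sum as in Theorem 4.6 because $\bm{P}^{K}=\bm{I}$ . The returned sample is a symmetric $2\times 2$ matrix in vectorized form.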

A result similar to Theorem 4.6 can be proven for skew- $\bm{P}$ -invariant tensors.

Theorem 4.19.

For a permutation of even order $K$ , the Gaussian distribution of skew- $\bm{P}$ -invariant tensors $\mathcal{N}(\bm{w}_{0},\bm{P}_{0})$ is described by a mean vector $\bm{w}_{0}$ that is skew- $\bm{P}$ -invariant and covariance matrix

(12)

\displaystyle\bm{P}_{0}:=\frac{-\bm{P}+\bm{P}^{2}-\cdots+\bm{P}^{K}}{K}=\frac{\sum_{k=1}^{K}\;(-1)^{k}\,\bm{P}^{k}}{K}.

Proof 4.20.

The proof is very similar to that of Theorem 4.6. The diagonal entries being nonnegative can be derived from the following argument. The permutation matrix $\bm{P}$ itself consists of cyclic permutations, with either even or odd order. If a cyclic permutation has an even order $k$ , then $\bm{P}^{k}$ will have ones on the diagonal for elements of the cycle. This cycle will occur $K/k$ times in (12), always with a positive sign. If a cyclic permutation has odd order $k$ , then the diagonal entries of $\bm{P}^{k}$ will come in equal amounts of $K/(2k)$ negative and $K/(2k)$ positive contributions, which results in a zero contribution to the diagonal. The total effect of all cyclic permutations then adds up to either zero or positive diagonal entries. Symmetry is proven by using Lemma 4.5 and the fact that $K$ is even: an even power $k$ gets mapped to the even power $K-k$ and an odd power $k$ gets mapped to the odd power $K-k$ . Hence,

\displaystyle\bm{P}_{0}^{T}=\frac{\sum_{k=1}^{K}\;(-1)^{k}\,(\bm{P}^{k})^{T}}{K}=\frac{\sum_{k=1}^{K}\;(-1)^{k}\,\bm{P}^{K-k}}{K}=\bm{P}_{0}.

The idempotency of $\bm{P}_{0}$ is proven in the same way as for the case of $\bm{P}$-invariance. Writing out $(K\bm{P}_{0})^{2}$ in terms of $\bm{P}$ and applying Corollary 4.4 results in

	$\displaystyle(-\bm{P}+\bm{P}^{2}-\cdots+\bm{P}^{K})^{2}$
	$\displaystyle=\bm{P}^{2}-2\;\bm{P}^{3}+\cdots+(K-1)\;\bm{P}^{K}-K\;\bm{P}^{K+1}+(K-1)\;\bm{P}^{K+2}-\cdots-2\;\bm{P}^{2K-1}+\bm{P}^{2K}$
	$\displaystyle=-K\;\bm{P}+\underbrace{\bm{P}^{2}+(K-1)\;\bm{P}^{K+2}}_{K\;\bm{P}^{2}}-\cdots\underbrace{-2\;\bm{P}^{2K-1}-(K-2)\;\bm{P}^{K-1}}_{-K\;\bm{P}^{K-1}}+\underbrace{(K-1)\;\bm{P}^{K}+\;\bm{P}^{2K}}_{K\bm{P}^{K}}$
	$\displaystyle=K\;(-\bm{P}+\bm{P}^{2}-\bm{P}^{3}+\cdots+\bm{P}^{K})$
	$\displaystyle=K^{2}\;\bm{P}_{0}$

which proves that $\bm{P}_{0}$ is idempotent.
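The properties claimed in Theorem 4.19 can be checked numerically on a small case. The following Python sketch builds $\bm{P}_{0}$ from (12) for the transposition of a $2\times 2$ matrix ($K=2$) and squares it; the helper names `perm_matrix`, `matmul` and `skew_covariance` are hypothetical and the dense list-of-lists representation is chosen only for readability.

```python
def perm_matrix(p):
    """Permutation matrix P with (P x)[i] = x[p[i]] for a zero-based index vector p."""
    n = len(p)
    return [[1.0 if p[i] == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def skew_covariance(p, K):
    """P_0 = (1/K) * sum_{k=1}^{K} (-1)^k P^k, as in (12)."""
    n = len(p)
    P = perm_matrix(p)
    power = [[float(i == j) for j in range(n)] for i in range(n)]  # P^0 = I
    total = [[0.0] * n for _ in range(n)]
    for k in range(1, K + 1):
        power = matmul(power, P)                 # now holds P^k
        sign = 1.0 if k % 2 == 0 else -1.0
        total = [[t + sign * q for t, q in zip(tr, qr)]
                 for tr, qr in zip(total, power)]
    return [[t / K for t in row] for row in total]

P0 = skew_covariance([0, 2, 1, 3], K=2)   # transposition of a 2 x 2 matrix
P0_squared = matmul(P0, P0)
```

In this example $\bm{P}_{0}=(\bm{I}-\bm{P})/2$, which is symmetric, has nonnegative diagonal entries and satisfies $\bm{P}_{0}^{2}=\bm{P}_{0}$.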

Theorems 4.6 and 4.19 are practical when the order $K$ of the permutation matrix $\bm{P}$ stays small compared to $J$ and $D$. For Hankel structures this is unfortunately not the case. Consider for example a $20\times 20$ Hankel matrix. Its corresponding permutation matrix has permutation cycles ranging from length 1 up to 20 and $K$ is therefore the least common multiple of $1,2,\ldots,20$, which equals $232,792,560$. Fortunately, it is possible to explicitly construct a sparse matrix $\bm{V}$ with orthogonal columns such that $\sqrt{\bm{P}_{0}}=\bm{V}$.

5 Sparse square root covariance matrix construction for permutation-invariant tensors

Every permutation $\bm{P}$ can be decomposed into $R$ cyclic permutations. These cyclic permutations partition the set of all tensor entries into $R$ disjoint sets and allow for an alternative construction of $\sqrt{\bm{P}_{0}}$, where the resulting matrix is sparse and consists of orthogonal columns.

Theorem 5.1.

Let $\bm{P}$ be a permutation matrix that consists of $R$ permutation cycles and let $C_{r}$ denote the $r$th cycle, where the number of tensor entries in $C_{r}$ is denoted $|C_{r}|$. Then the columns of the matrix $\bm{V}\in\mathbb{R}^{J^{D}\times R}$ with entries

(13)

v_{\overline{j_{1},j_{2},\ldots,j_{D}},r}=\begin{cases}\frac{1}{\sqrt{|C_{r}|}}&\text{if }w_{j_{1},j_{2},\ldots,j_{D}}\in{C}_{r},\\ 0&\text{otherwise,}\end{cases}

span the eigenspace of $\bm{P}$ corresponding to the eigenvalue $\lambda=1$. In other words, $\bm{V}=\sqrt{\bm{P}_{0}}$. Also, $\bm{V}^{T}\bm{V}=\bm{I}_{R}$.

Proof 5.2.

The equality $\bm{P}\bm{V}=\bm{V}$ follows from each column of $\bm{V}$ containing identical nonzero values at the tensor entries of one particular permutation cycle of $\bm{P}$. The orthogonality follows directly from the permutation cycles being disjoint and each column of $\bm{V}$ being unit-norm due to the scaling by $1/\sqrt{|C_{r}|}$.

A basis for the $\bm{P}$-skew-invariant eigenspace can be built in a similar way by retaining the cycles of even order and alternating the sign of the entries $v_{\overline{j_{1},j_{2},\ldots,j_{D}},r}$ in each column.

Example 5.3.

Consider a $20\times 20$ Hankel matrix. Using Theorem 4.6 one would need to construct the $400\times 400$ Hankel permutation matrix $\bm{H}$ and construct $\bm{P}_{0}$ by adding $232,792,560$ terms together. Using Theorem 5.1 the sparse $400\times 39$ matrix $\bm{V}$, which contains only 400 nonzero entries, can be constructed directly.
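The numbers in this example can be reproduced with a short Python sketch of the construction in Theorem 5.1. It is an illustration under stated assumptions rather than the notebook implementation: entries are addressed by a zero-based column-major index, $\bm{V}$ is stored as a list of (row, column, value) triplets, and the names `hankel_cycles` and `sparse_sqrt_cov` are hypothetical.

```python
import math
from functools import reduce

def hankel_cycles(J):
    """Permutation cycles of a J x J Hankel matrix: one cycle per antidiagonal.

    Entries are addressed by their zero-based column-major position j1 + J * j2.
    """
    cycles = {}
    for j2 in range(J):
        for j1 in range(J):
            cycles.setdefault(j1 + j2, []).append(j1 + J * j2)
    return [cycles[s] for s in sorted(cycles)]

def sparse_sqrt_cov(cycles):
    """Nonzero entries of V from (13) as (row, column, value) triplets."""
    return [(row, r, 1.0 / math.sqrt(len(cycle)))
            for r, cycle in enumerate(cycles) for row in cycle]

cycles = hankel_cycles(20)
V = sparse_sqrt_cov(cycles)
# Order K of the permutation: least common multiple of all cycle lengths.
K = reduce(lambda a, b: a * b // math.gcd(a, b), (len(c) for c in cycles))
```

Here `len(cycles)` is the number of columns $R=39$, `len(V)` is the number of nonzeros (400), and `K` is the number of terms that Theorem 4.6 would require.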

6 Solving the inverse problem

In this section three aspects of solving the inverse problem are discussed. First, we explain how the prior covariance matrices of $(\bm{A},\bm{b})$-constrained tensors can be parameterized. Second, we briefly discuss a change of variables, originally proposed in [8], to exploit fast implementations of the matrix vector product $\bm{P}_{0}\bm{w}$. The third aspect relates to kernel methods, where $(\bm{A},\bm{b})$-constrained tensor priors are used to define new structured tensor kernel functions.

6.1 Parameterizing the prior covariance matrix

The covariance matrix $\bm{P}_{0}$ as described in Theorems 3.1, 4.6 and 5.1 encodes the structure of the $(\bm{A},\bm{b})$-constrained tensor without having any free parameters to quantify the importance of the prior $p(\bm{w})$ relative to the likelihood $p(\bm{y}|\bm{w})$. Such free parameters are often called hyperparameters. Suppose for example that through Theorem 3.1 an orthogonal basis $\bm{V}_{2}\in\mathbb{R}^{J^{D}\times R}$ for the nullspace of $\bm{A}$ is computed from its singular value decomposition (SVD)

\displaystyle\bm{A}=\begin{pmatrix}\bm{U}_{1}&\bm{U}_{2}\end{pmatrix}\;\begin{pmatrix}\bm{S}&\bm{0}\\ \bm{0}&\bm{0}\end{pmatrix}\;\begin{pmatrix}\bm{V}_{1}^{T}\\ \bm{V}_{2}^{T}\end{pmatrix}.

A desired square-root covariance matrix $\sqrt{\bm{P}_{0}}$ is then $\bm{V}_{2}\,\bm{T}$ , where $\bm{T}\in\mathbb{R}^{R\times R}$ is any invertible matrix. The nullity $R$ of $\bm{A}$ can be interpreted as the total number of distinct elements in the $(\bm{A},\bm{b})$ -constrained tensor $\bm{\mathcal{W}}$ . The $\bm{T}$ matrix can be interpreted as the square-root covariance matrix of those $R$ variables since

\displaystyle\bm{P}_{0}=\sqrt{\bm{P}_{0}}\;(\sqrt{\bm{P}_{0}})^{T}=\bm{V}_{2}\;\left(\bm{T}\,\bm{T}^{T}\right)\;\bm{V}_{2}^{T}.

The matrix $\bm{V}_{2}$ is then to be understood as “projecting” the covariance matrix $\bm{T}\bm{T}^{T}$ of the $R$ underlying variables to the $J^{D}$ entries of the $(\bm{A},\bm{b})$-constrained tensor $\bm{\mathcal{W}}$. Parameterizing $\bm{T}$ in terms of a single hyperparameter $\sigma\in\mathbb{R}^{+}$ as $\bm{T}=\sigma\;\bm{I}$ implies that these $R$ variables are independent and have equal variance $\sigma^{2}$. Correlations between the $R$ variables can be modeled by for example parameterizing $\bm{T}$ as a lower triangular matrix. The values of these hyperparameters can be learned from data through cross-validation, marginal likelihood optimization or a hierarchical Bayesian approach [27, 32].
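The role of $\bm{T}$ can be illustrated with a small Python sketch that draws a zero-mean sample $\bm{w}=\bm{V}\,\bm{T}\,\bm{x}$ with $\bm{x}\sim\mathcal{N}(\bm{0},\bm{I})$. As an assumption made for brevity, the cycle-based basis of Theorem 5.1 is used in place of an SVD basis $\bm{V}_{2}$, and the function name `sample_structured` as well as the value of `sigma` are hypothetical.

```python
import math
import random

def sample_structured(cycles, T, rng=random):
    """Sample w = V T x with x ~ N(0, I), where V is built from the cycles as in (13)."""
    R = len(cycles)
    x = [rng.gauss(0.0, 1.0) for _ in range(R)]
    z = [sum(T[r][s] * x[s] for s in range(R)) for r in range(R)]  # z = T x
    w = [0.0] * sum(len(c) for c in cycles)
    for r, cycle in enumerate(cycles):
        for row in cycle:
            w[row] = z[r] / math.sqrt(len(cycle))   # column r of V scaled by z[r]
    return w

# 2 x 2 Hankel matrix in zero-based column-major order: antidiagonals {0}, {1, 2}, {3}.
cycles = [[0], [1, 2], [3]]
sigma = 0.1
T = [[sigma if r == s else 0.0 for s in range(3)] for r in range(3)]  # T = sigma * I
w = sample_structured(cycles, T)
```

Replacing `T` by a lower triangular matrix introduces correlations between the $R$ distinct entries without changing the sampling code.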

6.2 Change of variables

Squaring the condition number when solving the normal equation of (3) can be avoided by solving its square-root version

\displaystyle\begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}\bm{\Phi}\\ \sqrt{\bm{P}_{0}^{-1}}\end{pmatrix}\;\bm{w}_{+}=\begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}{\bm{y}}\\ \sqrt{\bm{P}_{0}^{-1}}\,\bm{w}_{0}\end{pmatrix}

instead. When the square root of the inverse prior covariance matrix is difficult to construct, a change of variables can be used to avoid its construction [8]. By defining $\bm{x}:=\bm{P}_{0}^{-1}\,(\bm{w}_{+}-\bm{w}_{0})$ and $\bm{z}:=\bm{y}-\bm{\Phi}\bm{w}_{0}$ the square-root linear system is transformed into

\displaystyle\begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}\bm{\Phi}\bm{P}_{0}\\ \bm{I}\end{pmatrix}\;\bm{x}=\begin{pmatrix}\sqrt{\bm{\Sigma}^{-1}}{\bm{z}}\\ 0\end{pmatrix}.

The desired posterior mean $\bm{w}_{+}$ can then be recovered from $\bm{w}_{+}=\bm{P}_{0}\,\bm{x}+\bm{w}_{0}$ . This formulation is especially beneficial when the matrix vector product $\bm{P}_{0}\,\bm{x}$ can be implemented in a computationally efficient manner, for example using Algorithm 2.

6.3 Structured tensor kernel functions

When the number of entries of the tensor $\bm{\mathcal{W}}$ is much larger than the data size $N$, the $O(J^{3D})$ computational complexity of computing (3) can be replaced with a complexity of at least $O(N^{2})$ by solving the corresponding dual problem

\displaystyle(\bm{\Phi}\,\bm{P}_{0}\,\bm{\Phi}^{T}+\bm{\Sigma})\;\bm{v}=\bm{y}.

An additional benefit is that no matrix inverse of $\bm{P}_{0}$ is required so that Theorems 3.1, 4.6 and 5.1 can be applied directly. The matrix $\bm{\Phi}\,\bm{P}_{0}\,\bm{\Phi}^{T}$ is called the kernel matrix $\bm{K}$ and each entry $k_{i,j}$ is by definition the evaluation of a kernel function

k_{i,j}=k(\bm{x}_{i},\bm{x}_{j}):=\bm{\varphi}(\bm{x}_{i})^{T}\,\bm{P}_{0}\;\bm{\varphi}(\bm{x}_{j}).

Choosing $\bm{P}_{0}$ as a covariance matrix of an $(\bm{A},\bm{b})$-constrained tensor allows us to define new kernel functions. The kernel trick in machine learning refers to the fact that the kernel function can be evaluated without ever explicitly computing the possibly large feature vectors $\bm{\varphi}(\cdot)$. In the case of $\bm{P}$-invariant tensors one can exploit the particular structure of $\bm{P}_{0}$ as described in Theorem 4.6 or use Algorithm 2 to achieve this goal.

Example 6.1.

(Centrosymmetric polynomial kernel) Let $\sqrt{c}\in\mathbb{R}$ and $d\in\mathbb{N}$ . The polynomial kernel function is defined as

	$\displaystyle k(\bm{x}_{i},\bm{x}_{j})$	$\displaystyle=\bm{\varphi}(\bm{x}_{i})^{T}\;\bm{I}\;\bm{\varphi}(\bm{x}_{j}),$
		$\displaystyle=\underbrace{\begin{pmatrix}\sqrt{c}&\bm{x}_{i}^{T}\end{pmatrix}\otimes\cdots\otimes\begin{pmatrix}\sqrt{c}&\bm{x}_{i}^{T}\end{pmatrix}}_{d\textrm{ times}}\;\bm{I}\;\underbrace{\begin{pmatrix}\sqrt{c}&\bm{x}_{j}^{T}\end{pmatrix}^{T}\otimes\cdots\otimes\begin{pmatrix}\sqrt{c}&\bm{x}_{j}^{T}\end{pmatrix}^{T}}_{d\textrm{ times}}$
		$\displaystyle=(c+\bm{x}_{i}^{T}\,\bm{x}_{j})^{d}.$

The expression $(c+\bm{x}_{i}^{T}\,\bm{x}_{j})^{d}$ is obtained from writing the identity matrix $\bm{I}$ as a Kronecker product of smaller identity matrices and applying the mixed product property. The polynomial kernel function can therefore be interpreted as using a unit covariance matrix $\bm{P}_{0}$ . We can now define the centrosymmetric polynomial kernel function $k_{2}$ by using the polynomial feature vectors $\bm{\varphi}(\cdot)$ and replacing $\bm{I}$ with the covariance matrix of centrosymmetric tensors. From Theorem 4.6 it then follows that

	$\displaystyle k_{2}(\bm{x}_{i},\bm{x}_{j})$	$\displaystyle=\bm{\varphi}(\bm{x}_{i})^{T}\;\bm{P}_{0}\;\bm{\varphi}(\bm{x}_{j}),$
		$\displaystyle=\frac{1}{2}\,\bm{\varphi}(\bm{x}_{i})^{T}\;{(\bm{I}+\bm{J})}\;\bm{\varphi}(\bm{x}_{j}),$
		$\displaystyle=\frac{1}{2}\,\underbrace{\begin{pmatrix}\sqrt{c}&\bm{x}_{i}^{T}\end{pmatrix}\otimes\cdots\otimes\begin{pmatrix}\sqrt{c}&\bm{x}_{i}^{T}\end{pmatrix}}_{d\textrm{ times}}\;(\bm{I}+\bm{J})\;\underbrace{\begin{pmatrix}\sqrt{c}&\bm{x}_{j}^{T}\end{pmatrix}^{T}\otimes\cdots\otimes\begin{pmatrix}\sqrt{c}&\bm{x}_{j}^{T}\end{pmatrix}^{T}}_{d\textrm{ times}},$
		$\displaystyle=\frac{1}{2}(c+\bm{x}_{i}^{T}\bm{x}_{j})^{d}+\frac{1}{2}\left(\begin{pmatrix}\sqrt{c}&\bm{x}_{i}^{T}\end{pmatrix}\bm{J}_{d}\begin{pmatrix}\sqrt{c}&\bm{x}_{j}^{T}\end{pmatrix}^{T}\right)^{d}.$

Also here the explicit construction of $\bm{\varphi}(\cdot)$ is avoided by writing the matrix $\bm{J}\in\mathbb{R}^{J^{D}\times J^{D}}$ as a Kronecker product of the smaller permutation matrix $\bm{J}_{d}\in\mathbb{R}^{J\times J}$ with itself $d$ times and using the mixed-product property.
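The closed-form expression for $k_{2}$ can be compared against the explicit feature-space computation in a short Python sketch. It assumes that $\bm{J}$ and $\bm{J}_{d}$ are exchange matrices, that is, permutations that reverse the order of the entries, which is the usual definition of centrosymmetry; the function names and the test inputs are hypothetical.

```python
import math

def centrosymmetric_poly_kernel(xi, xj, c, d):
    """k_2(x_i, x_j) from Example 6.1, evaluated without forming phi(.)."""
    ai = [math.sqrt(c)] + list(xi)
    aj = [math.sqrt(c)] + list(xj)
    plain = sum(u * v for u, v in zip(ai, aj))              # c + x_i^T x_j
    flipped = sum(u * v for u, v in zip(ai, reversed(aj)))  # a_i^T J_d a_j
    return 0.5 * plain ** d + 0.5 * flipped ** d

def features(x, c, d):
    """Explicit feature vector: d-fold Kronecker product of (sqrt(c), x^T)."""
    a = [math.sqrt(c)] + list(x)
    phi = [1.0]
    for _ in range(d):
        phi = [u * v for u in phi for v in a]
    return phi

xi, xj, c, d = [1.0, 2.0], [0.5, -1.0], 2.0, 3
phi_i, phi_j = features(xi, c, d), features(xj, c, d)
# phi_i^T (I + J) phi_j / 2, where J reverses the order of the entries.
explicit = 0.5 * sum(u * (v + r) for u, v, r in zip(phi_i, phi_j, reversed(phi_j)))
fast = centrosymmetric_poly_kernel(xi, xj, c, d)
```

Both values agree up to rounding, while `fast` only needs two inner products of length $J$ instead of feature vectors of length $J^{d}$.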

7 Applications

In this section we demonstrate the use of Theorems 3.1, 4.6, and 5.1 in three different applications. Practical ways of sampling various $(\bm{A},\bm{b})$-constrained tensor priors are explained in Application 7.1. We consider lower triangular tensors, tensors for which the sum over the last index adds up to 1, symmetric tensors and Hankel tensors. Application 7.2 considers the problem of completing a Hankel matrix from noisy partial measurements by solving it as a Bayesian inverse problem. The estimate of the completed Hankel matrix when using a Hankel prior is compared to the estimate where no prior is used. In Application 7.3 learning a classifier for handwritten digits is solved as a Bayesian inverse problem. The classifier obtained with the commonly used Tikhonov prior is compared to several $(\bm{A},\bm{b})$-constrained tensor priors.

All applications have been implemented as reactive Pluto [28] notebooks in Julia [5] and are publicly available at https://github.com/TUDelft-DeTAIL/AbTensors. The notebook files can be freely downloaded and run on your local machine in Julia. An alternative way to use these notebooks that does not require the installation of Julia is to run them in the cloud via Binder [23]. This can be done by clicking on each of the links on the main GitHub page. Please note that it can take over 10 minutes for Binder to download and compile all required packages.

As discussed in section 6.1 we parameterized the prior covariance matrix $\bm{P}_{0}$ with a single hyperparameter $\sigma_{P}$ in both Applications 7.2 and 7.3.

7.1 Sampling structured tensor priors

In this first application we demonstrate how Theorems 3.1, 4.6 and 5.1 are used to sample the priors of different $(\bm{A},\bm{b})$ -constrained tensors.

Example 7.1.

(Lower triangular tensors) The first example of $(\bm{A},\bm{b})$-constrained tensors considered here is that of lower triangular tensors. From Definition 2.1 we know that triangular tensors are described by

\displaystyle\bm{A}=\begin{pmatrix}\bm{A}_{1}\\ \bm{A}_{2}\\ \vdots\\ \bm{A}_{D-1}\end{pmatrix}=\begin{pmatrix}\bm{S}\otimes\bm{I}_{J}\otimes\cdots\otimes\bm{I}_{J}\\ \bm{I}_{J}\otimes\bm{S}\otimes\cdots\otimes\bm{I}_{J}\\ \vdots\\ \bm{I}_{J}\otimes\bm{I}_{J}\otimes\cdots\otimes\bm{S}\end{pmatrix}\in\mathbb{R}^{\frac{(D-1)(J-1)J^{D-1}}{2}\times J^{D}}

and zero vector $\bm{b}$ . The square root of the covariance matrix is built up by applying Algorithm 1, which considers only 1 block row of $\bm{A}$ at a time. The whole $\bm{A}$ matrix is therefore never explicitly made. In the notebook it is possible to sample lower triangular tensors with orders ranging from 2 up to 5 and dimensions 2 up to 6 by moving the corresponding sliders.

Example 7.2.

(Tensors with known sum of entries) In this example we sample tensors $\bm{\mathcal{W}}$ for which the sum over the last index always adds up to a value of 1:

\displaystyle\forall j_{1},j_{2},\ldots,j_{D-1}:\sum_{j_{D}}w_{j_{1},j_{2},\ldots,j_{D}}=b_{j_{1},j_{2},\ldots,j_{D-1}}=1.

From Lemma 2.5 we know that in this case $\bm{A}=\bm{1}_{J}^{T}\otimes\bm{I}_{J}\otimes\cdots\otimes\bm{I}_{J}$ . It is straightforward to verify that a basis for the right nullspace of $\bm{A}$ is

\displaystyle\begin{pmatrix}1&1&\cdots&1\\ -1&0&\cdots&0\\ 0&-1&\cdots&0\\ 0&0&\cdots&-1\end{pmatrix}\otimes\bm{I}_{J}\otimes\cdots\otimes\bm{I}_{J}.

Sampling the prior can now be done without ever constructing a basis for the nullspace explicitly since

	$\displaystyle\sqrt{\bm{P}_{0}}\;\bm{x}$	$\displaystyle=\left(\begin{pmatrix}1&1&\cdots&1\\ -1&0&\cdots&0\\ 0&-1&\cdots&0\\ 0&0&\cdots&-1\end{pmatrix}\otimes\bm{I}_{J}\otimes\cdots\otimes\bm{I}_{J}\right)\;\bm{x}$
		$\displaystyle=\begin{pmatrix}\bm{I}_{J^{D-1}}&\bm{I}_{J^{D-1}}&\cdots&\bm{I}_{J^{D-1}}\\ -\bm{I}_{J^{D-1}}&0&\cdots&0\\ 0&-\bm{I}_{J^{D-1}}&\cdots&0\\ 0&0&\cdots&-\bm{I}_{J^{D-1}}\end{pmatrix}\;\begin{pmatrix}\bm{x}_{1}\\ \bm{x}_{2}\\ \vdots\\ \bm{x}_{J-1}\end{pmatrix}=\begin{pmatrix}\bm{x}_{1}+\bm{x}_{2}+\cdots+\bm{x}_{J-1}\\ -\bm{x}_{1}\\ -\bm{x}_{2}\\ \vdots\\ -\bm{x}_{J-1}\end{pmatrix}.$

It is therefore sufficient to sample $\bm{x}\in\mathbb{R}^{(J-1)\,J^{D-1}}$ from a standard normal distribution and do the operations on the $J-1$ partitions of $\bm{x}$ as described above to generate the desired sample. In the notebook one can change the order of the sampled tensor from 2 up to 5 and dimension from 5 up to 10 by using the corresponding sliders.
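The block operations above translate into the following Python sketch. It is an illustration and not the notebook code: the function name `sample_unit_sum` is hypothetical, and the constant mean $\bm{w}_{0}=\nicefrac{1}{J}\,\bm{1}$, which satisfies $\bm{A}\,\bm{w}_{0}=\bm{b}$, is an assumption added here so that the sample actually sums to 1 over the last index.

```python
import random

def sample_unit_sum(J, D, rng=random):
    """Sample a vectorized J^D tensor whose sums over the last index all equal 1."""
    m = J ** (D - 1)                                   # size of each partition x_k
    blocks = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(J - 1)]
    first = [sum(col) for col in zip(*blocks)]         # x_1 + ... + x_{J-1}
    w = first + [-v for block in blocks for v in block]
    return [1.0 / J + v for v in w]                    # add the assumed mean w0 = 1/J

J, D = 4, 3
w = sample_unit_sum(J, D)
m = J ** (D - 1)
# Sum over the last index j_D for every combination of the first D-1 indices.
sums = [sum(w[pos + m * jD] for jD in range(J)) for pos in range(m)]
```

Every entry of `sums` equals 1 up to rounding, and only $(J-1)\,J^{D-1}$ standard normal numbers are drawn.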

Example 7.3.

(Symmetric tensors) Symmetric tensors $\bm{\mathcal{W}}$ are tensors for which entries are invariant under any index permutation. The permutation matrix $\bm{S}$ in the symmetric case consists of cyclic permutations where each cycle contains the entry $w_{j_{1},\ldots,j_{D}}$ and all entries with corresponding index permutations $w_{\pi(j_{1},\ldots,j_{D})}$. For example, in the case $D=2$ and $J=2$ the permutation matrix $\bm{S}$ consists of $3$ cyclic permutations

\displaystyle w_{1,1}\mapsto w_{1,1},\;w_{2,1}\mapsto w_{1,2},\;w_{1,2}\mapsto w_{2,1},\;w_{2,2}\mapsto w_{2,2}.

The order $K$ of $\bm{S}$ in this case is $2$ since $\bm{S}^{2}=\bm{I}$. According to Theorem 4.6 we then have that the square root of the covariance matrix is $\sqrt{\bm{P}_{0}}=\nicefrac{{(\bm{S}+\bm{S}^{2})}}{{2}}$. When $D=3$, the order $K$ of the corresponding permutation matrix is $6$ and hence $\sqrt{\bm{P}_{0}}=\nicefrac{{(\bm{S}+\bm{S}^{2}+\bm{S}^{3}+\bm{S}^{4}+\bm{S}^{5}+\bm{S}^{6})}}{{6}}$. Sampling from these priors is done via Algorithm 2 where a standard normal sample $\bm{x}\in\mathbb{R}^{J^{D}}$ is generated and permuted $K$ times. The notebook allows you to sample symmetric tensors of orders 2 and 3 and dimensions 3 up to 10.

Example 7.4.

(Hankel tensors) Hankel tensors $\bm{\mathcal{W}}$ are tensors for which entries with a constant index sum $j_{1}+\cdots+j_{D}$ have the same numerical value. The order $K$ of the corresponding permutation matrix $\bm{P}$ grows very quickly. For example, when $D=2$ and $J=20$ the order $K$ is the least common multiple of $1,2,\ldots,20$, which equals $232,792,560$. Theorem 5.1, however, allows us to construct a matrix $\sqrt{\bm{P}_{0}}\in\mathbb{R}^{J^{D}\times R}$, where $R$ is the number of permutation cycles. For Hankel tensors we have that $R=D(J-1)+1$. The notebook allows you to sample Hankel tensors of order 2 up to 4 and dimensions 3 up to 10.
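The count $R=D(J-1)+1$ can be verified by brute force for small cases, since each permutation cycle of a Hankel tensor corresponds to one distinct index sum. The Python sketch below is only a sanity check; the function name `count_hankel_cycles` is hypothetical.

```python
from itertools import product

def count_hankel_cycles(J, D):
    """Number of distinct index sums j_1 + ... + j_D with 1 <= j_d <= J."""
    return len({sum(idx) for idx in product(range(1, J + 1), repeat=D)})

R_matrix = count_hankel_cycles(20, 2)   # 20 x 20 Hankel matrix
R_tensor = count_hankel_cycles(10, 4)   # order-4 Hankel tensor of dimension 10
```

The index sums range from $D$ to $DJ$, which gives $DJ-D+1=D(J-1)+1$ distinct values.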

7.2 Completion of a Hankel matrix from noisy measurements

Hankel matrices are very common in signal processing and control theory. In this application a Bayesian approach will be used to complete a Hankel matrix based on noisy incomplete measurements. For this we use the following forward model $\bm{y}=\bm{\Phi}\;\bm{w}+\bm{\epsilon}$, where $\bm{w}\in\mathbb{R}^{10^{2}}$ is the vectorization of the true underlying $10\times 10$ Hankel matrix. The $I\times 10^{2}$ matrix $\bm{\Phi}$ selects $I$ random entries of $\bm{w}$ with equal probability. Each row of $\bm{\Phi}$ contains a single nonzero unit-valued entry at a random location. The number of measurements $I$ can be changed through a slider in the notebook. The vector $\bm{\epsilon}$ is a vector of zero-mean Gaussian noise. Given $\bm{y}$ and $\bm{\Phi}$, a Bayesian estimate of the underlying Hankel matrix $\bm{W}$ can be obtained from (3) as the posterior mean $\bm{w}_{+}$. Another commonly used estimate is the maximum likelihood estimate, which is the $\bm{w}$ that maximizes the likelihood $p(\bm{y}|\bm{w})$. We compare two posterior estimates with the maximum likelihood estimate under two different assumptions on the noise covariance. We fix the sampling rate at $50\%$ and choose $\sigma_{\epsilon}^{2}=1$. The prior covariance matrix is set to $\sigma_{P}^{2}\,\bm{P}_{0}=10^{-6}\,\bm{P}_{0}$, where $\bm{P}_{0}$ is the covariance matrix of the Hankel prior obtained via Theorem 5.1.

Example 7.5.

(White noise) First we consider white noise, which implies that $\bm{\Sigma}=\sigma_{\epsilon}^{2}\,\bm{I}$. The singular values of the prior precision $\nicefrac{{\sqrt{\bm{P}_{0}^{-1}}}}{{\sigma_{P}}}$, posterior precision $\begin{pmatrix}\nicefrac{{\bm{\Phi}^{T}}}{{\sigma_{\epsilon}}}&\nicefrac{{\sqrt{\bm{P}_{0}^{-1}}^{T}}}{{\sigma_{P}}}\end{pmatrix}^{T}$, and likelihood precision $\nicefrac{{\bm{\Phi}}}{{\sigma_{\epsilon}}}$ are shown in Figure 1(a). They provide us with insight into how the prior, posterior and likelihood relate to each other. The likelihood $p(\bm{y}|\bm{w})$ only has 50 measurements and gives all of them equal weight. The prior $p(\bm{w})$ on the other hand only considers 19 nonzero values as a $10\times 10$ Hankel matrix has 19 distinct entries. Given the relatively high noise variance compared to the prior, the posterior $p(\bm{w}|\bm{y})$ “follows” the prior for the first 19 singular values.

Figure 1(a): White-noise case. Given the relatively high noise variance the posterior follows the prior for the first 19 singular values.

A prior mean is obtained by averaging over the nonzero antidiagonals of the measurements and using those averages to construct a Hankel matrix. We now compute three different estimates and compare them to the ground truth. The first estimate is obtained from (3) with a backslash solve. A second estimate is computed by truncating the SVD of $\begin{pmatrix}\nicefrac{{\bm{\Phi}^{T}}}{{\sigma_{\epsilon}}}&\nicefrac{{\sqrt{\bm{P}_{0}^{-1}}^{T}}}{{\sigma_{P}}}\end{pmatrix}^{T}$ to rank 19 in equation (3). The third estimate is the maximum likelihood estimate. For each of these estimates we show the relative error in Table 1.
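The construction of the prior mean by antidiagonal averaging can be sketched as follows in Python. The function name `hankel_prior_mean`, the zero-based index pairs, and the choice to fill antidiagonals without any measurement with 0 are assumptions of this illustration; the text does not specify how unmeasured antidiagonals are handled.

```python
def hankel_prior_mean(J, measurements):
    """Average the measured entries on each antidiagonal and fill a J x J Hankel matrix.

    measurements: list of ((j1, j2), value) pairs with zero-based indices.
    Antidiagonals without any measurement are set to 0 (an assumption).
    """
    sums, counts = {}, {}
    for (j1, j2), value in measurements:
        s = j1 + j2                               # antidiagonal index
        sums[s] = sums.get(s, 0.0) + value
        counts[s] = counts.get(s, 0) + 1
    return [[sums[j1 + j2] / counts[j1 + j2] if (j1 + j2) in counts else 0.0
             for j2 in range(J)] for j1 in range(J)]

# Toy example: two noisy measurements on the same antidiagonal and one in the corner.
W0 = hankel_prior_mean(3, [((0, 1), 1.0), ((1, 0), 3.0), ((2, 2), 5.0)])
```

In the toy example the two measurements on the first antidiagonal are averaged, so both `W0[0][1]` and `W0[1][0]` receive their mean.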

Table 1: Relative errors for three different Hankel matrix completion estimates $\bm{\hat{w}}$. Smallest relative error is indicated in bold.

	backslash	truncated SVD	max-likelihood
$\frac{\|\bm{w}-\bm{\hat{w}}\|_{2}}{\|\bm{w}\|_{2}}$ (white noise)	$0.160$	$\mathbf{0.137}$	$0.614$
$\frac{\|\bm{w}-\bm{\hat{w}}\|_{2}}{\|\bm{w}\|_{2}}$ (Hankel noise)	$0.235$	$\mathbf{0.137}$	$0.604$
$\frac{\|\bm{H}\bm{\hat{w}}-\bm{\hat{w}}\|_{2}}{\|\bm{\hat{w}}\|_{2}}$	$0.12$	$6.3\text{e-}7$	$0.80$

Adding the Hankel prior shows a clear improvement in the completed Hankel matrix. The relative error is about 4 times smaller with the prior than without it. Since the first 19 singular values of the posterior are equal to the singular values of the prior, one could expect the estimated posterior mean $\bm{w}_{+}$ obtained from truncating the SVD to the first 19 singular values to be Hankel. In order to confirm this we also compute the relative Hankel error $\nicefrac{{\|\bm{H}\,\bm{\hat{w}}-\bm{\hat{w}}\|_{2}}}{{\|\bm{\hat{w}}\|_{2}}}$ for the three estimates in Table 1, where $\bm{H}$ is the Hankel permutation matrix. Restricting the posterior mean to lie in a subspace spanned by the first 19 right singular vectors indeed enforces a Hankel structure.

Example 7.6.

(Hankel distributed noise) To investigate the effect of the noise covariance on the estimates we now consider noise $\bm{\epsilon}$ that also has a Hankel structure. In other words, the covariance matrix for $p(\bm{\epsilon})$ is $\sigma_{\epsilon}^{2}\,\bm{P}_{0}$, whereas the prior covariance is $\sigma_{P}^{2}\,\bm{P}_{0}$. With the noise being Hankel, the perturbation $\bm{\epsilon}$ of $\bm{w}$ has a Hankel structure as well. This can be modeled via the forward model $\bm{y}=\bm{\Phi}(\bm{w}+\bm{\epsilon})$, where now $p(\bm{\Phi}\bm{\epsilon})=\mathcal{N}(\bm{0},\sigma_{\epsilon}^{2}\,\bm{\Phi}\,\bm{P}_{0}\,\bm{\Phi}^{T})$. Figure 1(b) shows the singular values of the square-root precision matrices. The nonzero singular values of the likelihood now form 2 plateaus. Again, the posterior follows the prior for the first 19 singular values. Since measurements of entries along the same antidiagonal are now identical, less information can be extracted from the measurements. This explains the first drop of Figure 1(b) at the 19th singular value for both the likelihood and posterior. Less information also means that we can expect our estimate to be worse compared to the white noise case. The relative errors are now indeed higher, as seen in Table 1. Note however that the estimate obtained by truncating the SVD remains the same.

7.3 Bayesian learning of MNIST classifier

In this application we learn a classifier for images of $10$ handwritten digits. The classifier is trained on the MNIST data [17], which consists of $60,000$ pictures for training and $10,000$ pictures for testing. Each picture $\bm{x}_{n}$ is of size $28\times 28$. We pick $10,000$ random samples from the training set and convert each picture $\bm{x}_{n}$ into $25^{2}=625$ Random Fourier Features $\bm{\varphi}(\bm{x}_{n})_{j}=\text{Re}(e^{-i\,\bm{v}_{j}^{T}\bm{x}_{n}})$ [24]. The 625 frequency vectors $\bm{v}_{j}$ are sampled from a zero-mean Gaussian with covariance $\nicefrac{{1}}{{5^{2}}}\,\bm{I}$. We use a one-vs-all strategy by learning $10$ classifiers at once. Each classifier is trained to distinguish between $1$ particular class versus all others. The forward model for our $10$ classifiers is then $\bm{y}=\bm{\varphi}(\bm{x})\;\bm{W}+\bm{e}$. Each column of $\bm{W}\in\mathbb{R}^{625\times 10}$ contains the model parameters of $1$ specific classifier. In order to predict the class of a sample $\bm{x}^{*}$ we compute $\bm{y}^{*}=\bm{\varphi}(\bm{x}^{*})\,\bm{W}$ and apply the softmax function

\displaystyle\bm{\sigma}(\bm{y}^{*})_{k}=\frac{e^{y^{*}_{k}}}{\sum_{l=1}^{10}e^{y^{*}_{l}}},\quad k=1,\ldots,10,\quad\bm{\sigma}(\bm{y}^{*})\in\mathbb{R}^{10}.

The prediction is then the class corresponding to the largest entry of $\bm{\sigma}(\bm{y}^{*})$. The 10 classifiers are trained on a training data set of pictures $\bm{X}\in\mathbb{R}^{10,000\times 784}$ and corresponding class labels $\bm{Y}\in\mathbb{R}^{10,000\times 10}$. Our estimate for $\bm{W}$ is the mean of the posterior $p(\bm{W}|\bm{Y},\bm{X})$. The residual $\bm{e}$ is most commonly assumed to be zero-mean white Gaussian noise $p(\bm{e})=\mathcal{N}(0,\sigma_{\epsilon}^{2}\bm{I})$. Likewise, the prior $p(\bm{W})$ is usually assumed to be a zero-mean normal distribution with a uniform scaling covariance matrix $\bm{P}_{0}=\sigma_{P}^{2}\;\bm{I}$. Such a prior is also called Tikhonov regularization. We compare the performance of the Tikhonov prior to other zero-mean $(\bm{A},\bm{b})$-constrained tensor priors (symmetric, Hankel and circulant), constructed using either Theorem 4.6 or Theorem 5.1. The noise variance $\sigma_{\epsilon}^{2}$ is set to a fixed value of 1.
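The prediction step described above (Random Fourier Features, linear model, softmax, and selection of the largest entry) can be sketched in Python as follows. This is an illustrative toy version with hypothetical names (`rff_features`, `softmax`, `predict`) and tiny hand-picked matrices, not the MNIST pipeline of the notebook; it uses the identity $\text{Re}(e^{-i\theta})=\cos\theta$.

```python
import math

def rff_features(x, V):
    """Re(exp(-i v_j^T x)) = cos(v_j^T x) for every frequency vector v_j in V."""
    return [math.cos(sum(v * xi for v, xi in zip(vj, x))) for vj in V]

def softmax(y):
    """Softmax of a list of scores, shifted by the maximum for numerical stability."""
    shift = max(y)
    e = [math.exp(v - shift) for v in y]
    total = sum(e)
    return [v / total for v in e]

def predict(x, V, W):
    """Index of the largest softmax output; W[j][k] weights feature j for class k."""
    phi = rff_features(x, V)
    y = [sum(phi[j] * W[j][k] for j in range(len(phi))) for k in range(len(W[0]))]
    probs = softmax(y)
    return max(range(len(probs)), key=probs.__getitem__)

V = [[0.0, 0.0], [1.0, 0.0]]               # two toy frequency vectors
W = [[1.0, 0.0, -1.0], [0.0, 2.0, 0.0]]    # 2 features, 3 classes
label = predict([0.0, 3.0], V, W)
```

Because the softmax is monotone, the predicted class is the same as the index of the largest entry of $\bm{y}^{*}$ itself.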

The difference between these priors can be investigated by looking at the singular value profiles of the square-root precision matrices of the corresponding posteriors. These are shown in Figure 2(a) for $\sigma_{P}^{2}=10^{-6}$ and in Figure 2(b) for $\sigma_{P}^{2}=10^{-3}$. Being confident in the prior ($\sigma_{P}^{2}=10^{-6}$) has a strong effect on the corresponding posterior, which explains the large differences in singular value profiles. The corresponding classifiers can then be expected to also differ a lot on unseen test data. Indeed, applying the obtained classifiers to the $10,000$ test images yields the fractions of correctly classified images shown in Table 2.

Table 2: Comparison of relative number of correctly classified images for classifiers learned with different priors. Best classifier indicated in bold.

	Tikhonov	symmetric	Hankel	circulant
$\sigma_{P}^{2}=10^{-6}$	$0.650$	$0.880$	$\mathbf{0.917}$	$0.915$
$\sigma_{P}^{2}=10^{-3}$	$0.917$	$0.918$	$\mathbf{0.920}$	$0.919$

All $(\bm{A},\bm{b})$ -constrained priors outperform the conventional Tikhonov prior, with Hankel and circulant tensors having the best performance. By increasing the prior covariance to $\sigma_{P}^{2}=10^{-3}$ all singular value profiles become very similar. The corresponding classifiers have similar performance as seen in Table 2. No significant classification improvement is observed for the Hankel and circulant priors.

8 Conclusions

A whole new class of Bayesian priors has been worked out, which could potentially be applied to a variety of different inverse problems. The focus of this article was mainly on the theoretical foundation and, where possible, we discussed practical implementations without going into much detail. Although the curse of dimensionality when considering tensors of large order and dimension can be completely resolved via the corresponding dual problem, the computational complexity can still become prohibitively large with increasing sample size. To tackle this complexity the possibility to represent the prior mean vector and covariance matrix of these priors as exact low-rank tensor decompositions could be investigated.

Acknowledgments

Many thanks to Frederiek Wesel for valuable discussions and feedback.

References

[1] J. M. Bardsley, Computational Uncertainty Quantification for Inverse Problems: An Introduction to Singular Integrals, SIAM, 2018.
[2] K. Batselier, Low-rank tensor decompositions for nonlinear system identification: A tutorial with examples, IEEE Control Systems Magazine, 42 (2022), pp. 54–74.
[3] K. Batselier, Z. Chen, and N. Wong, Tensor Network alternating linear scheme for MIMO Volterra system identification, Automatica, 84 (2017), pp. 26–35.
[4] K. Batselier and N. Wong, A constructive arbitrary-degree Kronecker product decomposition of tensors, Numerical Linear Algebra with Applications, 24 (2017), p. e2097.
[5] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical computing, SIAM review, 59 (2017), pp. 65–98.
[6] M. Blondel, M. Ishihata, A. Fujino, and N. Ueda, Polynomial networks and factorization machines: New insights and efficient training algorithms, in International Conference on Machine Learning, PMLR, 2016, pp. 850–858.
[7] J. Chung and S. Gazzola, Computational Methods for Large-Scale Inverse Problems: A Survey on Hybrid Projection Methods, SIAM Review, 66 (2024), pp. 205–284.
[8] J. Chung and A. K. Saibaba, Generalized Hybrid Iterative Methods for Large-Scale Bayesian Inverse Problems, SIAM Journal on Scientific Computing, 39 (2017), pp. S24–S46.
[9] H. F. de Groote, On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping, Theoretical Computer Science, 7 (1978), pp. 1–24.
[10] C. L. Epstein, Introduction to the mathematics of medical imaging, SIAM, 2007.
[11] D. F. Gleich, L.-H. Lim, and Y. Yu, Multilinear pagerank, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 1507–1541.
[12] G. H. Golub and C. F. Van Loan, Matrix computations, JHU press, 2013.
[13] P. C. Hansen, J. G. Nagy, and D. P. O’Leary, Deblurring images: matrices, spectra, and filtering, SIAM, 2006.
[14] N. Kargas and N. D. Sidiropoulos, Supervised learning and canonical decomposition of multivariate functions, IEEE Transactions on Signal Processing, 69 (2021), pp. 1097–1107.
[15] C.-Y. Ko, K. Batselier, L. Daniel, W. Yu, and N. Wong, Fast and accurate tensor completion with total variation regularized tensor trains, IEEE Transactions on Image Processing, 29 (2020), pp. 6918–6931.
[16] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM review, 51 (2009), pp. 455–500.
[17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
[18] W. Li and M. K. Ng, On the limiting probability distribution of a transition probability tensor, Linear and Multilinear Algebra, 62 (2014), pp. 362–385.
[19] J. Liu, P. Musialski, P. Wonka, and J. Ye, Tensor completion for estimating missing values in visual data, IEEE transactions on pattern analysis and machine intelligence, 35 (2012), pp. 208–220.
[20] N. Mastronardi, P. Lemmerling, and S. Van Huffel, Fast structured total least squares algorithm for solving the basic deconvolution problem, SIAM Journal on Matrix Analysis and Applications, 22 (2000), pp. 533–553.
[21] A. Novikov, I. Oseledets, and M. Trofimov, Exponential machines, Bulletin of the Polish Academy of Sciences: Technical Sciences, 66 (2018), pp. 789–797.
[22] G. Pillonetto and G. De Nicolao, A new kernel-based approach for linear system identification, Automatica, 46 (2010), pp. 81–93.
[23] Project Jupyter, Matthias Bussonnier, Jessica Forde, Jeremy Freeman, Brian Granger, Tim Head, Chris Holdgraf, Kyle Kelley, Gladys Nalvarte, Andrew Osheroff, M. Pacer, Yuvi Panda, Fernando Perez, Benjamin Ragan Kelley, and Carol Willing, Binder 2.0 - Reproducible, interactive, sharable environments for science at scale, in Proceedings of the 17th Python in Science Conference, Fatih Akici, David Lippa, Dillon Niederhut, and M. Pacer, eds., 2018, pp. 113 – 120.
[24] A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in neural information processing systems, 20 (2007).
[25] S. Särkkä and L. Svensson, Bayesian filtering and smoothing, vol. 17, Cambridge university press, 2023.
[26] E. Stoudenmire and D. J. Schwab, Supervised learning with tensor networks, Advances in neural information processing systems, 29 (2016).
[27] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
[28] F. van der Plas and M. Bocheński, fonsp/pluto.jl: v0.19.42, May 2024.
[29] C. F. Van Loan, The ubiquitous Kronecker product, Journal of computational and applied mathematics, 123 (2000), pp. 85–100.
[30] S. Wahls, V. Koivunen, H. V. Poor, and M. Verhaegen, Learning multidimensional Fourier series with tensor trains, in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2014, pp. 394–398.
[31] F. Wesel and K. Batselier, Large-Scale Learning with Fourier Features and Tensor Decompositions, Advances in Neural Information Processing Systems, 34 (2021), pp. 17543–17554.
[32] C. K. Williams and C. E. Rasmussen, Gaussian processes for machine learning, vol. 2, MIT press Cambridge, MA, 2006.
