Constructing structured tensor priors for Bayesian inverse problems
K. Batselier
Funding: This publication is part of the project Sustainable learning for Artificial Intelligence from noisy large-scale data (with project number VI.Vidi.213.017) which is financed by the Dutch Research Council (NWO).
Abstract
Specifying a prior distribution is an essential part of solving Bayesian inverse problems. The prior encodes a belief on the nature of the solution and this regularizes the problem. In this article we completely characterize a Gaussian prior that encodes the belief that the solution is a structured tensor. We first define the notion of $(\mathbf{A},\mathbf{b})$-constrained tensors and show that they describe a large variety of different structures such as Hankel, circulant, triangular, symmetric, and so on. Then we completely characterize the Gaussian probability distribution of such tensors by specifying its mean vector and covariance matrix. Furthermore, explicit expressions are proved for the covariance matrix of tensors whose entries are invariant under a permutation. These results unlock a whole new class of priors for Bayesian inverse problems. We illustrate how new kernel functions can be designed and efficiently computed and apply our results on two particular Bayesian inverse problems: completing a Hankel matrix from a few noisy measurements and learning an image classifier of handwritten digits. The effectiveness of the proposed priors is demonstrated for both problems. All applications have been implemented as reactive Pluto notebooks in Julia.
keywords: Bayesian inverse problems, structured tensors, tensors, kernel methods
MSC codes: 15A29, 15A69, 62F15
1 Introduction
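We consider a set of $N$ data samples $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$ and the following linear forward model
$$y_n = \langle \boldsymbol{\mathcal{F}}(\mathbf{x}_n), \boldsymbol{\mathcal{W}} \rangle + e_n. \tag{1}$$
Each scalar measurement $y_n$ is obtained from an inner product of a data-dependent tensor $\boldsymbol{\mathcal{F}}(\mathbf{x}_n)$ with a tensor of unknown latent variables $\boldsymbol{\mathcal{W}}$, corrupted by measurement noise $e_n$. Tensors in this context are $D$-dimensional arrays, with vectors and matrices being the most common cases. Vectorizing all tensors and collecting the measurements into a vector $\mathbf{y} \in \mathbb{R}^{N}$ allows (1) to be rewritten into the linear system of equations
$$\mathbf{y} = \mathbf{F}\mathbf{w} + \mathbf{e}. \tag{2}$$
Row $n$ of the matrix $\mathbf{F}$ contains the vectorization of the tensor $\boldsymbol{\mathcal{F}}(\mathbf{x}_n)$. For notational convenience the indication that $\mathbf{F}$ depends on $\mathbf{x}$ is dropped from here on. The inverse problem consists of inferring the latent variables $\mathbf{w}$ from the noisy measurements $\mathbf{y}$. Inverse problems of this kind appear in many different application fields such as machine learning [6, 26, 27, 31, 32], control [2, 3, 22, 25], and signal processing [10, 13, 14, 15, 19, 20, 30]. In this article a Bayesian approach [1] is considered by assuming that $\mathbf{w}$ and $\mathbf{e}$ are random variables. The goal is then to infer the posterior distribution of $\mathbf{w}$ conditioned on the measurements $\mathbf{y}$ using Bayes’ theorem
$$p(\mathbf{w} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathbf{y})}.$$
The distribution $p(\mathbf{w})$ is called the prior and encodes a belief on what $\mathbf{w}$ is before the measurements are known. The main contribution of this article is the complete characterization of a prior that encodes the belief that the corresponding tensor $\boldsymbol{\mathcal{W}}$ is structured. A Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{\Sigma})$ is assumed for the noise distribution with mean vector $\mathbf{0}$ and covariance matrix $\mathbf{\Sigma}$ and likewise for the prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}_0, \mathbf{P}_0)$. The linear forward model (2) combined with the Gaussian assumptions results in a Gaussian posterior with mean vector $\mathbf{w}_+$ and covariance matrix $\mathbf{P}_+$
$$\mathbf{w}_+ = (\mathbf{F}^T \mathbf{\Sigma}^{-1} \mathbf{F} + \mathbf{P}_0^{-1})^{-1} (\mathbf{F}^T \mathbf{\Sigma}^{-1} \mathbf{y} + \mathbf{P}_0^{-1} \mathbf{w}_0), \tag{3}$$
$$\mathbf{P}_+ = (\mathbf{F}^T \mathbf{\Sigma}^{-1} \mathbf{F} + \mathbf{P}_0^{-1})^{-1}. \tag{4}$$
The role of the prior can now be understood from (3) and (4). In the absence of data ($\mathbf{F} = \mathbf{0}$ and $\mathbf{y} = \mathbf{0}$) the posterior equals the prior. In other words, the prior encodes a belief on what the solution of (2) should be before any data is known. A natural question to ask is then what kind of prior to use. In this article we consider a prior encoding the belief that the tensor $\boldsymbol{\mathcal{W}}$ has a structure that is completely determined by a matrix $\mathbf{A}$ and vector $\mathbf{b}$ such that
$$\mathbf{A}\, \operatorname{vec}(\boldsymbol{\mathcal{W}}) = \mathbf{A}\mathbf{w} = \mathbf{b},$$
which we will refer to as $(\mathbf{A},\mathbf{b})$-constrained tensors. The contributions of this article are threefold.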
1. We show how the definition of $(\mathbf{A},\mathbf{b})$-constrained tensors is well-motivated since it encompasses a wide variety of relevant structured tensors. Examples are given for tensors with fixed entries, tensors with known sums of entries and symmetric, Hankel, Toeplitz, circulant, and triangular tensors.
2. In Theorem 3.1 we completely characterize the mean vector and covariance matrix of the prior for $(\mathbf{A},\mathbf{b})$-constrained tensors.
3. In Theorems 4.6, 4.19 and 5.1 we prove explicit expressions for the covariance matrix of tensors whose entries are invariant under a permutation.
These three contributions are important because the prior mean and covariance matrix are necessary to solve the Bayesian inverse problem via equations (3) and (4). Contrary to most solution strategies for linear least squares problems, the matrix inverse in (4) is explicitly required as it forms the posterior covariance. Also note that the dimension of the matrix to invert is $I^D \times I^D$ for a $D$-way tensor with all dimensions equal to $I$, which limits the use of direct solvers to cases of small $I$ and $D$. Hybrid projection methods [7, 8] are a viable alternative for cases where $I$ and $D$ are prohibitively large. Another alternative is to solve the corresponding dual problem, which is described in terms of the so-called kernel matrix $\mathbf{F}\mathbf{P}_0\mathbf{F}^T$. This approach is commonly used in least-squares support vector machines [27] and Gaussian Processes [32] and has a computational complexity of at least $O(N^3)$. When the tensor $\boldsymbol{\mathcal{W}}$ exhibits a low-rank structure, another way to lower the computational complexity of solving (3) is to impose a low-rank tensor structure on the posterior mean $\mathbf{w}_+$ and covariance $\mathbf{P}_+$ [3, 21, 26]. Developing dedicated solution strategies for equations (3) and (4), however, lies outside the scope of this article.
1.1 Notation
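Tensors in this article are multi-dimensional arrays with real entries. We denote scalars by italic letters $a$, vectors by boldface italic letters $\mathbf{a}$, matrices by boldface capitalized italic letters $\mathbf{A}$ and higher-order tensors by boldface calligraphic italic letters $\boldsymbol{\mathcal{A}}$. The vector $\mathbf{e}_i$ denotes a canonical basis vector that has a single nonzero unit entry at position $i$. The vector $\mathbf{1}$ denotes a vector of ones and $\mathbf{I}$ is the unit matrix. The number of indices required to determine an entry of a tensor is called the order of the tensor. A $D$th order or $D$-way tensor is hence denoted $\boldsymbol{\mathcal{A}} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_D}$. An index $i_d$ always satisfies $1 \leq i_d \leq I_d$, where $I_d$ is called the dimension of that particular mode. Tensor entries are denoted $a_{i_1 i_2 \cdots i_D}$. The merger of a set of separate indices is denoted by the single index
$$\overline{i_1 i_2 \cdots i_D} = i_1 + \sum_{d=2}^{D} (i_d - 1) \prod_{k=1}^{d-1} I_k.$$
For a tensor $\boldsymbol{\mathcal{W}}$ we will always assume that the corresponding vector $\mathbf{w} = \operatorname{vec}(\boldsymbol{\mathcal{W}})$. The square root matrix $\sqrt{\mathbf{P}_0}$ of $\mathbf{P}_0$ satisfies per definition
$$\mathbf{P}_0 = \sqrt{\mathbf{P}_0}\, \sqrt{\mathbf{P}_0}^T.$$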
2 $(\mathbf{A},\mathbf{b})$-constrained tensors
Before characterizing the prior we first demonstrate the breadth of $(\mathbf{A},\mathbf{b})$-constrained tensors through eight examples. These examples demonstrate that the definition of $(\mathbf{A},\mathbf{b})$-constrained tensors is well-motivated in that it captures a wide variety of structured tensors.
2.1 Tensors with fixed entries
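A tensor $\boldsymbol{\mathcal{W}}$ with fixed entries can be described as $\mathbf{A}\mathbf{w} = \mathbf{b}$, where row $r$ of the matrix $\mathbf{A}$ is a canonical basis vector $\mathbf{e}_{\overline{i_1 i_2 \cdots i_D}}^T$ that selects entry $w_{i_1 i_2 \cdots i_D}$. The corresponding fixed numerical value of $w_{i_1 i_2 \cdots i_D}$ is then given by $b_r$. Such fixed values are in practice usually zero, for example in triangular or banded matrices. Such structures can also be generalized to the tensor case.
Definition 2.1.
A tensor $\boldsymbol{\mathcal{W}}$ is lower (upper) triangular when $w_{i_1 i_2 \cdots i_D} = 0$ holds for each consecutive index pair $(i_d, i_{d+1})$ such that $i_d < i_{d+1}$ ($i_d > i_{d+1}$).
The characterization of a lower (upper) triangular tensor as an $(\mathbf{A},\mathbf{b})$-constrained tensor is given in the following lemma.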
Lemma 2.2.
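Let $\mathbf{S}$ be the matrix that has on each row a single unit entry for each particular occurrence of $i_d < i_{d+1}$ ($i_d > i_{d+1}$). Lower (upper) triangular tensors are then described by
$$\mathbf{A} = \begin{pmatrix} \mathbf{I} \otimes \cdots \otimes \mathbf{I} \otimes \mathbf{S} \\ \mathbf{I} \otimes \cdots \otimes \mathbf{S} \otimes \mathbf{I} \\ \vdots \\ \mathbf{S} \otimes \mathbf{I} \otimes \cdots \otimes \mathbf{I} \end{pmatrix}$$
and a vector $\mathbf{b}$ of zeros.
Proof 2.3.
The known fixed values of lower (upper) triangular tensors are zero and hence $\mathbf{b}$ is a vector of zeros. Each row of the matrix $\mathbf{A}$ has a single unit entry to select a particular tensor entry for which some consecutive indices satisfy $i_d < i_{d+1}$ ($i_d > i_{d+1}$). A tensor with $D$ indices has $D-1$ consecutive index pairs and therefore $\mathbf{A}$ is partitioned into $D-1$ block rows. Each block row is a Kronecker product of $D-2$ identity matrices with $\mathbf{S}$. The Kronecker product of identity matrices generates all possible index combinations of $D-2$ index values. The matrix factor $\mathbf{S}$ in the Kronecker product adds the remaining 2 indices but only considers index pairs for which $i_d < i_{d+1}$ ($i_d > i_{d+1}$).
The matrix $\mathbf{A}$ that describes tensors with known fixed entries in Lemma 2.2 is sparse and highly structured as demonstrated by the following example.
Example 2.4.
Consider a lower triangular tensor $\boldsymbol{\mathcal{W}} \in \mathbb{R}^{3 \times 3 \times 3}$. The condition $i_d < i_{d+1}$ occurs in 3 cases $(i_d, i_{d+1}) \in \{(1,2), (1,3), (2,3)\}$. Defining the matrix $\mathbf{S} \in \mathbb{R}^{3 \times 9}$ with 3 nonzero entries
$$\mathbf{S} = \begin{pmatrix} \mathbf{e}_4 & \mathbf{e}_7 & \mathbf{e}_8 \end{pmatrix}^T$$
allows us to describe the desired matrix $\mathbf{A} \in \mathbb{R}^{18 \times 27}$ as
$$\mathbf{A} = \begin{pmatrix} \mathbf{I}_3 \otimes \mathbf{S} \\ \mathbf{S} \otimes \mathbf{I}_3 \end{pmatrix}. \tag{5}$$
This particular sparse structure is exploited in Section 3 when a basis for the nullspace of $\mathbf{A}$ needs to be computed. Note that there are actually only 17 zero entries for which $i_1 < i_2$ or $i_2 < i_3$, which implies that the matrix $\mathbf{A}$ from equation (5) counts the case $i_1 < i_2 < i_3$ twice. This, however, does not negatively affect the resulting prior.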
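The row and entry counts of Example 2.4 can be checked with a few lines of code. The following Python sketch is purely illustrative: the function name `lower_triangular_constraints` is hypothetical, and it assumes a column-major vectorization in which the first index runs fastest. Each row of the constraint matrix is represented only by the (0-based) position of its single unit entry.

```python
from itertools import product

def lower_triangular_constraints(dims):
    """List the unit-entry position of every constraint row.

    One row is generated per consecutive index pair (i_d, i_{d+1}) and per
    index tuple with i_d < i_{d+1}. Positions are 0-based linear indices
    under the assumed column-major (first index fastest) vectorization.
    """
    strides = [1]
    for size in dims[:-1]:
        strides.append(strides[-1] * size)
    rows = []
    for d in range(len(dims) - 1):              # one block row per index pair
        for idx in product(*(range(n) for n in dims)):
            if idx[d] < idx[d + 1]:
                rows.append(sum(i * s for i, s in zip(idx, strides)))
    return rows

rows = lower_triangular_constraints((3, 3, 3))
print(len(rows))       # 18 constraint rows
print(len(set(rows)))  # 17 distinct entries are forced to zero
```

The difference between the two counts corresponds to the single index tuple that satisfies both conditions at once and is therefore selected by two rows.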
2.2 Known sum of entries
Tensors for which the sum over all or only particular entries add up to a known value are also quite common in applications. Stochastic tensors are a particular example [11, 18]. Knowing a particular sum of entries can be described as follows.
Lemma 2.5.
Tensors for which the sum over the entries of an index set is a tensor are described by
(6) |
where each matrix in the Kronecker product is per definition
(7) |
The Kronecker product in (6) has as its leftmost factor and runs towards due to the opposite ordering of indices in the Kronecker product.
Proof 2.6.
With the definitions of the matrices the sum over the relevant entries of is written in terms of n-mode products [16, p. 460]
Using the vectorization operation this can be rewritten as
which finalizes the proof.
Example 2.7.
2.3 Eigenvector structure
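Tensors whose vectorization is an eigenvector of a matrix $\mathbf{M}$ with eigenvalue $\lambda$ are described by the constraint $\mathbf{A} = \mathbf{M} - \lambda \mathbf{I}$ and $\mathbf{b} = \mathbf{0}$. An important structure in this article is obtained when $\mathbf{M}$ is a permutation matrix $\mathbf{P}$. Indeed, $\lambda = 1$ then implies that the entries of $\mathbf{w}$ remain invariant under the permutation $\mathbf{P}$. The distinction between $\lambda = 1$ and $\lambda = -1$ is made explicit through the following two definitions.
Definition 2.8.
Let $\mathbf{P}$ be a permutation matrix. A $\mathbf{P}$-invariant tensor $\boldsymbol{\mathcal{W}}$ is defined by
$$\mathbf{P}\mathbf{w} = \mathbf{w}.$$
Likewise, a skew-$\mathbf{P}$-invariant tensor satisfies per definition
$$\mathbf{P}\mathbf{w} = -\mathbf{w}.$$
In this way any particular permutation matrix $\mathbf{P}$ defines a corresponding structured tensor. Next we discuss some prominent examples of $\mathbf{P}$-invariant tensor structures.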
Definition 2.9.
(Cyclic Symmetric tensor [4]) The cyclic index shift permutation matrix of a -way tensor is the permutation matrix
where is the identity matrix and Matlab colon notation is used to denote submatrices. A -invariant tensor is then called a cyclic symmetric tensor.
Defining the vector it can be verified that
In other words, performs a cyclic shift of the indices to the right. When , then uniquely defines symmetric matrices since the cyclic index shift property implies that [29]. The case does not result in a fully symmetric tensor, as for example the required index permutation would not be enforced by . -invariance is therefore a weaker constraint than full symmetry.
Definition 2.10.
(Symmetric tensor) Let be the permutation matrix such that all entries of satisfy , where is any permutation of the indices. A -invariant tensor is per definition a symmetric tensor.
Definition 2.11.
(Centrosymmetric tensor [4]) A -invariant tensor , where is the column-reversed identity matrix, is called a centrosymmetric tensor.
A centrosymmetric tensor satisfies
Probably the most famous tensor that exhibits centrosymmetry is the matrix-matrix multiplication tensor [9].
Definition 2.12.
(Hankel Tensor) Let be the permutation matrix that cyclically permutes all indices with constant index sum . A -invariant tensor is called a Hankel tensor.
The minimal index sum is and maximal index sum is . This implies that consists of permutation cycles and .
Definition 2.13.
(Toeplitz Tensor) Let be the permutation matrix that cyclically permutes all indices , where . A -invariant tensor is called a Toeplitz tensor.
A special case of a Toeplitz tensor is a circulant tensor.
Definition 2.14.
(Circulant Tensor) Let be the permutation matrix that cyclically permutes all indices . If then . A -invariant tensor is called a circulant tensor.
3 Full characterization of the prior distribution
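In this section the Gaussian prior for $(\mathbf{A},\mathbf{b})$-constrained tensors is fully characterized. We also discuss how the square root covariance matrix $\sqrt{\mathbf{P}_0}$ can be computed without explicitly constructing the matrix $\mathbf{A}$ through a block-row partitioning of $\mathbf{A}$.
Theorem 3.1.
The Gaussian distribution of $(\mathbf{A},\mathbf{b})$-constrained tensors is described by a mean vector $\mathbf{w}_0$ such that $\mathbf{A}\mathbf{w}_0 = \mathbf{b}$ and by a covariance matrix $\mathbf{P}_0 = \sqrt{\mathbf{P}_0}\sqrt{\mathbf{P}_0}^T$ such that the columns of $\sqrt{\mathbf{P}_0}$ span the right nullspace of $\mathbf{A}$.
Proof 3.2.
Let $\mathbf{z}$ be a sample of the standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. A sample of the desired Gaussian distribution is then
$$\mathbf{w} = \mathbf{w}_0 + \sqrt{\mathbf{P}_0}\, \mathbf{z},$$
where $\sqrt{\mathbf{P}_0}$ is the matrix square root of the covariance matrix $\mathbf{P}_0$. Any sample being an $(\mathbf{A},\mathbf{b})$-constrained tensor implies
$$\mathbf{A}\mathbf{w} = \mathbf{A}\mathbf{w}_0 + \mathbf{A}\sqrt{\mathbf{P}_0}\, \mathbf{z} = \mathbf{b}. \tag{8}$$
Equation (8) is true for all random samples $\mathbf{z}$ if and only if
$$\mathbf{A}\mathbf{w}_0 = \mathbf{b} \quad \text{and} \quad \mathbf{A}\sqrt{\mathbf{P}_0} = \mathbf{0}.$$
In other words, the mean of the prior also has to satisfy the linear constraint and the columns of $\sqrt{\mathbf{P}_0}$ span the right nullspace of $\mathbf{A}$.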
3.1 Recursive nullspace computation
When the matrix $\mathbf{A}$ is too large to construct explicitly, it is beneficial to compute a basis for its right nullspace recursively. This is possible when considering a partitioning of $\mathbf{A}$ into block rows $\mathbf{A} = \begin{pmatrix} \mathbf{A}_1^T & \mathbf{A}_2^T & \cdots & \mathbf{A}_R^T \end{pmatrix}^T$. Algorithm 1 recursively computes a basis for this nullspace without ever explicitly constructing $\mathbf{A}$ using Theorem 6.4.1 from [12, p. 329].
4 Explicit covariance matrix construction for permutation-invariant tensors
Computing the covariance matrix $\mathbf{P}_0$ via Theorem 3.1 requires a basis for the nullspace of $\mathbf{A}$. For $\mathbf{P}$-invariant tensors it is possible to derive an explicit formula for $\mathbf{P}_0$ as a function of the permutation matrix $\mathbf{P}$, which enables efficient sampling of the prior. Before we can state the main result in Theorem 4.6, we first need to discuss some facts about permutation matrices. An important concept tied to a permutation matrix is its order. Any permutation can be written as a product of disjoint cycles. Each cycle has a particular length, also called the order of the cycle. In this article $K$ will denote the least common multiple of all orders of disjoint cycles of a given permutation.
Definition 4.1.
The order of a permutation matrix $\mathbf{P}$ is defined as the smallest natural number $K$ such that $\mathbf{P}^K = \mathbf{I}$.
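The relation between disjoint cycles and the order of a permutation can be illustrated with a small Python sketch. The function names `cycles_of` and `order_of` are hypothetical, the permutation is stored as a list in which `perm[i]` is the image of entry `i`, and `math.lcm` with several arguments requires Python 3.9 or newer. The example uses the permutation that transposes a 3 x 3 matrix stored column-major, which is the permutation associated with symmetric matrices.

```python
from math import lcm

def cycles_of(perm):
    """Decompose a permutation into disjoint cycles; perm[i] is the image of i."""
    seen = [False] * len(perm)
    cycles = []
    for start in range(len(perm)):
        if not seen[start]:
            cycle, i = [], start
            while not seen[i]:
                seen[i] = True
                cycle.append(i)
                i = perm[i]
            cycles.append(cycle)
    return cycles

def order_of(perm):
    """Least common multiple of all cycle lengths of the permutation."""
    return lcm(*(len(c) for c in cycles_of(perm)))

# Entry (i, j) at linear index i + 3*j is mapped to (j, i) at index j + 3*i.
transpose = [j + 3 * i for j in range(3) for i in range(3)]
print(cycles_of(transpose))  # three fixed points and three 2-cycles
print(order_of(transpose))   # 2
```

The six cycles correspond to the six distinct entries of a symmetric 3 x 3 matrix.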
Skew-$\mathbf{P}$-invariant structures always have an even order $K$.
Lemma 4.2.
A skew-$\mathbf{P}$-invariant structure has an even order $K$.
Proof 4.3.
Skew-$\mathbf{P}$-invariance requires per definition that $\mathbf{P}\mathbf{w} = -\mathbf{w}$. From $\mathbf{P}^K\mathbf{w} = (-1)^K \mathbf{w} = \mathbf{w}$ it follows that $K$ is even, which proves the desired.
Theorem 4.6 will express the desired covariance matrix $\mathbf{P}_0$ as a function of powers of the permutation matrix $\mathbf{P}$. The following two lemmas relating powers of permutation matrices are easily proved.
Lemma 4.4.
Let $\mathbf{P}$ be a permutation matrix of order $K$, then for any $k \in \mathbb{N}$:
$$\mathbf{P}^{K+k} = \mathbf{P}^{k}. \tag{9}$$
Lemma 4.5.
Let $\mathbf{P}$ be a permutation matrix of order $K$, then for any $0 \leq k \leq K$:
$$(\mathbf{P}^{k})^T = \mathbf{P}^{K-k}. \tag{10}$$
Lemma 4.4 follows from $\mathbf{P}^K = \mathbf{I}$. Lemma 4.5 follows from the orthogonality of permutation matrices and from the fact that powers of permutation matrices are still permutation matrices. We now have all ingredients to describe the main result that provides an analytic solution for the covariance matrix $\mathbf{P}_0$ as an average over powers of the permutation matrix $\mathbf{P}$.
Theorem 4.6.
Let $\mathbf{P}$ be a permutation matrix of order $K$. The Gaussian distribution of $\mathbf{P}$-invariant tensors is described by a mean vector $\mathbf{w}_0$ that is $\mathbf{P}$-invariant and covariance matrix
$$\mathbf{P}_0 = \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{P}^{k}. \tag{11}$$
The $\mathbf{P}$-invariance of the mean follows directly from Theorem 3.1. The proof of Theorem 4.6 therefore requires showing that $\mathbf{P}_0$ in (11) is the desired covariance matrix. A matrix $\mathbf{P}_0$ is a covariance matrix if it satisfies the following three sufficient conditions:
1. $\mathbf{P}_0$ has positive diagonal entries,
2. $\mathbf{P}_0$ is symmetric,
3. $\mathbf{P}_0$ is positive (semi-)definite.
Short proofs will now be given for each of these three covariance conditions.
Lemma 4.7.
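The matrix $\mathbf{P}_0$ has positive diagonal entries.
Proof 4.8.
$\mathbf{P}_0$ is per definition a sum of permutation matrices; all diagonal entries of $\mathbf{P}_0$ are therefore either zero or positive. Since $\mathbf{P}^0 = \mathbf{I}$ the diagonal entries are guaranteed to be positive.
Lemma 4.9.
The matrix $\mathbf{P}_0$ is symmetric.
Proof 4.10.
Applying Lemma 4.5 to each term of (11) and using $\mathbf{P}^K = \mathbf{P}^0$ gives
$$\mathbf{P}_0^T = \frac{1}{K} \sum_{k=0}^{K-1} (\mathbf{P}^{k})^T = \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{P}^{K-k} = \frac{1}{K} \sum_{k=1}^{K} \mathbf{P}^{k} = \mathbf{P}_0.$$
The positive semi-definiteness of $\mathbf{P}_0$ follows from its idempotency.
Lemma 4.11.
The matrix $\mathbf{P}_0$ is idempotent, that is $\mathbf{P}_0^2 = \mathbf{P}_0$.
Proof 4.12.
By Lemma 4.4 the exponents $k + l$ with $l = 0, \ldots, K-1$ run over all powers of $\mathbf{P}$ exactly once, so that $\mathbf{P}^{k}\mathbf{P}_0 = \mathbf{P}_0$ for any $k$ and hence
$$\mathbf{P}_0^2 = \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{P}^{k}\, \mathbf{P}_0 = \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{P}_0 = \mathbf{P}_0.$$
The first consequence of $\mathbf{P}_0$ being idempotent is that it is positive semi-definite.
Lemma 4.13.
The matrix $\mathbf{P}_0$ is positive semi-definite.
Proof 4.14.
The two eigenvalue equations
$$\mathbf{P}_0 \mathbf{v} = \lambda \mathbf{v}, \qquad \mathbf{P}_0^2 \mathbf{v} = \lambda^2 \mathbf{v}$$
are actually equal due to $\mathbf{P}_0$ being idempotent. It therefore follows that $\lambda^2 = \lambda$, which implies that the eigenvalues are either 0 or 1. This proves the positive semi-definiteness of $\mathbf{P}_0$.
Having proved that $\mathbf{P}_0$ is a covariance matrix it remains to show that samples drawn from $\mathcal{N}(\mathbf{w}_0, \mathbf{P}_0)$ are $\mathbf{P}$-invariant. From its symmetry and idempotency it follows that $\mathbf{P}_0$ is its own matrix square root $\sqrt{\mathbf{P}_0} = \mathbf{P}_0$.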
Lemma 4.15.
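Every sample drawn from $\mathcal{N}(\mathbf{w}_0, \mathbf{P}_0)$ is $\mathbf{P}$-invariant.
Proof 4.16.
A sample from $\mathcal{N}(\mathbf{w}_0, \mathbf{P}_0)$ can be drawn by computing
$$\mathbf{w} = \mathbf{w}_0 + \mathbf{P}_0\, \mathbf{z},$$
where $\mathbf{z}$ is drawn from a standard normal distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$. The $\mathbf{P}$-invariance of $\mathbf{w}$ follows from
$$\begin{aligned} \mathbf{P}\mathbf{w} - \mathbf{w} &= \mathbf{P}\mathbf{w}_0 + \mathbf{P}\mathbf{P}_0 \mathbf{z} - \mathbf{w}_0 - \mathbf{P}_0 \mathbf{z} \\ &= \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{P}^{k+1} \mathbf{z} - \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{P}^{k} \mathbf{z} \\ &= \frac{1}{K} \Big( \sum_{k=1}^{K} \mathbf{P}^{k} - \sum_{k=0}^{K-1} \mathbf{P}^{k} \Big) \mathbf{z} \\ &= \frac{1}{K} \big( \mathbf{P}^{K} - \mathbf{P}^{0} \big) \mathbf{z} \\ &= \mathbf{0}. \end{aligned}$$
The terms that depend on $\mathbf{w}_0$ cancel due to the $\mathbf{P}$-invariance of $\mathbf{w}_0$. Lemma 4.4 is used to go from line 4 to line 5.
Lemmas 4.7 up to 4.15 constitute the proof of Theorem 4.6. Another consequence of the idempotency of $\mathbf{P}_0$ is that this matrix is its own pseudoinverse.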
Lemma 4.17.
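The pseudoinverse $\mathbf{P}_0^{\dagger}$ satisfies
$$\mathbf{P}_0^{\dagger} = \mathbf{P}_0.$$
Proof 4.18.
The pseudoinverse needs to satisfy the following four properties:
1. $\mathbf{P}_0 \mathbf{P}_0^{\dagger} \mathbf{P}_0 = \mathbf{P}_0$,
2. $\mathbf{P}_0^{\dagger} \mathbf{P}_0 \mathbf{P}_0^{\dagger} = \mathbf{P}_0^{\dagger}$,
3. $(\mathbf{P}_0 \mathbf{P}_0^{\dagger})^T = \mathbf{P}_0 \mathbf{P}_0^{\dagger}$,
4. $(\mathbf{P}_0^{\dagger} \mathbf{P}_0)^T = \mathbf{P}_0^{\dagger} \mathbf{P}_0$.
All these properties are satisfied when assuming $\mathbf{P}_0^{\dagger} = \mathbf{P}_0$ and they follow from the idempotency of $\mathbf{P}_0$. For example, Properties 1 and 2 follow from
$$\mathbf{P}_0 \mathbf{P}_0 \mathbf{P}_0 = \mathbf{P}_0 \mathbf{P}_0 = \mathbf{P}_0.$$
Properties 3 and 4 follow from the symmetry of $\mathbf{P}_0$.
The fact that $\mathbf{P}_0^{\dagger} = \mathbf{P}_0$ is convenient for several reasons. First, no explicit computation of the (pseudo)inverse of $\mathbf{P}_0$ is required in equations (3) and (4). Second, sampling can be done without a matrix square-root computation and without any matrix-vector multiplications. Using Theorem 4.6 the product $\mathbf{P}_0 \mathbf{z}$ can be implemented as a weighted sum of permuted versions of $\mathbf{z}$
$$\mathbf{P}_0 \mathbf{z} = \frac{1}{K} \sum_{k=0}^{K-1} \mathbf{P}^{k} \mathbf{z}.$$
All information of the permutation $\mathbf{P}$ is contained in a vector with as many elements as there are tensor entries, which specifies how each entry gets mapped to the next. Each term of the weighted sum is then computed by successive permutations of $\mathbf{z}$ according to $\mathbf{P}$, with a computational complexity that is linear in the number of tensor entries. The pseudocode for sampling the distribution $\mathcal{N}(\mathbf{w}_0, \mathbf{P}_0)$ is given in Algorithm 2.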
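The sampling procedure described above can be sketched in a few lines of Python. This is not a transcription of Algorithm 2 but a minimal illustration under stated assumptions: the function names `permutation_order` and `sample_invariant` are hypothetical, the prior mean is taken to be zero, the permutation is stored as a list in which `perm[i]` is the entry that position `i` is filled from, and `math.lcm` requires Python 3.9 or newer.

```python
import random
from math import lcm

def permutation_order(perm):
    """Order K of a permutation: least common multiple of its cycle lengths."""
    seen, order = set(), 1
    for start in range(len(perm)):
        length, i = 0, start
        while i not in seen:
            seen.add(i)
            i = perm[i]
            length += 1
        if length:
            order = lcm(order, length)
    return order

def sample_invariant(perm, rng):
    """Draw (1/K) * sum_{k=0}^{K-1} P^k z with z ~ N(0, I) (zero prior mean)."""
    n = len(perm)
    K = permutation_order(perm)
    current = [rng.gauss(0.0, 1.0) for _ in range(n)]
    total = current[:]
    for _ in range(K - 1):
        # One successive permutation of the previous term, linear in n.
        current = [current[perm[i]] for i in range(n)]
        total = [t + c for t, c in zip(total, current)]
    return [t / K for t in total]

# Transposition of a 3 x 3 matrix stored column-major (symmetric matrices).
perm = [j + 3 * i for j in range(3) for i in range(3)]
w = sample_invariant(perm, random.Random(0))
```

Every entry of the sample is the average of the standard normal draws over its own cycle, so the sample is unchanged when the permutation is applied to it. The loop runs $K - 1$ times, which is why this approach is only attractive when the order of the permutation stays small.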
A result similar to Theorem 4.6 can be proven for skew-$\mathbf{P}$-invariant tensors.
Theorem 4.19.
For a permutation $\mathbf{P}$ of even order $K$, the Gaussian distribution of skew-$\mathbf{P}$-invariant tensors is described by a mean vector $\mathbf{w}_0$ that is skew-$\mathbf{P}$-invariant and covariance matrix
$$\mathbf{P}_0 = \frac{1}{K} \sum_{k=0}^{K-1} (-1)^{k}\, \mathbf{P}^{k}. \tag{12}$$
Proof 4.20.
The proof is very similar to that of Theorem 4.6. The diagonal entries being nonnegative can be derived from the following argument. The permutation matrix $\mathbf{P}$ itself consists of cyclic permutations, with either even or odd order. If a cyclic permutation has an even order $m$, then $\mathbf{P}^{k}$ will have ones on the diagonal for elements of the cycle whenever $k$ is a multiple of $m$. This will occur $K/m$ times in (12), always with a positive sign. If a cyclic permutation has odd order $m$, then the diagonal entries of $\mathbf{P}^{k}$ will come in equal amounts of negative and positive contributions, which results in a zero contribution to the diagonal. The total effect of all cyclic permutations then adds up to either zero or positive diagonal entries. Symmetry is proven by using Lemma 4.5 and the fact that $K$ is even: an even order $k$ gets mapped to another even order $K-k$ and an odd order $k$ gets mapped to an odd order $K-k$. Hence,
$$\mathbf{P}_0^T = \frac{1}{K} \sum_{k=0}^{K-1} (-1)^{k} (\mathbf{P}^{k})^T = \frac{1}{K} \sum_{k=0}^{K-1} (-1)^{K-k}\, \mathbf{P}^{K-k} = \mathbf{P}_0.$$
The idempotency of $\mathbf{P}_0$ follows a similar proof as for the case of $\mathbf{P}$-invariance. Writing out $\mathbf{P}_0^2$ in terms of $\mathbf{P}$ and applying Lemma 4.4 results in
$$\mathbf{P}_0^2 = \frac{1}{K^2} \sum_{k=0}^{K-1} \sum_{l=0}^{K-1} (-1)^{k+l}\, \mathbf{P}^{k+l} = \frac{1}{K} \sum_{k=0}^{K-1} (-1)^{k}\, \mathbf{P}^{k} = \mathbf{P}_0,$$
which proves that $\mathbf{P}_0$ is idempotent.
Theorems 4.6 and 4.19 are practical when the order $K$ of the permutation matrix stays small compared to the number of tensor entries. For Hankel structures this is unfortunately not the case. Consider for example a Hankel matrix whose corresponding permutation matrix has permutation cycles ranging from length 1 up to 20. The order $K$ is then the least common multiple of $1, 2, \ldots, 20$. Fortunately, it is possible to explicitly construct a sparse matrix $\sqrt{\mathbf{P}_0}$ of orthogonal columns such that $\mathbf{P}_0 = \sqrt{\mathbf{P}_0}\sqrt{\mathbf{P}_0}^T$.
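To get a feeling for how quickly the order grows for Hankel structures, the cycle lengths can be enumerated directly. The sketch below is illustrative only: the function name `hankel_cycle_lengths` is hypothetical, and the 20 x 20 size is an assumption chosen so that the cycle lengths range from 1 up to 20 as in the text. `math.lcm` with several arguments requires Python 3.9 or newer.

```python
from math import lcm

def hankel_cycle_lengths(rows, cols):
    """Number of entries on each antidiagonal (constant index sum)."""
    lengths = {}
    for i in range(rows):
        for j in range(cols):
            lengths[i + j] = lengths.get(i + j, 0) + 1
    return [lengths[s] for s in sorted(lengths)]

lengths = hankel_cycle_lengths(20, 20)
print(len(lengths))   # 39 cycles, one per antidiagonal
print(lcm(*lengths))  # 232792560
```

Averaging over more than two hundred million powers of the permutation matrix is clearly impractical for a matrix with only 400 entries, which motivates the cycle-based construction of the next section.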
5 Sparse square root covariance matrix construction for permutation-invariant tensors
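Every permutation can be decomposed in terms of cyclic permutations. These cyclic permutations partition the set of all tensor entries into disjoint sets and allow for an alternative construction of $\sqrt{\mathbf{P}_0}$, where the resulting matrix is sparse and consists of orthogonal columns.
Theorem 5.1.
Let $\mathbf{P}$ be a permutation matrix that consists of $C$ permutation cycles and let $\mathcal{C}_c$ denote the $c$th cycle, where the number of tensor entries in $\mathcal{C}_c$ is denoted $n_c$. Then the range of the matrix $\sqrt{\mathbf{P}_0}$ with $C$ columns such that
$$\big(\sqrt{\mathbf{P}_0}\big)_{i,c} = \begin{cases} 1/\sqrt{n_c} & \text{if tensor entry } i \text{ belongs to } \mathcal{C}_c, \\ 0 & \text{otherwise}, \end{cases} \tag{13}$$
spans the eigenspace of $\mathbf{P}$ corresponding to an eigenvalue $\lambda = 1$. In other words, $\mathbf{P}\sqrt{\mathbf{P}_0} = \sqrt{\mathbf{P}_0}$. Also, $\sqrt{\mathbf{P}_0}^T \sqrt{\mathbf{P}_0} = \mathbf{I}$.
Proof 5.2.
The equality $\mathbf{P}\sqrt{\mathbf{P}_0} = \sqrt{\mathbf{P}_0}$ follows from each column of $\sqrt{\mathbf{P}_0}$ containing identical nonzero values at the tensor entries of a particular permutation cycle of $\mathbf{P}$. The orthogonality follows directly from the permutation cycles being disjoint and each column of $\sqrt{\mathbf{P}_0}$ being unit-norm due to the scaling with $1/\sqrt{n_c}$.
A basis for the skew-$\mathbf{P}$-invariant eigenspace can be built in a similar way by retaining the cycles of even order and alternating the sign of the entries in each column.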
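Because each column of the sparse square root is constant on one cycle and zero elsewhere, multiplying it with a standard normal vector never requires forming the matrix. The following Python sketch illustrates this for a Hankel matrix; the function names `hankel_cycles` and `sample_from_cycles` are hypothetical, a zero prior mean and column-major linear indexing are assumed, and the 10 x 10 size is chosen only as an example.

```python
import random
from math import sqrt

def hankel_cycles(rows, cols):
    """Group the column-major linear indices of a matrix by index sum."""
    cycles = {}
    for j in range(cols):
        for i in range(rows):
            cycles.setdefault(i + j, []).append(i + rows * j)
    return [cycles[s] for s in sorted(cycles)]

def sample_from_cycles(cycles, n, rng):
    """Sparse square root times z: one normal draw per cycle, scaled by 1/sqrt(n_c)."""
    w = [0.0] * n
    for cycle in cycles:
        value = rng.gauss(0.0, 1.0) / sqrt(len(cycle))
        for idx in cycle:
            w[idx] = value
    return w

cycles = hankel_cycles(10, 10)
w = sample_from_cycles(cycles, 100, random.Random(1))
print(len(cycles))  # 19 cycles, hence 19 distinct entries
```

The cost is linear in the number of tensor entries and independent of the order of the permutation, in contrast to averaging over powers of the permutation matrix.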
6 Solving the inverse problem
In this section three different aspects of solving the inverse problem are discussed. First, we explain how the prior covariance matrices of $(\mathbf{A},\mathbf{b})$-constrained tensors can be parameterized. Second, we briefly discuss a change of variables, originally proposed in [8], to exploit fast implementations of the matrix vector product with $\sqrt{\mathbf{P}_0}$. The third aspect relates to kernel methods, where $(\mathbf{A},\mathbf{b})$-constrained tensor priors are used to define new structured tensor kernel functions.
6.1 Parameterizing the prior covariance matrix
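The covariance matrix $\mathbf{P}_0$ as described in Theorems 3.1, 4.6 and 5.1 encodes the structure of the $(\mathbf{A},\mathbf{b})$-constrained tensor without having any free parameters to quantify the importance of the prior relative to the likelihood $p(\mathbf{y} \mid \mathbf{w})$. Such free parameters are often called hyperparameters. Suppose for example that through Theorem 3.1 an orthogonal basis $\mathbf{V}_2$ for the nullspace of $\mathbf{A}$ is computed from its singular value decomposition (SVD)
$$\mathbf{A} = \mathbf{U}\, \mathbf{D} \begin{pmatrix} \mathbf{V}_1 & \mathbf{V}_2 \end{pmatrix}^T,$$
where the columns of $\mathbf{V}_2$ are the right singular vectors that correspond to zero singular values. A desired square-root covariance matrix is then $\sqrt{\mathbf{P}_0} = \mathbf{V}_2 \mathbf{M}$, where $\mathbf{M}$ is any invertible matrix. The nullity of $\mathbf{A}$ can be interpreted as the total number of distinct elements in the $(\mathbf{A},\mathbf{b})$-constrained tensor $\boldsymbol{\mathcal{W}}$. The matrix $\mathbf{M}$ can be interpreted as the square-root covariance matrix of those variables since
$$\mathbf{P}_0 = \mathbf{V}_2\, \mathbf{M} \mathbf{M}^T\, \mathbf{V}_2^T.$$
The matrix $\mathbf{V}_2$ is then to be understood as “projecting” the covariance matrix $\mathbf{M}\mathbf{M}^T$ of the underlying variables to the entries of the $(\mathbf{A},\mathbf{b})$-constrained tensor. Parameterizing $\mathbf{M}$ in terms of a single hyperparameter $\sigma$ as $\mathbf{M} = \sigma \mathbf{I}$ implies that these variables are independent and have equal variance $\sigma^2$. Correlations between the variables can be modeled by for example parameterizing $\mathbf{M}$ as a lower triangular matrix. The values of these hyperparameters can be learned from data through cross-validation, marginal likelihood optimization or a hierarchical Bayesian approach [27, 32].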
6.2 Change of variables
Squaring the condition number when solving the normal equation of (3) can be avoided by solving its square-root version
instead. When constructing the square-root of the inverse prior covariance matrix is difficult then a change of variables can be used to avoid their construction [8]. By defining and the square-root linear system is transformed into
The desired posterior mean can then be recovered from . This formulation is especially beneficial when the matrix vector product can be implemented in a computationally efficient manner, for example using Algorithm 2.
6.3 Structured tensor kernel functions
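When the tensor $\boldsymbol{\mathcal{W}}$ is much larger than the data size $N$, the computational complexity of computing (3) is replaced with a complexity of at least $O(N^3)$ by solving the corresponding dual problem
$$\mathbf{w}_+ = \mathbf{w}_0 + \mathbf{P}_0 \mathbf{F}^T (\mathbf{F} \mathbf{P}_0 \mathbf{F}^T + \mathbf{\Sigma})^{-1} (\mathbf{y} - \mathbf{F}\mathbf{w}_0).$$
An additional benefit is that no matrix inverse of $\mathbf{P}_0$ is required so that Theorems 3.1, 4.6 and 5.1 can be applied directly. The matrix $\mathbf{F} \mathbf{P}_0 \mathbf{F}^T$ is called the kernel matrix and each entry is per definition the evaluation of a kernel function
$$k(\mathbf{x}, \mathbf{x}') = \operatorname{vec}(\boldsymbol{\mathcal{F}}(\mathbf{x}))^T\, \mathbf{P}_0\, \operatorname{vec}(\boldsymbol{\mathcal{F}}(\mathbf{x}')).$$
Choosing $\mathbf{P}_0$ as a covariance matrix of an $(\mathbf{A},\mathbf{b})$-constrained tensor allows us to define new kernel functions. The kernel trick in machine learning refers to the fact that the kernel function can be evaluated without ever explicitly computing the possibly large feature vectors $\operatorname{vec}(\boldsymbol{\mathcal{F}}(\mathbf{x}))$. In the case of $\mathbf{P}$-invariant tensors one can exploit the particular structure of $\mathbf{P}_0$ as described in Theorem 4.6 or use Algorithm 2 to achieve this goal.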
Example 6.1.
(Centrosymmetric polynomial kernel) Let and . The polynomial kernel function is defined as
The expression is obtained from writing the identity matrix as a Kronecker product of smaller identity matrices and applying the mixed product property. The polynomial kernel function can therefore be interpreted as using a unit covariance matrix . We can now define the centrosymmetric polynomial kernel function by using the polynomial feature vectors and replacing with the covariance matrix of centrosymmetric tensors. From Theorem 4.6 it then follows that
Also here the explicit construction of is avoided by writing the matrix as a Kronecker product of the smaller permutation matrix with itself times and using the mixed-product property.
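The way the mixed-product property avoids the large feature vectors can be illustrated numerically. The sketch below is a hedged illustration rather than the kernel of Example 6.1 verbatim: it assumes homogeneous polynomial features given by the $d$-fold Kronecker product of the input with itself (no offset term), a covariance matrix equal to the average of the identity and the reversal permutation (order 2), and hypothetical function names.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def centrosymmetric_poly_kernel(x, y, degree):
    """0.5 * [(x.y)^d + (x.Jy)^d], with J the reversal of the input vector."""
    return 0.5 * (dot(x, y) ** degree + dot(x, list(reversed(y))) ** degree)

def kron_power(x, degree):
    """Explicit feature vector: Kronecker product of x with itself `degree` times."""
    feats = [1.0]
    for _ in range(degree):
        feats = [f * a for f in feats for a in x]
    return feats

def explicit_kernel(x, y, degree):
    """Same kernel via the full feature vectors; only feasible for small sizes."""
    fx, fy = kron_power(x, degree), kron_power(y, degree)
    return 0.5 * (dot(fx, fy) + dot(fx, fy[::-1]))

x, y = [1.0, 2.0, 3.0], [0.5, -1.0, 2.0]
print(centrosymmetric_poly_kernel(x, y, 3))  # 47.25
print(explicit_kernel(x, y, 3))              # agrees up to rounding
```

The closed form costs two inner products of the input vectors, whereas the explicit version works with feature vectors whose length grows exponentially in the degree.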
7 Applications
In this section we demonstrate the use of Theorems 3.1, 4.6, and 5.1 in three different applications. Practical implementations of how to sample various $(\mathbf{A},\mathbf{b})$-constrained tensor priors are explained in Application 7.1. We consider lower triangular tensors, tensors for which the sum over the last index adds up to 1, symmetric tensors and Hankel tensors. Application 7.2 considers the problem of completing a Hankel matrix from noisy partial measurements by solving it as a Bayesian inverse problem. The estimate of the completed Hankel matrix when using a Hankel prior is compared to the estimate where no prior is used. In Application 7.3 learning a classifier for handwritten digits is solved as a Bayesian inverse problem. The classifier obtained with the commonly used Tikhonov prior is compared to several $(\mathbf{A},\mathbf{b})$-constrained tensor priors.
All applications have been implemented as reactive Pluto [28] notebooks in Julia [5] and are publicly available at https://github.com/TUDelft-DeTAIL/AbTensors. The notebook files can be freely downloaded and run on your local machine in Julia. An alternative way to use these notebooks that does not require the installation of Julia is to run them in the cloud via Binder [23]. This can be done by clicking on each of the links on the main Github page. Please note that it can take over 10 minutes for Binder to download and compile all required packages.
As discussed in section 6.1 we parameterized the prior covariance matrix with a single hyperparameter in both Applications 7.2 and 7.3.
7.1 Sampling structured tensor priors
In this first application we demonstrate how Theorems 3.1, 4.6 and 5.1 are used to sample the priors of different $(\mathbf{A},\mathbf{b})$-constrained tensors.
Example 7.1.
(Lower triangular tensors) The first example of an $(\mathbf{A},\mathbf{b})$-constrained tensor considered here is the lower triangular tensor. From Lemma 2.2 we know that triangular tensors are described by
and zero vector . The square root of the covariance matrix is built up by applying Algorithm 1, which considers only 1 block row of at a time. The whole matrix is therefore never explicitly made. In the notebook it is possible to sample lower triangular tensors with orders ranging from 2 up to 5 and dimensions 2 up to 6 by moving the corresponding sliders.
Example 7.2.
(Tensors with known sum of entries) In this example we sample tensors for which the sum over the last index always adds up to a value of 1:
From Lemma 2.5 we know that in this case . It is straightforward to verify that a basis for the right nullspace of is
Sampling the prior can now be done without ever constructing a basis for the nullspace explicitly since
It is therefore sufficient to sample from a standard normal distribution and do the operations on the partitions of as described above to generate the desired sample. In the notebook one can change the order of the sampled tensor from 2 up to 5 and dimension from 5 up to 10 by using the corresponding sliders.
Example 7.3.
(Symmetric tensors) Symmetric tensors are tensors for which entries are invariant under any index permutation. The permutation matrix in the symmetric case consists of cyclic permutations where each cycle contains the entry and all entries with corresponding index permutations . For example, in the case and the permutation matrix consists of cyclic permutations
The order of $\mathbf{P}$ in this case is $K = 2$ since $\mathbf{P}^2 = \mathbf{I}$. According to Theorem 4.6 we then have that the square root of the covariance matrix is $\sqrt{\mathbf{P}_0} = \frac{1}{2}(\mathbf{I} + \mathbf{P})$. When $D = 3$, the order of the corresponding permutation matrix is $K = 6$ and hence $\sqrt{\mathbf{P}_0} = \frac{1}{6} \sum_{k=0}^{5} \mathbf{P}^{k}$. Sampling from these priors is done via Algorithm 2 where a standard normal sample is generated and permuted $K - 1$ times. The notebook allows you to sample symmetric tensors of orders 2 and 3 and dimensions 3 up to 10.
Example 7.4.
(Hankel tensors) Hankel tensors are tensors for which entries with a constant index sum have the same numerical value. The order of the corresponding permutation matrix grows very quickly. For example, when and the order is the least common multiple of . Theorem 5.1, however, allows us to construct a matrix , where is the number of permutation cycles. For Hankel tensors we have that . The notebook allows you to sample Hankel tensors of order 2 up to 4 and dimensions 3 up to 10.
7.2 Completion of a Hankel matrix from noisy measurements
Hankel matrices are very common in signal processing and control theory. In this application a Bayesian approach will be used to complete a Hankel matrix based on noisy incomplete measurements. For this we use the following forward model , where is the vectorization of the true underlying Hankel matrix. The matrix selects random entries of with equal probability. Each row of contains a single nonzero unit-valued entry at a random location. The number of measurements can be changed through a slider in the notebook. The vector is a vector of zero-mean Gaussian noise. Given and , a Bayesian estimate of the underlying Hankel matrix can be obtained from (3) as the posterior mean . Another commonly used estimate is the maximum likelihood estimate, which is the that maximizes the likelihood . We compare two posterior estimates with the maximum likelihood estimate under two different assumptions on the noise covariance. We fix the sampling rate at and choose . The prior covariance matrix is set to , where is covariance matrix of the Hankel prior obtained via Theorem 5.1.
Example 7.5.
(White noise) First we consider white noise, which implies that the noise covariance matrix is a scaled identity matrix. The singular values of the prior precision, posterior precision, and likelihood precision are shown in Figure 1(a). They provide us with insight into how the prior, posterior and likelihood relate to each other. The likelihood only has 50 measurements and gives all of them equal weight. The prior on the other hand only considers 19 nonzero values as a $10 \times 10$ Hankel matrix has 19 distinct entries. Given the relatively high noise variance compared to the prior, the posterior “follows” the prior for the first 19 singular values.
Figure 1: Singular values of the prior, posterior and likelihood precision matrices for (a) white noise and (b) Hankel distributed noise.
A prior mean is obtained by averaging over the nonzero antidiagonals of the measurements and using those averages to construct a Hankel matrix. We now compute three different estimates and compare them to the ground truth. The first estimate is obtained from (3) with a backslash solve. A second estimate is computed by truncating the SVD of to rank 19 in equation (3). The third estimate is the maximum likelihood estimate. For each of these estimates we show the relative error in Table 1.
backslash | truncated SVD | max-likelihood | |
---|---|---|---|
(white noise) | 0.137 | ||
(Hankel noise) | 0.137 | ||
Adding the Hankel prior shows a clear improvement on the completed Hankel matrix. The relative error is 4 times smaller from the inclusion of the prior. Since the first 19 singular values of the posterior are equal to the singular values of the prior one could expect the estimated posterior mean obtained from truncating the SVD to the first 19 singular values to be Hankel. In order to confirm this we also compute the relative Hankel error for the three estimates in Table 1, where is the Hankel permutation matrix. Restricting the posterior mean to lie in a subspace spanned by the first 19 right singular vectors indeed enforces a Hankel structure.
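The construction of the prior mean in Example 7.5, averaging the measurements along each antidiagonal and filling a Hankel matrix with those averages, can be sketched as follows. The function name `hankel_prior_mean` and the dictionary-based storage of the measurements are hypothetical, and setting antidiagonals without any measurement to zero is an assumption made only for this illustration.

```python
def hankel_prior_mean(measurements, rows, cols):
    """Average measured entries per antidiagonal and build a Hankel matrix.

    measurements maps 0-based (row, column) pairs to measured values.
    Antidiagonals without any measurement are set to zero (assumption).
    """
    sums, counts = {}, {}
    for (i, j), value in measurements.items():
        sums[i + j] = sums.get(i + j, 0.0) + value
        counts[i + j] = counts.get(i + j, 0) + 1
    return [[sums[i + j] / counts[i + j] if (i + j) in counts else 0.0
             for j in range(cols)]
            for i in range(rows)]

mean = hankel_prior_mean({(0, 1): 2.0, (1, 0): 4.0, (1, 1): 5.0}, 2, 2)
print(mean)  # [[0.0, 3.0], [3.0, 5.0]]
```

By construction the result has constant antidiagonals, so its vectorization satisfies the Hankel constraint that Theorem 3.1 imposes on the prior mean.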
Example 7.6.
(Hankel distributed noise) To investigate the effect of the noise covariance on the estimates we now consider noise that also has a Hankel structure. In other words, the covariance matrix for is , whereas the prior covariance is . Since the noise is Hankel, the perturbation of will have a Hankel structure as well. This can be modeled via the forward model , where now . Figure 1(b) shows the singular values of the square-root precision matrices. The nonzero singular values of the likelihood now form 2 plateaus. Again, the posterior follows the prior for the first 19 singular values. Since measurements of entries along the same antidiagonal are now identical, less information can be extracted from the measurements. This explains the first drop of Figure 1(b) at the 19th singular value for both the likelihood and posterior. Less information also means that we can expect our estimate to be worse than in the white noise case. The relative errors are now indeed higher, as seen in Table 1. Note however that the estimate obtained by truncating the SVD remains the same.
7.3 Bayesian learning of MNIST classifier
In this application we learn a classifier for images of handwritten digits. The classifier is trained on the MNIST data [17], which consists of 60,000 pictures for training and 10,000 pictures for testing. Each picture is of size 28 × 28. We pick random samples from the training set and convert each picture into Random Fourier Features [24]. The 625 frequency vectors are sampled from a zero-mean Gaussian with variance . We use a one-vs-all strategy by learning 10 classifiers at once. Each classifier is trained to distinguish one particular class from all others. The forward model for our classifiers is then . Each column of contains the model parameters of a specific classifier. In order to predict the class of a sample we compute and apply the softmax function
The prediction is then the class with maximal . The 10 classifiers are trained on a training data set of pictures and corresponding class labels . Our estimate for is the mean of the posterior . The residual is most commonly assumed to be zero-mean white Gaussian noise . Likewise, the prior is usually assumed to be a zero-mean normal distribution with a uniform scaling covariance matrix . Such a prior corresponds to Tikhonov regularization. We compare the performance of the Tikhonov prior to other zero-mean -constrained tensor priors (symmetric, Hankel and circulant), constructed using either Theorem 4.5 or Theorem 5.1. The noise variance is set to a fixed value of 1.
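The prediction pipeline described above (random Fourier features, a linear one-vs-all model, softmax, and an arg-max over classes) can be sketched as follows. This is a minimal illustration, not the notebook implementation: it uses the common cosine-with-random-phase variant of random Fourier features, and the frequency standard deviation of 0.1, the random weight matrix `W`, and all function names are hypothetical placeholders.

```python
import numpy as np

def random_fourier_features(X, frequencies, phases):
    """Map each row of X to cosine random Fourier features."""
    num_features = frequencies.shape[0]
    return np.sqrt(2.0 / num_features) * np.cos(X @ frequencies.T + phases)

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=-1, keepdims=True)

def predict(X, frequencies, phases, W):
    """Return the predicted class per sample and the class probabilities."""
    Z = random_fourier_features(X, frequencies, phases)
    probs = softmax(Z @ W)          # one column of W per classifier
    return probs.argmax(axis=1), probs

rng = np.random.default_rng(0)
num_pixels, num_features, num_classes = 28 * 28, 625, 10
X = rng.random((5, num_pixels))                                       # 5 fake flattened pictures
frequencies = 0.1 * rng.standard_normal((num_features, num_pixels))   # hypothetical std
phases = rng.uniform(0.0, 2.0 * np.pi, num_features)
W = rng.standard_normal((num_features, num_classes))                  # placeholder for the posterior mean
labels, probs = predict(X, frequencies, phases, W)
```

In practice `W` would be replaced by the posterior mean obtained under the chosen prior; the random matrix here only exercises the shapes.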
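![Figure 2(a): singular value profiles of the square-root posterior precision matrices for the different priors](extracted/5690821/svals_precision3.png)
![Figure 2(b): singular value profiles of the square-root posterior precision matrices for the different priors, with increased prior covariance](extracted/5690821/svals_precision4.png)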
The differences between these priors can be investigated by looking at the singular value profiles of the square-root precision matrices of the corresponding posteriors. These are shown in Figure 2(a) for and in Figure 2(b) for . Being confident in the prior () has a strong effect on the corresponding posterior, which explains the large differences in singular value profiles. The corresponding classifiers can then be expected to also differ considerably on unseen test data. Indeed, applying the obtained classifiers to test images yields the fractions of correctly classified images shown in Table 2.
Tikhonov | symmetric | Hankel | circulant | |
---|---|---|---|---|
0.917 | ||||
0.920 |
All -constrained priors outperform the conventional Tikhonov prior, with Hankel and circulant tensors having the best performance. By increasing the prior covariance to all singular value profiles become very similar. The corresponding classifiers then perform similarly, as seen in Table 2, and the Hankel and circulant priors no longer yield a significant classification improvement.
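The dependence of the posterior singular value profile on the prior confidence can be illustrated for the Tikhonov case alone. The sketch below assumes white noise and a uniformly scaled prior covariance, so that the posterior precision is the usual sum of the data term and the prior precision; the data matrix, the variances, and the function name are hypothetical.

```python
import numpy as np

def sqrt_posterior_precision_singular_values(A, noise_var, prior_var):
    """Singular values (descending) of a square root of the posterior
    precision A^T A / noise_var + I / prior_var."""
    precision = A.T @ A / noise_var + np.eye(A.shape[1]) / prior_var
    eigenvalues = np.linalg.eigvalsh(precision)   # ascending, all positive
    return np.sqrt(eigenvalues[::-1])

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))                  # placeholder data matrix
confident = sqrt_posterior_precision_singular_values(A, 1.0, 1e-4)
vague = sqrt_posterior_precision_singular_values(A, 1.0, 1e4)

# Spread of each profile: close to 1 when the prior dominates.
spread_confident = confident[0] / confident[-1]
spread_vague = vague[0] / vague[-1]
```

With a small prior variance the profile is nearly flat and set by the prior, whereas with a large prior variance it follows the singular values of the data matrix, which is consistent with the profiles becoming similar across priors once the prior covariance is increased.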
8 Conclusions
A whole new class of Bayesian priors has been worked out, which could potentially be applied to a variety of different inverse problems. The main focus of this article was on the theoretical foundation, and where possible we discussed practical implementations without going into much detail. Although the curse of dimensionality when considering tensors of large order and dimension can be completely resolved via the corresponding dual problem, the computational complexity can still become prohibitively large with increasing sample size. To tackle this complexity, the possibility of representing the prior mean vector and covariance matrix of these priors as exact low-rank tensor decompositions could be investigated.
Acknowledgments
Many thanks to Frederiek Wesel for valuable discussions and feedback.
References
- [1] J. M. Bardsley, Computational Uncertainty Quantification for Inverse Problems, SIAM, 2018.
- [2] K. Batselier, Low-rank tensor decompositions for nonlinear system identification: A tutorial with examples, IEEE Control Systems Magazine, 42 (2022), pp. 54–74.
- [3] K. Batselier, Z. Chen, and N. Wong, Tensor Network alternating linear scheme for MIMO Volterra system identification, Automatica, 84 (2017), pp. 26–35.
- [4] K. Batselier and N. Wong, A constructive arbitrary-degree Kronecker product decomposition of tensors, Numerical Linear Algebra with Applications, 24 (2017), p. e2097.
- [5] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical computing, SIAM review, 59 (2017), pp. 65–98.
- [6] M. Blondel, M. Ishihata, A. Fujino, and N. Ueda, Polynomial networks and factorization machines: New insights and efficient training algorithms, in International Conference on Machine Learning, PMLR, 2016, pp. 850–858.
- [7] J. Chung and S. Gazzola, Computational Methods for Large-Scale Inverse Problems: A Survey on Hybrid Projection Methods, SIAM Review, 66 (2024), pp. 205–284.
- [8] J. Chung and A. K. Saibaba, Generalized Hybrid Iterative Methods for Large-Scale Bayesian Inverse Problems, SIAM Journal on Scientific Computing, 39 (2017), pp. S24–S46.
- [9] H. F. de Groote, On varieties of optimal algorithms for the computation of bilinear mappings I. The isotropy group of a bilinear mapping, Theoretical Computer Science, 7 (1978), pp. 1–24.
- [10] C. L. Epstein, Introduction to the mathematics of medical imaging, SIAM, 2007.
- [11] D. F. Gleich, L.-H. Lim, and Y. Yu, Multilinear pagerank, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 1507–1541.
- [12] G. H. Golub and C. F. Van Loan, Matrix computations, JHU press, 2013.
- [13] P. C. Hansen, J. G. Nagy, and D. P. O’Leary, Deblurring images: matrices, spectra, and filtering, SIAM, 2006.
- [14] N. Kargas and N. D. Sidiropoulos, Supervised learning and canonical decomposition of multivariate functions, IEEE Transactions on Signal Processing, 69 (2021), pp. 1097–1107.
- [15] C.-Y. Ko, K. Batselier, L. Daniel, W. Yu, and N. Wong, Fast and accurate tensor completion with total variation regularized tensor trains, IEEE Transactions on Image Processing, 29 (2020), pp. 6918–6931.
- [16] T. G. Kolda and B. W. Bader, Tensor decompositions and applications, SIAM review, 51 (2009), pp. 455–500.
- [17] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE, 86 (1998), pp. 2278–2324.
- [18] W. Li and M. K. Ng, On the limiting probability distribution of a transition probability tensor, Linear and Multilinear Algebra, 62 (2014), pp. 362–385.
- [19] J. Liu, P. Musialski, P. Wonka, and J. Ye, Tensor completion for estimating missing values in visual data, IEEE transactions on pattern analysis and machine intelligence, 35 (2012), pp. 208–220.
- [20] N. Mastronardi, P. Lemmerling, and S. Van Huffel, Fast structured total least squares algorithm for solving the basic deconvolution problem, SIAM Journal on Matrix Analysis and Applications, 22 (2000), pp. 533–553.
- [21] A. Novikov, I. Oseledets, and M. Trofimov, Exponential machines, Bulletin of the Polish Academy of Sciences: Technical Sciences, 66 (2018), pp. 789–797.
- [22] G. Pillonetto and G. De Nicolao, A new kernel-based approach for linear system identification, Automatica, 46 (2010), pp. 81–93.
- [23] Project Jupyter, Matthias Bussonnier, Jessica Forde, Jeremy Freeman, Brian Granger, Tim Head, Chris Holdgraf, Kyle Kelley, Gladys Nalvarte, Andrew Osheroff, M. Pacer, Yuvi Panda, Fernando Perez, Benjamin Ragan Kelley, and Carol Willing, Binder 2.0 - Reproducible, interactive, sharable environments for science at scale, in Proceedings of the 17th Python in Science Conference, Fatih Akici, David Lippa, Dillon Niederhut, and M. Pacer, eds., 2018, pp. 113 – 120.
- [24] A. Rahimi and B. Recht, Random features for large-scale kernel machines, Advances in neural information processing systems, 20 (2007).
- [25] S. Särkkä and L. Svensson, Bayesian filtering and smoothing, vol. 17, Cambridge university press, 2023.
- [26] E. Stoudenmire and D. J. Schwab, Supervised learning with tensor networks, Advances in neural information processing systems, 29 (2016).
- [27] J. A. K. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
- [28] F. van der Plas and M. Bocheński, fonsp/pluto.jl: v0.19.42, May 2024.
- [29] C. F. Van Loan, The ubiquitous Kronecker product, Journal of computational and applied mathematics, 123 (2000), pp. 85–100.
- [30] S. Wahls, V. Koivunen, H. V. Poor, and M. Verhaegen, Learning multidimensional Fourier series with tensor trains, in 2014 IEEE Global Conference on Signal and Information Processing (GlobalSIP), IEEE, 2014, pp. 394–398.
- [31] F. Wesel and K. Batselier, Large-Scale Learning with Fourier Features and Tensor Decompositions, Advances in Neural Information Processing Systems, 34 (2021), pp. 17543–17554.
- [32] C. K. Williams and C. E. Rasmussen, Gaussian processes for machine learning, vol. 2, MIT press Cambridge, MA, 2006.