Convolutional Bayesian Filtering

Wenhan Cao, Shiqi Liu, Chang Liu, Zeyu He, Stephen S.-T. Yau, Shengbo Eben Li

School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China; Department of Advanced Manufacturing and Robotics, Peking University, Beijing 100871, China; Department of Mathematical Sciences, Tsinghua University, Beijing 100084, China; Beijing Institute of Mathematical Sciences and Applications (BIMSA), Beijing 101408, China

Abstract

Bayesian filtering serves as the mainstream framework of state estimation in dynamic systems. Its standard version applies the total probability rule and Bayes’ law alternately, where how to define and compute conditional probability is critical to state distribution inference. Previously, the conditional probability is assumed to be exactly known, which represents a measure of the occurrence probability of one event, given the second event. In this paper, we find that by adding an additional event that stipulates an inequality condition, we can transform the conditional probability into a special integration that is analogous to convolution. Based on this transformation, we show that both transition probability and output probability can be generalized to convolutional forms, resulting in a more general filtering framework that we call convolutional Bayesian filtering. This new framework encompasses standard Bayesian filtering as a special case when the distance metric of the inequality condition is selected as the Dirac delta function. It also allows for a more nuanced consideration of model mismatch by choosing different types of inequality conditions. For instance, when the distance metric is defined in a distributional sense, the transition probability and output probability can be approximated by simply rescaling them into fractional powers. Under this framework, a robust version of the Kalman filter can be constructed by only altering the noise covariance matrix, while maintaining the conjugate nature of Gaussian distributions. Finally, we exemplify the effectiveness of our approach by reshaping classic filtering algorithms into convolutional versions, including the Kalman filter, extended Kalman filter, unscented Kalman filter and particle filter.

keywords:

Bayesian filtering; conditional probability; convolution; model mismatch

This work is supported by Tsinghua University Education Foundation fund (042202008). This paper was not presented at any meeting. Corresponding author: Shengbo Eben Li.


1 Introduction

Accurately estimating the state value of dynamic systems is a crucial task in science and engineering, such as robotics, power systems, aerospace engineering, and manufacturing. Since the 1960s, Bayesian filtering has become a principled framework for optimal state estimation. The essence of this framework is to find a balance between uncertain system model and noisy state measurement. Its associated algorithm iteratively updates the probability density function of system state using the prior from the last step and the likelihood of the new observation. Afterward, typical estimation criteria like minimum mean-square error and maximum a posteriori are utilized to acquire the optimal point estimate.

In mathematics, Bayesian filtering relies on two conditional probabilities: transition probability and output probability. The transition probability describes how the system state evolves over time, and the output probability depicts the relationship between noisy measurement and ground truth state. To incorporate the information of those two probabilities, each iteration of Bayes filter is composed of two steps [11, 25]: prediction and update. The prediction step employs total probability rule to integrate the product of transition probability and the state distribution of previous time to obtain the prior distribution. The update step employs Bayes’ law to calculate the posterior distribution by adding information of the current measurement, where the output probability is used as a likelihood term. Since being proposed, Bayesian filtering has become the foundation of optimal filtering algorithms, including the well-known Kalman filter family, particle filter, and variational Bayesian filter.

The origin of optimal filtering theory can be traced back to the early 1940s, marked by the groundbreaking contributions of Norbert Wiener [37] and Kolmogorov [21]. This field reached a significant milestone in 1960 with the invention of the discrete-time Kalman filter [18], followed by its continuous-time version published one year later [19]. Unlike Wiener’s work, which deals with stationary processes in the frequency domain, the Kalman filter addresses dynamic processes in the time domain. The Kalman filter is a direct consequence of applying Bayesian filtering to linear Gaussian systems. In its prediction step, the conditional probability, i.e., transition probability of system model, is Gaussian. Due to the closure property of Gaussian distributions in linear transformation, the resulting expectation after applying total probability rule is also Gaussian, which is referred to as the prior distribution. In the update step, the conditional probability, i.e., output probability of measurement model, is naturally Gaussian. Owing to the conjugacy property of Gaussian distributions, the resulting posterior of Bayes’ law remains Gaussian. Since both the prior and posterior are proven to be Gaussian, the Kalman filter can be solved analytically by solely computing the mean and variance of the Gaussian distribution using closed-form formulas.

When facing nonlinear systems, one big issue of Bayesian filtering is that the closure and conjugacy properties no longer hold. As a result, the calculation of total probability rule and Bayes’ law in almost all nonlinear systems has no analytical solution. So far, several approximation methods have been proposed to replace the accurate calculation of the two rules. A notable early advancement in this area was extended Kalman filter (EKF), pioneered by NASA for spacecraft navigation [33]. This extended filter employs the Taylor series expansion to linearize nonlinear dynamics around the current state. Its state estimation can achieve the so-called first-order accuracy, i.e., EKF is perfect if the dynamics is linear with respect to the state. In contrast, unscented Kalman filter (UKF) employs the unscented transform, a deterministic sampling technique, to achieve a third-order accuracy in approximating nonlinear dynamics for symmetric noise distributions [17]. This technique acquires the transformed distribution by generating a set of sigma points around the mean state estimate and then propagating them through the known nonlinear function, offering the advantage of preserving second-moment information compared to EKF. Obviously, both EKF and UKF implicitly calculate the solution of total probability rule and Bayes’ law using Gaussian distributions. Due to the adoption of approximation techniques, neither of them can offer formal guarantees on the estimation accuracy in highly nonlinear systems.

Instead of approximating the prior and posterior as Gaussian distributions, the particle filter (PF) represents them with a group of particles by the Monte Carlo method [26]. In the prediction step, particles are propagated according to the transition probability of system model to predict the distribution of next state. The resulting particles constitute a discrete approximation of the prior. In the update step, each particle receives a weight related to the output probability of observed data. All the particles are resampled according to their weights, which builds a discrete approximation of the posterior. It has been proven that these kinds of discrete distributions can converge to real distributions as the number of particles becomes sufficiently large. Nevertheless, PF often requires substantial computational resources due to the use of Monte Carlo sampling, which limits its application in many real-time scenarios.

The variational Bayesian filter addresses the intensive computation associated with particles by adopting variational inference as an alternative approximation technique [31, 22]. The prior of its prediction step is assumed to be in a Gaussian form. This assumption is achieved by computing the expectation of a conditional probability using the closure of Gaussian distribution, which is identical to Kalman filter. The update step avoids calculating the computationally intensive integral in the Bayes’ law. Instead, it seeks to numerically minimize the Kullback-Leibler divergence between the proposal distribution and the real posterior. The proposal distribution is often chosen as a parameterized function to obtain an approximate solution of the minimization problem. The real posterior is represented as the product of the prior distribution and the output probability. In general, variational Bayesian filter is computationally beneficial in high-dimensional estimation problems and it has been widely adopted in adaptive filtering applications.

All the state estimation algorithms discussed above adhere strictly to the Bayesian filtering framework. In this framework, the mathematical form of conditional probabilities, including transition probability and output probability, plays an important role in computing total probability rule and Bayes’ law. The standard definition of conditional probability is a measure of the occurrence probability of one event, given the second event. Moreover, its distribution in the whole space is often assumed to be exactly known in Bayes filter design. In this paper, we find that by conditioning on an additional event, which stipulates a distance metric between two observed variables within a specified threshold, one can transform the conditional probability to a special integral form that is similar to convolution operation. This definition relaxes the necessity of information completeness, which allows us to design a more general filter. We define this new probability as convolutional conditional probability.

Based on this definition, the transition probability can be extended to a convolutional form by conditioning on the event that the real state and its virtual state satisfy an inequality condition. The same extension can be applied to the output probability. Collectively, these two extensions forge a generalized filtering framework, which we refer to as convolutional Bayesian filtering. This new framework encompasses standard Bayesian filtering as a special case when the distance metric is set as the Dirac delta function. One of its natural benefits is the capability to explicitly handle mismatches between the mathematical model and the real system by tailoring the distance metric properly.

Under this new framework, we can reformulate nearly all Bayesian filtering algorithms into a more generalized type. Particularly, the convolutional Bayes filter possesses analytical forms of the convolution operation in systems with Gaussian noises, which allows us to design a robust Kalman filter family. For non-Gaussian systems, the convolution operation usually has no analytical form but can be efficiently approximated by a newly proposed exponential density rescaling technique. This technique rescales the transition probability and output probability into their fractional powers when the distance metric is defined in a distributional sense. We further establish the theoretical connection between this approximation technique and information bottleneck theory. It is proven that the fractional power from density rescaling is related to the Lagrange multiplier of an optimization problem whose objective is to modulate the balance between preserving information about the measurement model and squeezing the representation of measurement data.

The remainder of this paper is organized as follows: Section 2 introduces the definition of convolutional conditional probability. Section 3 discusses the framework of convolutional Bayesian filtering. Section 4 introduces an approximation technique for non-Gaussian systems. Section 5 shows the simulation results. Section 6 concludes this paper.

2 Convolutional Conditional Probability

As discussed before, Bayesian filtering is built upon two pillars: total probability rule (1a) and Bayes’ law (1b). Both of them rely on how to handle the conditional probability $p(y|x)$ , which is a measure of the occurrence probability of the event $\left\{\mathbf{y}=y\right\}$ , given the event $\left\{\mathbf{x}=x\right\}$ . This can also be interpreted as the ratio of the probability of both events happening to the probability of the original event. According to this interpretation, the total probability rule is

$$p(y)=\int p(y|x)\,p(x)\,\mathrm{d}x, \tag{1a}$$
and the Bayes’ law is
$$p(x|y)=\frac{p(x)\,p(y|x)}{\int p(x)\,p(y|x)\,\mathrm{d}x}. \tag{1b}$$

Note that we use boldface to denote a random variable, such as $\mathbf{x}$ , and normal font to denote the realization of this variable, such as $x$ . Previously, the explicit form of the conditional probability $p(y|x)$ is assumed known directly in Bayes filter design. One may be interested in what will happen if the conditional probability $p(y|x)$ is unknown. This question helps us to conceive a new definition of conditional probability, i.e., convolutional conditional probability.

Given three random variables $\mathbf{x}$ , $\mathbf{y}$ , and $\mathbf{z}$ , where the information of $p(z|x)$ is known and certain constraints exist between $\mathbf{y}$ and $\mathbf{z}$ , our objective is to compute $p(y|x)$ by leveraging these constraints. In the case that $\mathbf{y}$ and $\mathbf{z}$ are equal, i.e., $\mathbf{y}=\mathbf{z}$ , we have $p(y|x)=p(z|x)$ , as shown in Fig. 1(a). Conversely, if $\mathbf{y}$ is not equal to $\mathbf{z}$ and their difference is bounded by an inequality function, we can define a convolutional version of conditional probability, as shown in Fig. 1(b).

Definition 1 (Convolutional Conditional Probability).

Given $p(z|x)$ , if $\mathbf{y}$ and $\mathbf{z}$ are conditionally independent given $\mathbf{x}$ , then convolutional conditional probability is defined as

$$p_{c}(y|x):=p(y|x,\,d(\mathbf{y},\mathbf{z})\leq\mathbf{r}).$$

Here, $d(\mathbf{y},\mathbf{z})\leq\mathbf{r}$ is the inequality condition, where $\mathbf{r}$ is a threshold random variable with known cumulative distribution function $F(r)$ , and $d(\cdot,\cdot)$ is the distance metric for two random variables. The calculation of convolutional conditional probability $p_{c}(y|x)$ is summarized in the subsequent proposition:

Proposition 1.

The convolutional conditional probability satisfies

$$\begin{aligned}
p_{c}(y|x) &= \frac{\int_{z}\left(1-F(d(y,z))\right)p(z|x)\,\mathrm{d}z}{\int_{y}\int_{z}\left(1-F(d(y,z))\right)p(z|x)\,\mathrm{d}z\,\mathrm{d}y} \\
&\propto \int_{z}\left(1-F(d(y,z))\right)p(z|x)\,\mathrm{d}z.
\end{aligned} \tag{2}$$

Proof.

According to Bayes’ law, we have

$$p_{c}(y|x)=\frac{P(d(\mathbf{y},\mathbf{z})\leq\mathbf{r}|x,y)\,p_{\text{pri}}(y|x)}{\int_{y}P(d(\mathbf{y},\mathbf{z})\leq\mathbf{r}|x,y)\,p_{\text{pri}}(y|x)\,\mathrm{d}y}. \tag{3}$$

In (3), $p_{\text{pri}}(y|x)$ is the prior of $p(y|x)$ . It is chosen as an uninformative probability since we have no knowledge of $p(y|x)$ [13]:

$$p_{\text{pri}}(y|x)=C,\quad C>0. \tag{4}$$

The likelihood term $P(d(\mathbf{y},\mathbf{z})\leq\mathbf{r}|x,y)$ in (3) is simplified as

$$\begin{aligned}
P(d(\mathbf{y},\mathbf{z})\leq\mathbf{r}|x,y) &= P(d(y,\mathbf{z})\leq\mathbf{r}|x,y) \\
&= \int_{z}\left(1-F(d(y,z))\right)p(z|x,y)\,\mathrm{d}z \\
&= \int_{z}\left(1-F(d(y,z))\right)p(z|x)\,\mathrm{d}z.
\end{aligned} \tag{5}$$

Substituting (4) and (5) into (3), we have (2). Note that the final equation of (5) holds due to the conditional independence of $\mathbf{y}$ and $\mathbf{z}$ , given $\mathbf{x}$ . ∎

Remark 1.

The calculation of convolutional conditional probability $p_{c}(y|x)$ resembles convolution operation. In the convolution operation, a kernel is applied over an input space to generate a modified output. Here, $1-F(d(y,z))$ serves as the kernel function, which is a weighting coefficient of $p(z|x)$ based on the distance between $\mathbf{y}$ and $\mathbf{z}$ .
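As a minimal numerical sketch of Proposition 1, the integral in (2) can be approximated by a Riemann sum on a uniform grid. The function name `convolutional_conditional`, the grid, and the example choices $p(z|x)=\mathcal{N}(z;0,1)$, $d(y,z)=(y-z)^{2}$ and $\mathbf{r}\sim\mathrm{Exp}(0.5)$ are assumptions made for illustration only.

```python
import math

def gaussian_pdf(u, mean, var):
    """Density of N(mean, var) evaluated at u."""
    return math.exp(-0.5 * (u - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def convolutional_conditional(y_grid, z_grid, p_z_given_x, survival, distance):
    """Evaluate (2) on uniform grids: weight p(z|x) by the kernel 1 - F(d(y, z)),
    integrate over z, then normalize over y."""
    dz = z_grid[1] - z_grid[0]
    dy = y_grid[1] - y_grid[0]
    unnormalized = [
        sum(survival(distance(y, z)) * p for z, p in zip(z_grid, p_z_given_x)) * dz
        for y in y_grid
    ]
    total = sum(unnormalized) * dy
    return [u / total for u in unnormalized]

# Hypothetical example: p(z|x) = N(z; 0, 1), d(y, z) = (y - z)^2, r ~ Exp(lam).
lam = 0.5
step = 0.05
grid = [-10.0 + step * i for i in range(401)]
p_z = [gaussian_pdf(z, 0.0, 1.0) for z in grid]
p_c = convolutional_conditional(
    grid, grid, p_z,
    survival=lambda d: math.exp(-lam * d),   # 1 - F(d) for an Exp(lam) threshold
    distance=lambda y, z: (y - z) ** 2,
)
mass = sum(p_c) * step
variance = sum(y * y * p for y, p in zip(grid, p_c)) * step
```

With these choices the kernel is an unnormalized Gaussian in $y-z$, so the variance of $p_{c}(y|x)$ computed on the grid should be close to $1+1/(2\lambda)$, i.e., wider than that of $p(z|x)$.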

We want to emphasize that $p(y|x)$ , as used in (1a) and (1b), can take the form of any conditional probability. Actually, $p_{c}(y|x)$ represents a specific form of conditional probability, which measures the probability of the event $\{\mathbf{y}=y\}$ , conditioned on two events $\{\mathbf{x}=x\}$ and $\{d(\mathbf{y},\mathbf{z})\leq\mathbf{r}\}$ . Compared to the standard definition, convolutional conditional probability involves a third event that bounds the distance between two random variables. Under this new definition, one can substitute $p(y|x)$ with $p_{c}(y|x)$ in (1a) and (1b) to construct two new rules:


$$p(y)=\int_{x}p_{c}(y|x)\,p(x)\,\mathrm{d}x, \tag{6a}$$
$$p(x|y)=\frac{p(x)\,p_{c}(y|x)}{\int_{x}p(x)\,p_{c}(y|x)\,\mathrm{d}x}. \tag{6b}$$

Here, (6a) can be regarded as a generalized total probability rule and (6b) can be regarded as a generalized Bayes’ law. Note that $p(y)$ and $p(x|y)$ in (6) are distinct from those notations in (1) because they implicitly condition on the third event $\{d(\mathbf{y},\mathbf{z})\leq\mathbf{r}\}$ , while this event in (1) is reduced to $\{\mathbf{y}=\mathbf{z}\}$ . That is to say, an equivalence event is implicitly conditioned in standard definition (1). It can be proven that total probability rule (1a) and Bayes’ law (1b) are the special cases of (6a) and (6b), as described in Proposition 2.

Proposition 2 (Limiting Property).

Let $\mathbf{r}$ follow an exponential distribution, i.e., $\mathbf{r}\sim\mathrm{Exp}(\lambda)$ , and let $d(y,z)=\|y-z\|^{2}/\sigma^{2}$ , where $\lambda>0$ is the exponential parameter and $\sigma>0$ is the scale parameter. As $\sigma\to 0$ , $p_{c}(y|x)$ reduces to $p(y|x)$ , (6a) reduces to total probability rule (1a), and (6b) reduces to Bayes’ law (1b).

Proof.

To simplify the derivation, let us define

$$\eta:=-\frac{\lambda\|y-z\|^{2}}{\sigma^{2}},\quad\gamma:=\frac{\lambda^{\frac{n}{2}}}{\left(\pi\sigma^{2}\right)^{\frac{n}{2}}},$$

where $n$ is the dimension of $y$ or $z$ . From Proposition 1, we have

$$\begin{aligned}
p_{c}(y|x) &= \frac{\int_{z}e^{\eta}p(z|x)\,\mathrm{d}z}{\int_{y}\int_{z}e^{\eta}p(z|x)\,\mathrm{d}z\,\mathrm{d}y} \\
&= \frac{\gamma\int_{z}e^{\eta}p(z|x)\,\mathrm{d}z}{\int_{y}\gamma\int_{z}e^{\eta}p(z|x)\,\mathrm{d}z\,\mathrm{d}y}.
\end{aligned}$$

As $\sigma\to 0$ , $\gamma e^{\eta}$ , which is a normalized Gaussian density in $z$ with mean $y$ , converges to the Dirac delta function of $z$ centered at $y$ . Thus, we have

$$\begin{aligned}
\lim_{\sigma\to 0}p_{c}(y|x) &= \lim_{\sigma\to 0}\frac{\int_{z}p(z|x)\delta(z-y)\,\mathrm{d}z}{\int_{y}\int_{z}p(z|x)\delta(z-y)\,\mathrm{d}z\,\mathrm{d}y} \\
&= p(y|x).
\end{aligned} \tag{7}$$

As a result, $p_{c}(y|x)$ converges to $p(y|x)$ as $\sigma$ approaches 0, and accordingly, (6a) and (6b) converge to (1a) and (1b), respectively. ∎

Proposition 2 elucidates that the kernel function converges to a Dirac delta function at $y$ as the scale parameter tends to zero. As a result, the convolutional conditional probability reduces to its standard version. For finite values of scale parameter, this new definition considers a controllable amount of uncertainty governed by the scale parameter, offering an extension to the previous one.
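The limiting behavior can be checked numerically. The sketch below, in which the grid, the choice $p(z|x)=\mathcal{N}(z;0,1)$, $\lambda=1$ and the helper name `conv_conditional_1d` are illustrative assumptions, evaluates $p_{c}(y|x)$ for a decreasing sequence of scale parameters and records its $L_{1}$ distance to $p(y|x)$.

```python
import math

def conv_conditional_1d(grid, p_z, lam, sigma):
    """p_c(y|x) from (2) with d(y, z) = (y - z)^2 / sigma^2 and r ~ Exp(lam)."""
    h = grid[1] - grid[0]
    raw = [
        sum(math.exp(-lam * (y - z) ** 2 / sigma ** 2) * p for z, p in zip(grid, p_z)) * h
        for y in grid
    ]
    total = sum(raw) * h
    return [v / total for v in raw]

h = 0.04
grid = [-8.0 + h * i for i in range(401)]
# Assumed nominal density p(z|x) = N(z; 0, 1); in the limit, p(y|x) has the same form.
p = [math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) for z in grid]

l1_errors = []
for sigma in (2.0, 1.0, 0.5, 0.2):
    p_c = conv_conditional_1d(grid, p, lam=1.0, sigma=sigma)
    # L1 distance between the convolutional and the standard conditional density
    l1_errors.append(sum(abs(a - b) for a, b in zip(p_c, p)) * h)
```

In this Gaussian example $p_{c}(y|x)$ is Gaussian with variance $1+\sigma^{2}/(2\lambda)$, so the recorded distances should shrink monotonically as $\sigma$ decreases. The smallest $\sigma$ is kept well above the grid spacing so that the kernel remains resolved by the quadrature.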

3 Convolutional Bayesian Filtering

In this section, we demonstrate how model mismatches in the filtering problem can be explicitly addressed using convolutional conditional probability. This is achieved by conditioning on an additional event representing the error bound between the system model and the real system. Further, by substituting the total probability rule and Bayes’ law with (6a) and (6b) in the Bayes filter, we can establish a more generalized filtering framework called convolutional Bayesian filtering.

3.1 Uncertain Hidden Markov Model with Model Mismatch

The essence of Bayesian filtering is to find a balance between the stochasticities of state transition and state observation. The stochasticity of the former comes from inherent randomness in the environment dynamics, while that of the latter comes from sensor noises. These two stochastic processes are typically represented by a hidden Markov model (HMM):

$$\begin{aligned}
\mathbf{x}_{0} &\sim p_{0}(x_{0}), \\
\mathbf{x}_{t} &\sim p(x_{t}|x_{t-1}), \\
\mathbf{y}_{t} &\sim p(y_{t}|x_{t}).
\end{aligned} \tag{8}$$

Here, $\mathbf{x}_{t}\in\mathbb{R}^{n}$ is the system state and $\mathbf{y}_{t}\in\mathbb{R}^{m}$ is the corresponding measurement. Besides, $p_{0}$ denotes the probability distribution of the initial state $\mathbf{x}_{0}$ , $p(x_{t}|x_{t-1})$ represents the transition probability, and $p(y_{t}|x_{t})$ is the output probability.

The standard HMM implicitly assumes that the real system is perfectly modelled, i.e., (8) is an exact description of system dynamics and measurement sensors. However, perfect information about transition or output probabilities is often unattainable due to parametric variation, unmodeled dynamics or external disturbances. In other words, there must be some model errors in engineering practice. This error can lead to significant accuracy degradation in state estimation if not properly considered. To build an HMM with model mismatch, we have to distinguish two kinds of states: the real state and the virtual state. The former is an accurate yet unattainable description of the system. The latter is an artificial construct generated by nominal models and does not exist in the physical world. We use $\bar{\mathbf{x}}_{t}$ to represent the virtual state and ${\mathbf{x}}_{t}$ to represent the real state. Likewise, the real measurement is denoted as $\mathbf{y}_{t}$ , and its virtual counterpart, which is generated by the nominal output probability, is denoted as $\bar{\mathbf{y}}_{t}$ . The HMM with model mismatch is depicted in Fig. 2 and outlined in (9) as follows:


$$\begin{aligned}
&\mathbf{x}_{0}\sim p_{0}(x_{0}), \\
&\bar{\mathbf{x}}_{t}\sim p(\bar{x}_{t}|x_{t-1}), \\
&\bar{\mathbf{y}}_{t}\sim p(\bar{y}_{t}|x_{t}),
\end{aligned} \tag{9a}$$
$$\begin{aligned}
&d(\mathbf{x}_{t},\bar{\mathbf{x}}_{t})\leq\mathbf{r}_{x}, \\
&d(\mathbf{y}_{t},\bar{\mathbf{y}}_{t})\leq\mathbf{r}_{y}.
\end{aligned} \tag{9b}$$

Here, $p(\bar{x}_{t}|{x}_{t-1})$ and $p(\bar{y}_{t}|{x}_{t})$ denote the nominal transition probability and nominal output probability respectively; $\mathbf{r}_{x}$ and $\mathbf{r}_{y}$ are the threshold random variables depicting the upper bound of model mismatch, with their distributions characterized by the cumulative distribution functions $F_{x}$ and $F_{y}$ , respectively. Importantly, $\mathbf{r}_{x}$ is assumed to be independent of both $\mathbf{x}_{t}$ and $\bar{\mathbf{x}}_{t}$ , and a similar independence assumption is made for $\mathbf{r}_{y}$ relative to $\mathbf{y}_{t}$ and $\bar{\mathbf{y}}_{t}$ . This new form in (9) is called uncertain hidden Markov model.

It is crucial to differentiate between the concepts of system stochasticity (see (9a)) and model mismatch (see (9b)). The distinction hinges on the presence of known mathematical forms. System stochasticity can be accurately modeled using explicit distributions with associated parameters. In contrast, model mismatch refers to the inherent limitations and uncertainties in a model’s ability to represent the real system. Typically, this can only be quantified by an upper bound that reflects the extent to which the model deviates from reality. If we can only acquire the bound of system stochasticity, it inherently becomes a special case of model mismatch. Conversely, if the distribution of model mismatch is determined, it then becomes part of the system’s stochasticity. This distinction is pivotal in understanding and building uncertain HMM.

Remark 2.

When the real system is perfectly modelled, i.e., $\mathbf{x}_{t}=\bar{\mathbf{x}}_{t}$ , $\mathbf{y}_{t}=\bar{\mathbf{y}}_{t}$ , we have $p(\bar{x}_{t}|x_{t-1})=p(x_{t}|x_{t-1})$ and $p(\bar{y}_{t}|x_{t})=p(y_{t}|x_{t})$ . In this case, the uncertain HMM (9) reduces to the standard HMM (8).

Remark 3.

The two nominal probabilities $p(\bar{x}_{t}|x_{t-1})$ and $p(\bar{y}_{t}|x_{t})$ can be written as a state space model (SSM):

$$\begin{aligned}
\bar{\mathbf{x}}_{t+1} &= \bar{f}(\mathbf{x}_{t},\bar{\mathbf{\xi}}_{t}), \\
\bar{\mathbf{y}}_{t} &= \bar{g}(\mathbf{x}_{t},\bar{\mathbf{\zeta}}_{t}).
\end{aligned} \tag{10}$$

Here, $\bar{f}$ is nominal transition model, $\bar{g}$ is nominal measurement model, $\bar{\xi}_{t}$ is virtual process noise, and $\bar{\zeta}_{t}$ is virtual measurement noise.

In essence, HMM and SSM are just different representations of the same system model, and they can be converted into each other [24, 8]. For example, consider the HMM’s nominal transition probability expressed as $p(\bar{x}_{t}|x_{t-1})=\mathcal{N}(\bar{x}_{t};Ax_{t-1},Q)$ . This can be equivalently represented in the SSM format as $\bar{\mathbf{x}}_{t}=A\mathbf{x}_{t-1}+\bar{\mathbf{\xi}}_{t-1}$ , where $\bar{\mathbf{\xi}}_{t-1}\sim\mathcal{N}(\bar{\mathbf{\xi}}_{t-1};0,Q)$ with $Q$ denoting the covariance matrix of the virtual process noise. Similarly, an HMM with a nominal transition probability defined by a Laplace distribution, $p(\bar{x}_{t}|x_{t-1})=\mathrm{Laplace}(\bar{x}_{t};Ax_{t-1},Q)$ , can be represented in SSM format as $\bar{\mathbf{x}}_{t}=A\mathbf{x}_{t-1}+\bar{\mathbf{\xi}}_{t-1}$ , with $\bar{\mathbf{\xi}}_{t-1}$ following the Laplace distribution $\mathrm{Laplace}(\bar{\mathbf{\xi}}_{t-1};0,Q)$ .
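The equivalence of the two representations in the Gaussian example can be illustrated with a short scalar sketch: the SSM form is used to draw samples of the virtual state, and the HMM form is used to evaluate its density. The scalar values chosen for $A$, $Q$ and $x_{t-1}$, as well as the function names, are hypothetical.

```python
import math
import random

def sample_nominal_ssm(x_prev, a, q, rng):
    """SSM form: x_bar_t = a * x_{t-1} + xi_{t-1}, with xi_{t-1} ~ N(0, q)."""
    return a * x_prev + rng.gauss(0.0, math.sqrt(q))

def nominal_transition_pdf(x_bar, x_prev, a, q):
    """HMM form: p(x_bar_t | x_{t-1}) = N(x_bar_t; a * x_{t-1}, q)."""
    return math.exp(-0.5 * (x_bar - a * x_prev) ** 2 / q) / math.sqrt(2.0 * math.pi * q)

rng = random.Random(0)          # fixed seed, for reproducibility only
a, q, x_prev = 0.9, 0.25, 2.0   # hypothetical scalar values of A, Q and x_{t-1}
samples = [sample_nominal_ssm(x_prev, a, q, rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance obtained from the SSM form should be close to the mean $Ax_{t-1}$ and variance $Q$ of the Gaussian density in the HMM form.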

3.2 Filtering Algorithm

When the system is perfectly modelled as in (8), Bayesian filtering serves as an ideal framework to calculate the posterior of system state by iteratively performing (11a) and (11b):


$$p(x_{t}|y_{1:t-1})=\int p(x_{t}|x_{t-1})\,p(x_{t-1}|y_{1:t-1})\,\mathrm{d}x_{t-1}, \tag{11a}$$
$$p(x_{t}|y_{1:t})=\frac{p(x_{t}|y_{1:t-1})\,p(y_{t}|x_{t})}{\int p(x_{t}|y_{1:t-1})\,p(y_{t}|x_{t})\,\mathrm{d}x_{t}}. \tag{11b}$$

Here, $p(x_{t}|y_{1:t})$ is the posterior, $p(x_{t}|y_{1:t-1})$ is the prior. As a tradition, (11a) is called prediction and (11b) is called update. These two equations originate from total probability rule and Bayes’ law respectively. In fact, almost all existing Bayesian filtering algorithms adhere to this framework, with their differences lying in how these two equations are calculated.
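For a scalar state, the recursion (11a) and (11b) can be sketched by direct quadrature on a grid. This is only an illustration and not one of the filtering algorithms discussed in this paper; the linear Gaussian model, its parameter values, the measurement value and the function names are all assumptions.

```python
import math

def normal_pdf(u, mean, var):
    """Density of N(mean, var) evaluated at u."""
    return math.exp(-0.5 * (u - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def predict(grid, posterior, transition_pdf):
    """(11a): integrate p(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) over x_{t-1}."""
    h = grid[1] - grid[0]
    return [sum(transition_pdf(x, xp) * w for xp, w in zip(grid, posterior)) * h
            for x in grid]

def update(grid, prior, likelihood):
    """(11b): multiply the prior by p(y_t | x_t) and normalize over x_t."""
    h = grid[1] - grid[0]
    raw = [pr * likelihood(x) for x, pr in zip(grid, prior)]
    total = sum(raw) * h
    return [v / total for v in raw]

h = 0.05
grid = [-10.0 + h * i for i in range(401)]
a, q, r, y_obs = 0.9, 0.5, 1.0, 1.2   # hypothetical model parameters and measurement
posterior_prev = [normal_pdf(x, 0.0, 1.0) for x in grid]   # p(x_{t-1} | y_{1:t-1})
prior = predict(grid, posterior_prev, lambda x, xp: normal_pdf(x, a * xp, q))
posterior = update(grid, prior, lambda x: normal_pdf(y_obs, x, r))
posterior_mean = sum(x * p for x, p in zip(grid, posterior)) * h
```

Because this example is linear and Gaussian, the grid posterior mean should agree closely with the Kalman filter value $K y_{t}$ with $K=P^{-}/(P^{-}+R)$ and $P^{-}=a^{2}+Q$, which offers a simple sanity check of the two steps.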

When considering the HMM with model mismatch (9), we need to shift the mathematical foundation to convolutional rules, as demonstrated in (6). First, let us redefine some core probability distributions in Bayesian filtering, including posterior distribution, prior distribution, transition probability, and output probability. The redefinition relies on the uncertain HMM:

$$\begin{aligned}
p_{c}(x_{t}|y_{1:t}) &:= p(x_{t}|y_{1:t},\,d_{x}(\mathbf{x}_{i},\bar{\mathbf{x}}_{i})\leq\mathbf{r}_{x},\,d_{y}(\mathbf{y}_{i},\bar{\mathbf{y}}_{i})\leq\mathbf{r}_{y}), \\
p_{c}(x_{t}|y_{1:t-1}) &:= p(x_{t}|y_{1:t-1},\,d_{x}(\mathbf{x}_{i},\bar{\mathbf{x}}_{i})\leq\mathbf{r}_{x},\,d_{y}(\mathbf{y}_{j},\bar{\mathbf{y}}_{j})\leq\mathbf{r}_{y}), \\
p_{c}(x_{t}|x_{t-1}) &:= p(x_{t}|x_{t-1},\,d_{x}(\mathbf{x}_{t},\bar{\mathbf{x}}_{t})\leq\mathbf{r}_{x}), \\
p_{c}(y_{t}|x_{t}) &:= p(y_{t}|x_{t},\,d_{y}(\mathbf{y}_{t},\bar{\mathbf{y}}_{t})\leq\mathbf{r}_{y}), \\
i &= 1,2,\ldots,t,\quad j=1,2,\ldots,t-1.
\end{aligned} \tag{12}$$

Here, $p_{c}(x_{t}|y_{1:t})$ , $p_{c}(x_{t}|y_{1:t-1})$ , $p_{c}(x_{t}|x_{t-1})$ and $p_{c}(y_{t}|x_{t})$ are convolutional distributions, each corresponding to their respective physical meanings. Then, we will illustrate how to use these definitions in (12) to derive a new Bayesian filtering framework to handle model mismatch. We begin with the assumption of conditional independence:

Assumption 1 (Conditional Independence).

$\mathbf{x}_{t}$ and $\bar{\mathbf{x}}_{t}$ are conditionally independent given $\mathbf{x}_{t-1}$ , i.e.,

$$p(\bar{x}_{t}|x_{t},x_{t-1})=p(\bar{x}_{t}|x_{t-1}).$$

Besides, $\mathbf{y}_{t}$ and $\bar{\mathbf{y}}_{t}$ are conditionally independent given $\mathbf{x}_{t}$ .

This assumption suggests that the virtual state can be inferred directly from the past state, without additional information from the current state. Also, the virtual measurement can be inferred directly from the state, without additional information from the real measurement. This assumption originates from the philosophical belief that the physical world and the modeling of a system do not influence each other at any given moment; that is, the act of modeling does not affect the system in the physical world, nor does the physical system influence its nominal model. This principle is crucial for estimating the transition and output probabilities using their nominal models. Under the assumption of conditional independence, we can obtain the main result of the paper:

Theorem 1 (Convolutional Bayesian filtering).

Under Assumption 1, the convolutional Bayesian filtering is calculated recursively by (13a) and (13b):


$$p_{c}(x_{t}|y_{1:t-1})=\int p_{c}(x_{t}|x_{t-1})\,p_{c}(x_{t-1}|y_{1:t-1})\,\mathrm{d}x_{t-1}, \tag{13a}$$
$$p_{c}(x_{t}|y_{1:t})=\frac{p_{c}(x_{t}|y_{1:t-1})\,p_{c}(y_{t}|x_{t})}{\int p_{c}(x_{t}|y_{1:t-1})\,p_{c}(y_{t}|x_{t})\,\mathrm{d}x_{t}}. \tag{13b}$$

Here, the convolutional transition probability is


$$p_{c}(x_{t}|x_{t-1})=\frac{\int_{\bar{x}_{t}}\left(1-F_{x}(d_{x}(x_{t},\bar{x}_{t}))\right)p(\bar{x}_{t}|x_{t-1})\,\mathrm{d}\bar{x}_{t}}{\int_{x_{t}}\int_{\bar{x}_{t}}\left(1-F_{x}(d_{x}(x_{t},\bar{x}_{t}))\right)p(\bar{x}_{t}|x_{t-1})\,\mathrm{d}\bar{x}_{t}\,\mathrm{d}x_{t}}, \tag{14a}$$
and the convolutional output probability is
$$p_{c}(y_{t}|x_{t})=\frac{\int_{\bar{y}_{t}}\left(1-F_{y}(d_{y}(y_{t},\bar{y}_{t}))\right)p(\bar{y}_{t}|x_{t})\,\mathrm{d}\bar{y}_{t}}{\int_{y_{t}}\int_{\bar{y}_{t}}\left(1-F_{y}(d_{y}(y_{t},\bar{y}_{t}))\right)p(\bar{y}_{t}|x_{t})\,\mathrm{d}\bar{y}_{t}\,\mathrm{d}y_{t}}. \tag{14b}$$

Proof.

The transition and output probabilities in (8) are standard conditional probabilities. By leveraging the definition of convolutional conditional probability, we can derive the convolutional forms of (8) as in (14). Consequently, their convolutional counterparts are referred to as the convolutional transition probability (14a) and the convolutional output probability (14b), respectively. Then by utilizing (6), we can deduce convolutional Bayesian filtering (13). ∎

Remark 4.

In the iterative process, (13a) and (13b) resemble the prediction and update steps of Bayesian filtering, respectively. The only difference is that all the probability distributions are transformed into their convolutional counterparts. Therefore, we refer to this iterative process as convolutional Bayesian filtering.
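To make the update step (13b) concrete for a case without a closed form, the sketch below evaluates the convolutional output probability (14b) by quadrature and then normalizes the product with a prior. Everything specific here is an assumption for illustration: a scalar state, a Laplace nominal output probability with scale $b$, the distance $d_{y}(y,\bar{y})=|y-\bar{y}|$, a threshold $\mathbf{r}_{y}\sim\mathrm{Exp}(\beta)$, a standard Gaussian prior, and the function names.

```python
import math

def conv_output_likelihood(y_obs, x_grid, ybar_grid, nominal_pdf, survival, distance):
    """(14b) evaluated at y_t = y_obs for every x_t on x_grid."""
    h = ybar_grid[1] - ybar_grid[0]
    # Kernel mass over y for each y_bar (inner part of the denominator of (14b));
    # the y-integral reuses ybar_grid as its quadrature grid.
    mass = [sum(survival(distance(y, yb)) for y in ybar_grid) * h for yb in ybar_grid]
    lik = []
    for x in x_grid:
        dens = [nominal_pdf(yb, x) for yb in ybar_grid]
        num = sum(survival(distance(y_obs, yb)) * p for yb, p in zip(ybar_grid, dens)) * h
        den = sum(m * p for m, p in zip(mass, dens)) * h
        lik.append(num / den)
    return lik

def conv_update(x_grid, prior, lik):
    """(13b): multiply the convolutional prior by p_c(y_t | x_t) and normalize."""
    h = x_grid[1] - x_grid[0]
    raw = [p * l for p, l in zip(prior, lik)]
    total = sum(raw) * h
    return [v / total for v in raw]

hx, hy = 0.1, 0.1
x_grid = [-6.0 + hx * i for i in range(121)]
ybar_grid = [-12.0 + hy * i for i in range(241)]
b, beta, y_obs = 0.5, 2.0, 1.0   # hypothetical Laplace scale, Exp rate, measurement
lik = conv_output_likelihood(
    y_obs, x_grid, ybar_grid,
    nominal_pdf=lambda yb, x: math.exp(-abs(yb - x) / b) / (2.0 * b),
    survival=lambda d: math.exp(-beta * d),      # 1 - F_y(d) for Exp(beta)
    distance=lambda y, yb: abs(y - yb),
)
prior = [math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi) for x in x_grid]
posterior = conv_update(x_grid, prior, lik)
posterior_mass = sum(posterior) * hx
posterior_mean = sum(x * p for x, p in zip(x_grid, posterior)) * hx
```

The denominator of (14b) is computed for each $x_{t}$ rather than dropped, since in general it may depend on the state. The prediction step (13a) can be discretized in the same way by applying the kernel to the nominal transition probability as in (14a).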

3.3 Analytical Form in Gaussian Cases

A major challenge in convolutional Bayesian filtering is the difficulty in computing the integrals in (14a) and (14b), as their analytical solutions generally do not exist. However, an exceptional case arises when distance metrics are represented as quadratic forms, threshold distributions are chosen as exponential distributions, and virtual noises are characterized as additive Gaussian. In this specific case, it is possible to derive an analytical version of convolutional Bayesian filtering.

Corollary 1.

Consider the following nominal system model

	$\displaystyle p(\bar{x}_{t}\|x_{t-1})$	$\displaystyle=\mathcal{N}(\bar{x}_{t};f(x_{t-1}),Q),$		(15)
	$\displaystyle p(\bar{y}_{t}\|x_{t})$	$\displaystyle=\mathcal{N}(\bar{y}_{t};g(x_{t}),R).$		(15)

If $d_{x}(\mathbf{x},\bar{\mathbf{x}})=\|\mathbf{x}-\bar{\mathbf{x}}\|^{2}$ , $d_{y}(\mathbf{y},\bar{\mathbf{y}})=\|\mathbf{y}-\bar{\mathbf{y}}\|^{2}$ , $\mathbf{r}_{x}\sim\mathrm{Exp}(\alpha)$ and $\mathbf{r}_{y}\sim\mathrm{Exp}(\beta)$ with $\alpha,\beta>0$ being exponential coefficients, we have


$\displaystyle p_{c}(x_{t}\|x_{t-1})$	$\displaystyle=\mathcal{N}(x_{t};f(x_{t-1}),Q+1/(2\alpha)\cdot I_{n\times n}),$	(16a)
$\displaystyle p_{c}(y_{t}\|x_{t})$	$\displaystyle=\mathcal{N}(y_{t};g(x_{t}),R+1/(2\beta)\cdot I_{m\times m}).$	(16b)

Proof.

The proof is provided only for the first part, namely, proving the analytical form of $p_{c}(x_{t}|x_{t-1})$ in (16a). The second part (16b) can be proved in a similar manner. According to (14a), we have

		$\displaystyle p_{c}(x_{t}\|x_{t-1})$
	$\displaystyle\propto$	$\displaystyle\int_{\bar{x}_{t}}\left(1-F_{x}(d_{x}(x_{t},\bar{x}_{t}))\right)p(\bar{x}_{t}\|x_{t-1})\,\mathrm{d}\bar{x}_{t}$
	$\displaystyle\propto$	$\displaystyle\int_{\bar{x}_{t}}e^{-\alpha\\|x_{t}-\bar{x}_{t}\\|^{2}}e^{-\frac{1}{2}\left\\|\bar{x}_{t}-f(x_{t-1})\right\\|^{2}_{Q^{-1}}}\,\mathrm{d}\bar{x}_{t}.$

By completing the square, we have

	$\displaystyle\alpha\\|x_{t}-\bar{x}_{t}\\|^{2}+\frac{1}{2}\left(\bar{x}_{t}-f(x_{t-1})\right)^{\top}Q^{-1}\left(\bar{x}_{t}-f(x_{t-1})\right)$	(17)
$\displaystyle=$	$\displaystyle\frac{1}{2}\Big{(}\bar{x}_{t}^{\top}(2\alpha I_{n\times n}+Q^{-1})\bar{x}_{t}$
	$\displaystyle-2\bar{x}_{t}^{\top}(2\alpha I_{n\times n}x_{t}+Q^{-1}f(x_{t-1}))+C\Big{)}$
$\displaystyle=$	$\displaystyle\frac{1}{2}\Big{(}\\|\bar{x}_{t}-(2\alpha I_{n\times n}$
	$\displaystyle+Q^{-1})^{-1}(2\alpha x_{t}+Q^{-1}f(x_{t-1}))\\|^{2}_{2\alpha I_{n\times n}+Q^{-1}}+C\Big{)},$

where $C$ collects the terms that do not depend on $\bar{x}_{t}$ . Substituting (17) into the exponent of the integrand and integrating over $\bar{x}_{t}$ gives a result proportional to

e^{-\frac{1}{2}(x_{t}-f(x_{t-1}))^{\top}(1/(2\alpha)\cdot I_{n\times n}+Q)^{-1}(x_{t}-f(x_{t-1}))},

where $1/(2\alpha)\cdot I_{n\times n}+Q$ is the covariance matrix of the convolutional transition probability and $f(x_{t-1})$ is its mean. This results in an analytical form of the convolutional transition probability:

p_{c}(x_{t}|x_{t-1})=\mathcal{N}(x_{t};f(x_{t-1}),Q+1/(2\alpha)\cdot I_{n\times n}).

∎
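The closed form (16a) can be checked numerically in one dimension. The following is a minimal sketch, not part of the original derivation: it evaluates the numerator of (14a) by a Riemann sum for a scalar state, with illustrative values of the mean, $Q$ and $\alpha$ chosen here as assumptions, and compares the normalized result with $\mathcal{N}(x;f,Q+1/(2\alpha))$ .

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def convolutional_transition_1d(x, mean, q, alpha, grid):
    """Numerically evaluate (14a) for d(x, xbar) = (x - xbar)^2 and r ~ Exp(alpha).

    For an exponential threshold, 1 - F(d) = exp(-alpha * d). The integral over
    xbar is approximated by a Riemann sum on `grid`; the result is then
    normalised over the evaluation points `x`.
    """
    d_grid = grid[1] - grid[0]
    survival = np.exp(-alpha * (x[:, None] - grid[None, :]) ** 2)
    unnormalised = survival @ gaussian_pdf(grid, mean, q) * d_grid
    return unnormalised / (unnormalised.sum() * (x[1] - x[0]))

# Illustrative (assumed) values: f(x_{t-1}) = 0.5, Q = 1, alpha = 0.25.
mean, q, alpha = 0.5, 1.0, 0.25
grid = np.linspace(-30.0, 30.0, 3001)   # integration grid for xbar
x = np.linspace(-30.0, 30.0, 601)       # evaluation points for x_t

numeric = convolutional_transition_1d(x, mean, q, alpha, grid)
closed_form = gaussian_pdf(x, mean, q + 1.0 / (2.0 * alpha))
max_abs_error = float(np.max(np.abs(numeric - closed_form)))
print(max_abs_error)
```

With these values the inflated variance is $Q+1/(2\alpha)=3$ , and the two curves should agree up to discretization error.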

This corollary shows that, with quadratic distance metrics and exponential threshold variables, the covariance matrix of the convolutional transition probability for system (15) is the nominal covariance matrix plus a constant matrix determined by the exponential coefficient. As the exponential coefficient increases, the exponential distribution becomes more concentrated, with its mean and variance tending towards zero. This implies that the uncertain HMM becomes increasingly deterministic. As the exponential coefficient tends to infinity, the effect of model mismatch vanishes, and the convolutional transition probability $p_{c}(x_{t}|x_{t-1})$ reduces to the nominal transition probability $p(x_{t}|x_{t-1})$ . The same analysis applies to the convolutional output probability and the nominal output probability.

For the linear Gaussian case, where the system in (15) satisfies $f(x_{t-1})=Ax_{t-1}$ and $g(x_{t})=Cx_{t}$ , standard Bayesian filtering simplifies to the canonical Kalman filter. By Corollary 1, the canonical Kalman filter can be transformed into its convolutional version simply by replacing the covariance matrix of process noise $Q$ with $Q+1/(2\alpha)\cdot I_{n\times n}$ , and the covariance matrix of measurement noise $R$ with $R+1/(2\beta)\cdot I_{m\times m}$ .
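The covariance replacement described above can be sketched as a single predict/update cycle. This is a minimal illustration assuming a textbook Kalman recursion; the function name `conv_kf_step` and the example measurement are hypothetical, while the matrices are those of the Wiener velocity model in Section 5.1.

```python
import numpy as np

def conv_kf_step(m, P, y, A, C, Q, R, alpha, beta):
    """One predict/update cycle of a Kalman filter with inflated covariances.

    Following Corollary 1, Q is replaced by Q + I/(2*alpha) and R by
    R + I/(2*beta); everything else is the standard Kalman recursion.
    """
    n, k = A.shape[0], C.shape[0]
    Q_c = Q + np.eye(n) / (2.0 * alpha)
    R_c = R + np.eye(k) / (2.0 * beta)

    # Prediction step
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q_c

    # Update step
    S = C @ P_pred @ C.T + R_c
    K = np.linalg.solve(S, C @ P_pred).T   # K = P_pred C^T S^{-1} (S, P_pred symmetric)
    m_new = m_pred + K @ (y - C @ m_pred)
    P_new = (np.eye(n) - K @ C) @ P_pred
    return m_new, P_new

# Wiener velocity model matrices from Section 5.1.
A = np.array([[1, 0, 0.1, 0], [0, 1, 0, 0.1], [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
C = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
Q, R = np.eye(4), np.eye(2)
m0, P0 = np.array([0.0, 0.0, 1.0, 1.0]), np.eye(4)
y = np.array([0.3, -0.2])   # hypothetical measurement, for illustration only

m1, P1 = conv_kf_step(m0, P0, y, A, C, Q, R, alpha=0.05, beta=0.05)
```

Because only $Q$ and $R$ change, the computational cost per step is the same as for the canonical Kalman filter, and letting $\alpha,\beta\to\infty$ recovers it.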

Remark 5.

The resulting method is an outlier-robust Kalman filter (KF), which we name the convolutional KF (ConvKF). Unlike the robust regression KF that employs Huber loss [14] or correntropy loss [9, 35, 34], and the student-t KF [30, 2] designed for handling non-Gaussian heavy-tailed distributions, ConvKF offers several benefits. First, it quantitatively considers the impact of model mismatch with a clear probabilistic meaning. Second, it preserves the original structure of the KF, maintaining the conjugate nature of Gaussian distributions without increasing the computational burden. Third, it agrees with well-established engineering practice for the KF, as discussed in Chapter 6.1 of [4] and Theorem 7.6 of [15]: if the modeling of noise covariance is imprecise, it is common practice to opt for a larger covariance in the KF. This treatment is proven to preserve stability, albeit resulting in a more conservative filter.

4 Approximation via Exponential Density Rescaling

Except for the Gaussian case addressed in Corollary 1, the convolutional conditional probabilities in Bayesian filtering typically lack analytical forms. In this section, we introduce an approximation technique for computing the convolutional conditional probability, namely the exponential density rescaling technique. Moreover, we offer a theoretical explanation for this technique using the theory of information bottleneck.

4.1 Exponential Density Rescaling

When the distance metric is defined in terms of relative entropy, the transition probability and output probability can be approximated by simply reformulating them into exponential forms with fractional powers. Specifically, we have the following theorem:

Theorem 2 (Exponential Approximation).

When the distance metrics $d_{x}$ and $d_{y}$ are chosen as

d_{x}(\mathbf{x}_{t},\bar{\mathbf{x}}_{t})=D_{\mathrm{KL}}(\hat{p}_{\mathbf{x}_{t}}||\hat{p}_{\bar{\mathbf{x}}_{t}}),\;d_{y}(\mathbf{y}_{t},\bar{\mathbf{y}}_{t})=D_{\mathrm{KL}}(\hat{p}_{\mathbf{y}_{t}}||\hat{p}_{\bar{\mathbf{y}}_{t}}),

with $\hat{p}_{\mathbf{x}_{t}}(x)=\delta(x-\mathbf{x}_{t})$ and $\hat{p}_{\mathbf{y}_{t}}(y)=\delta(y-\mathbf{y}_{t})$ representing the empirical distribution, and $\mathbf{r}_{x}\sim\mathrm{Exp}(\alpha)$ , $\mathbf{r}_{y}\sim\mathrm{Exp}(\beta)$ with $\alpha,\beta>0$ , the convolutional transition probability and convolutional output probability can be approximated as

	$\displaystyle p_{c}(x_{t}\|x_{t-1})$	$\displaystyle\approx\frac{p(x_{t}\|x_{t-1})^{\frac{\alpha}{\alpha+1}}}{\int p(x_{t}\|x_{t-1})^{\frac{\alpha}{\alpha+1}}\,\mathrm{d}x_{t}}:=p_{e}(x_{t}\|x_{t-1}),$
	$\displaystyle p_{c}(y_{t}\|x_{t})$	$\displaystyle\approx\frac{p(y_{t}\|x_{t})^{\frac{\beta}{\beta+1}}}{\int p(y_{t}\|x_{t})^{\frac{\beta}{\beta+1}}\,\mathrm{d}y_{t}}:=p_{e}(y_{t}\|x_{t}).$

Proof.

We prove the theorem only for the convolutional output probability, because the proof logic for the convolutional transition probability is analogous and thus omitted. Besides, the proof is confined to cases where the sample space is finite. Supposing the sample space has $k$ elements, we define the set of all the probability distributions as

\Delta_{k}=\left\{p\in\mathbb{R}^{k}:p=[p_{1},p_{2},\ldots,p_{k}],\sum_{i}p_{i}=1,p_{i}>0,\forall i\right\}.

By Proposition 1, we have

p_{c}(y_{t}|x_{t})\propto P(d_{y}(\mathbf{y}_{t},\bar{\mathbf{y}}_{t})\leq\mathbf{r}_{y}|y_{t},x_{t}).

According to (5), we have

		$\displaystyle P(d_{y}(\mathbf{y}_{t},\bar{\mathbf{y}}_{t})\leq\mathbf{r}_{y}\|y_{t},x_{t})$
	$\displaystyle=$	$\displaystyle\int_{\bar{y}_{t}}\left(1-F_{y}(d_{y}(y_{t},\bar{y}_{t}))\right)p(\bar{y}_{t}\|x_{t})\,\mathrm{d}\bar{y}_{t}$
	$\displaystyle=$	$\displaystyle\int_{\bar{y}_{t}}e^{-\beta(D_{\text{KL}}(\hat{p}_{y_{t}}\|\|\hat{p}_{\bar{{y}}_{t}}))}p(\bar{y}_{t}\|x_{t})\,\mathrm{d}\bar{y}_{t}$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{\bar{\mathbf{y}}_{t}\sim p(\bar{y}_{t}\|x_{t})}\left\{e^{-\beta(D_{\text{KL}}(\hat{p}_{y_{t}}\|\|\hat{p}_{\bar{\mathbf{y}}_{t}}))}\right\}.$

For any $o\in\mathbb{R}^{k}$ , we define $o^{\prime}:=(o_{1},o_{2},\ldots,o_{k-1})\in\mathbb{R}^{k-1}$ and $C(o)\in\mathbb{R}^{k\times k}$ satisfying $[C(o)]_{(i,j)}=o_{i}\mathds{1}_{(i=j)}-o_{i}o_{j}$ . Under the assumption of finite sample space, we denote $\mathbf{\hat{q}}=\hat{p}_{\bar{\mathbf{y}}_{t}}\in\Delta_{k}$ and $p=\hat{p}_{{y}_{t}}\in\Delta_{k}$ . Thus, we have

\displaystyle\mathbb{E}\left\{e^{-\beta(D_{\text{KL}}(\hat{p}_{y_{t}}||\hat{p}_{\bar{\mathbf{y}}_{t}}))}\right\}=\mathbb{E}\left\{e^{-\beta(D_{\text{KL}}(p||\mathbf{\hat{q}}))}\right\}.

The chi-squared distance is a second-order Taylor approximation of the relative entropy [12]. That is to say, for any $p,q\in\Delta_{k}$ , we have

D_{\text{KL}}(p||q)=\frac{1}{2}\chi^{2}(p,q)+O(\|p-q\|^{3}),

(18)

where $D_{\text{KL}}(p||q):=\sum_{i}p_{i}\log(p_{i}/q_{i})$ and $\chi^{2}(p,q):=\sum_{i}(p_{i}-q_{i})^{2}/{q_{i}}$ . Besides, for all $p,q\in\Delta_{k}$ , the chi-squared distance $\chi^{2}(p,q)$ equals

\chi^{2}(p,q)=(p^{\prime}-q^{\prime})^{\top}C^{-1}(q^{\prime})(p^{\prime}-q^{\prime}),

(19)

where $p^{\prime}:=(p_{1},p_{2},\ldots,p_{k-1})\in\mathbb{R}^{k-1}$ [12]. According to the central limit theorem [23], the empirical distribution can be approximated by a Gaussian distribution $\hat{\mathbf{q}}\sim\mathcal{N}(\hat{q};q,C(q))$ [28] with $q=p(\bar{y}_{t}|x_{t})$ being the nominal distribution. Defining $C=C(q^{\prime})$ , by (18) and (19), we have

		$\displaystyle\mathbb{E}\left\{e^{-\beta(D_{\text{KL}}(p\|\|\mathbf{\hat{q}}))}\right\}$
	$\displaystyle\approx$	$\displaystyle\mathbb{E}\left\{e^{-\frac{\beta}{2}(p^{\prime}-\hat{\mathbf{q}}^{\prime})^{\top}C^{-1}(p^{\prime}-\hat{\mathbf{q}}^{\prime})}\right\}$
	$\displaystyle=$	$\displaystyle(2\pi)^{\frac{k-1}{2}}\|C/\beta\|^{\frac{1}{2}}\int\mathcal{N}(p^{\prime};\hat{q}^{\prime},C/\beta)\mathcal{N}(\hat{q}^{\prime};q^{\prime},C)\,\mathrm{d}\hat{q}^{\prime}$
	$\displaystyle=$	$\displaystyle(2\pi)^{\frac{k-1}{2}}\|C/\beta\|^{\frac{1}{2}}\mathcal{N}(p^{\prime};q^{\prime},(\frac{1}{\beta}+1)C)$
	$\displaystyle=$	$\displaystyle(\frac{1}{1+\beta})^{\frac{k-1}{2}}e^{-\frac{1}{2}(\frac{1}{\beta}+1)^{-1}(p^{\prime}-q^{\prime})^{\top}C^{-1}(p^{\prime}-q^{\prime})}$
	$\displaystyle\approx$	$\displaystyle(\frac{1}{1+\beta})^{\frac{k-1}{2}}e^{-\frac{\beta}{\beta+1}D_{\text{KL}}(p\|\|q)}$
	$\displaystyle=$	$\displaystyle(\frac{1}{1+\beta})^{\frac{k-1}{2}}e^{-\frac{\beta}{\beta+1}D_{\text{KL}}(\hat{p}_{y_{t}}\|\|p(\bar{y}_{t}\|x_{t}))}$
	$\displaystyle\propto$	$\displaystyle p({y}_{t}\|x_{t})^{\frac{\beta}{\beta+1}}.$

Thus, we finally achieve $p_{c}(y_{t}|x_{t})\propto p({y}_{t}|x_{t})^{\frac{\beta}{\beta+1}}$ . ∎
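The chi-squared approximation (18) used in the proof can be illustrated numerically on a small finite sample space. The sketch below uses an arbitrary distribution $q$ and perturbation direction, both chosen here as assumptions, and records the gap between $D_{\text{KL}}(p||q)$ and $\frac{1}{2}\chi^{2}(p,q)$ as $p$ approaches $q$ .

```python
import numpy as np

def kl_divergence(p, q):
    """Relative entropy D_KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def chi_squared(p, q):
    """Chi-squared distance sum_i (p_i - q_i)^2 / q_i."""
    return float(np.sum((p - q) ** 2 / q))

q = np.array([0.2, 0.3, 0.5])            # illustrative nominal distribution
direction = np.array([1.0, -2.0, 1.0])   # sums to zero, so q + eps * direction stays normalised

gaps = []
for eps in (1e-1, 1e-2, 1e-3):
    p = q + eps * direction
    gaps.append(abs(kl_divergence(p, q) - 0.5 * chi_squared(p, q)))
print(gaps)
```

Each tenfold reduction of the perturbation shrinks the gap by roughly three orders of magnitude, which is the behavior expected of a second-order Taylor approximation.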

Based on the second order approximation of the relative entropy, this theorem provides an effective way of performing convolutional Bayesian filtering by simply transforming transition probability and output probability using their fractional orders.

Remark 6.

In linear Gaussian systems, the proposed approximation alters the covariances for both transition and measurement noises. Consider, for example, the nominal transition probability $p(\bar{x}_{t}|x_{t-1})=\mathcal{N}(\bar{x}_{t};Ax_{t-1},Q)$ ; this becomes $p_{e}({x}_{t}|x_{t-1})=\mathcal{N}({x}_{t};Ax_{t-1},(\alpha+1)/\alpha\cdot Q)$ , thereby changing the covariance of the transition noise from $Q$ to $(\alpha+1)/\alpha\cdot Q$ . As in Corollary 1, the modification enlarges the nominal noise covariance. However, it is crucial to emphasize that while Corollary 1 is confined to Gaussian distributions, Theorem 2 broadens the scope to include any type of distribution. The extensive applicability of this approach is also reflected in the convolutional particle filter algorithm (see Algorithm 1), which is formulated without being limited to any specific distribution type.
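The covariance scaling in Remark 6 can be verified on a grid for a scalar Gaussian. This is a minimal sketch with illustrative values of $\alpha$ , the mean and $Q$ (assumptions made here): the nominal density is raised to the power $\alpha/(\alpha+1)$ , renormalized, and compared with a Gaussian of variance $(\alpha+1)/\alpha\cdot Q$ .

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Density of a scalar Gaussian N(mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def rescale_density(density, exponent, dx):
    """Raise a gridded density to a fractional power and renormalise it."""
    powered = density ** exponent
    return powered / (powered.sum() * dx)

# Illustrative (assumed) values.
alpha, mean, q = 2.0, 1.0, 0.5
x = np.linspace(-20.0, 22.0, 4201)
dx = x[1] - x[0]

p_e = rescale_density(gaussian_pdf(x, mean, q), alpha / (alpha + 1.0), dx)
expected = gaussian_pdf(x, mean, (alpha + 1.0) / alpha * q)
max_abs_error = float(np.max(np.abs(p_e - expected)))
print(max_abs_error)
```

The same `rescale_density` helper applies to any gridded density, which reflects the distribution-agnostic nature of Theorem 2; only in the Gaussian case does the result reduce to a covariance change.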

Algorithm 1 Convolutional Particle Filter with Exponential Density Rescaling

Input: Sequence of measurements $y_{1:t}$ , nominal transition probability $p(x_{t}|x_{t-1})$ , nominal output probability $p(y_{t}|x_{t})$ , number of particles $N$

1: for $t=0$ to $T$ do
2:   Prediction:
3:   for $i=1$ to $N$ do
4:     $x_{t}^{(i)}\sim p_{e}(x_{t}|x_{t-1}^{(i)})$
5:   end for
6:   Update:
7:   for $i=1$ to $N$ do
8:     $w^{(i)}=p_{e}(y_{t}|x_{t}^{(i)})$
9:   end for
10:   Normalize weights: $w\leftarrow w/\sum(w)$
11:   Resampling:
12:   Select $N$ particles based on weights $w$ for new set
13:   Replace current particles with the new set
14: end for

Output: Final set of particles $x^{(1:N)}$
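A minimal sketch of Algorithm 1 follows, under assumptions that are not part of the algorithm itself: a scalar state and measurement, additive Gaussian nominal noises (so that sampling from $p_{e}(x_{t}|x_{t-1})$ amounts to inflating the variance by $(\alpha+1)/\alpha$ ), and multinomial resampling. The function name `conv_pf` and the toy system in the usage lines are hypothetical.

```python
import numpy as np

def conv_pf(ys, f, g, q, r, alpha, beta, n_particles, x0_mean, x0_std, rng):
    """Particle filter with exponentially rescaled transition and output densities.

    Assumes x_t = f(x_{t-1}) + N(0, q) and y_t = g(x_t) + N(0, r) as the
    nominal scalar model. Returns the weighted-mean estimates and the final particles.
    """
    q_e = (alpha + 1.0) / alpha * q    # variance of p_e(x_t | x_{t-1}) for Gaussian noise
    gamma = beta / (beta + 1.0)        # fractional power applied to p(y_t | x_t)
    particles = rng.normal(x0_mean, x0_std, size=n_particles)
    estimates = []
    for y in ys:
        # Prediction: sample from the rescaled transition density.
        particles = f(particles) + rng.normal(0.0, np.sqrt(q_e), size=n_particles)
        # Update: weights proportional to p(y_t | x_t)^gamma; constants cancel on normalisation.
        log_w = -0.5 * gamma * (y - g(particles)) ** 2 / r
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        estimates.append(float(np.sum(w * particles)))
        # Resampling: multinomial draw according to the normalised weights.
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return np.array(estimates), particles

# Toy usage on a hypothetical linear system (illustration only).
rng = np.random.default_rng(0)
true_x, ys = 0.0, []
for _ in range(40):
    true_x = 0.9 * true_x + rng.normal(0.0, 1.0)
    ys.append(true_x + rng.normal(0.0, 1.0))

estimates, particles = conv_pf(
    ys, lambda x: 0.9 * x, lambda x: x, q=1.0, r=1.0,
    alpha=5.0, beta=5.0, n_particles=500, x0_mean=0.0, x0_std=1.0, rng=rng,
)
```

For non-Gaussian nominal densities, the prediction step would need a sampler for the rescaled density (for example, importance sampling from the nominal one), while the weight computation carries over by raising the nominal likelihood to the power $\beta/(\beta+1)$ .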

4.2 Connection with Information Bottleneck Theory

We have shown that convolutional Bayesian filtering can be approximated by the exponential density rescaling technique. This section provides a theoretical view of this technique using information bottleneck theory.

Given the measurement data $\mathbf{y}_{t}$ , the state $\mathbf{x}_{t}$ can be regarded as its compressed representation. Leveraging the information bottleneck theory [36, 6], we can express the information bottleneck objective as

	$\displaystyle q_{\text{info}}(x_{t}\|y_{t})=$	$\displaystyle\operatorname*{arg\,min}_{q(x_{t}\|y_{t})}\left\{-I(\mathbf{x}_{t},\bar{\mathbf{y}}_{t})\right\},$		(20)
	s.t.	$\displaystyle I\left(\mathbf{x}_{t},\mathbf{y}_{t}\right)\leq I_{0},$		(20)

where $I(\mathbf{x},\mathbf{y})=D_{\text{KL}}\left(p(x,y)\|p(x)p(y)\right)$ is defined as the mutual information between random variables $\mathbf{x}$ and $\mathbf{y}$ . Here, the mutual information $I(\mathbf{x}_{t},\mathbf{y}_{t})$ and $I(\mathbf{x}_{t},\mathbf{\bar{y}}_{t})$ are defined by two joint probability distributions $p(x_{t},y_{t})$ and $p(x_{t},\bar{y}_{t})$ , which can be decomposed as

	$\displaystyle p(x_{t},y_{t})$	$\displaystyle=q(x_{t}\|y_{t})p(y_{t})=q(y_{t}\|x_{t})p(x_{t}),$
	$\displaystyle p(x_{t},\bar{y}_{t})$	$\displaystyle=p(x_{t}\|\bar{y}_{t})p(\bar{y}_{t}).$

The goal of (20) is to maximize the mutual information between the virtual measurement and its compression, the system state, while ensuring that the mutual information between the state and the actual measurement does not exceed $I_{0}$ . The concept of the “information bottleneck” emerges from the limitation that $I(\mathbf{x}_{t},\mathbf{y}_{t})$ must not exceed $I_{0}$ , which requires compressing the information in $\mathbf{y}_{t}$ through a bottleneck, as depicted in Fig. 3(a).

The constrained optimization problem in (20) can be transformed into an unconstrained optimization problem by using the Lagrange multiplier $1-\gamma$ :

q_{\text{info}}(x_{t}|y_{t})=\operatorname*{arg\,min}_{q(x_{t}|y_{t})}\left\{-I(\mathbf{x}_{t},\bar{\mathbf{y}}_{t})+(1-\gamma)I(\mathbf{x}_{t},\mathbf{y}_{t})\right\}.

(21)

By leveraging the Markov property (see Fig. 3(b)) [3, 20], (21) can further be rewritten as

\displaystyle q_{\text{info}}(x_{t}|y_{t})=\operatorname*{arg\,min}_{q(x_{t}|y_{t})}\left\{I(\mathbf{x}_{t},\mathbf{y}_{t}|\bar{\mathbf{y}}_{t})-\gamma\cdot I(\mathbf{x}_{t},\mathbf{y}_{t})\right\}.

(22)

We can find an approximate upper bound of (22). For the first term, we have:

	$\displaystyle I(\mathbf{x}_{t},\mathbf{y}_{t}\|\bar{\mathbf{y}}_{t})$	(23)
$\displaystyle=$	$\displaystyle\mathbb{E}_{p(y_{t},\bar{y}_{t})}\left\{\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log\left(\frac{{q(\mathbf{x}_{t}\|\mathbf{y}_{t})p(\mathbf{y}_{t}\|{\bar{\mathbf{y}}}_{t})}}{p(\mathbf{x}_{t}\|\mathbf{\bar{y}}_{t})p(\mathbf{y}_{t}\|{\bar{\mathbf{y}}}_{t})}\right)\right\}\right\}$
$\displaystyle=$	$\displaystyle\mathbb{E}_{p(y_{t},\bar{y}_{t})}\left\{\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log\left(\frac{{q(\mathbf{x}_{t}\|\mathbf{y}_{t})}}{p(\mathbf{x}_{t}\|\mathbf{\bar{y}}_{t})}\right)\right\}\right\}$
$\displaystyle=$	$\displaystyle\mathbb{E}_{p(y_{t})}\left\{\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log\left(\frac{{q(\mathbf{x}_{t}\|\mathbf{y}_{t})}}{p_{c}(\mathbf{x}_{t}\|y_{1:t-1})}\right)\right\}\right\}$
	$\displaystyle-\mathbb{E}_{p(\bar{y}_{t})}\left\{\mathbb{E}_{p(x_{t}\|\bar{y}_{t})}\left\{\log\left(\frac{{p(\mathbf{x}_{t}\|\bar{\mathbf{y}}_{t})}}{p_{c}(\mathbf{x}_{t}\|y_{1:t-1})}\right)\right\}\right\}$
$\displaystyle=$	$\displaystyle\mathbb{E}_{p(y_{t})}\left\{\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log\left(\frac{{q(\mathbf{x}_{t}\|\mathbf{y}_{t})}}{p_{c}(\mathbf{x}_{t}\|y_{1:t-1})}\right)\right\}\right\}$
	$\displaystyle-\mathbb{E}_{p(\bar{y}_{t})}\left\{D_{\mathrm{KL}}(p(x_{t}\|\bar{\mathbf{y}}_{t})\\|p_{c}({x}_{t}\|y_{1:t-1}))\right\}.$

Note that we have $q(x_{t}|y_{t},\bar{y}_{t})=q(x_{t}|y_{t})$ due to the Markov property [3, 20]. Because the KL divergence is always non-negative, (23) can be upper bounded by

		$\displaystyle I(\mathbf{x}_{t},\mathbf{y}_{t}\|\bar{\mathbf{y}}_{t})$
	$\displaystyle\leq$	$\displaystyle\mathbb{E}_{p(y_{t})}\left\{\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log\left(\frac{{q(\mathbf{x}_{t}\|\mathbf{y}_{t})}}{p_{c}(\mathbf{x}_{t}\|y_{1:t-1})}\right)\right\}\right\}$
	$\displaystyle=$	$\displaystyle\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\mathbb{E}_{p(y_{t})}\left\{\log\left(\frac{{q(\mathbf{x}_{t}\|\mathbf{y}_{t})}}{p_{c}(\mathbf{x}_{t}\|y_{1:t-1})}\right)\right\}\right\}$
	$\displaystyle\approx$	$\displaystyle\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log\left(\frac{{q(\mathbf{x}_{t}\|y_{t})}}{p_{c}(\mathbf{x}_{t}\|y_{1:t-1})}\right)\right\}$
	$\displaystyle=$	$\displaystyle D_{\text{KL}}\left(q(x_{t}\|y_{t})\\|p_{c}(x_{t}\|y_{1:t-1})\right).$

The approximate equality in the penultimate line is due to the substitution of expected values with sample values. Besides, the second term $I\left(\mathbf{x}_{t},\mathbf{y}_{t}\right)$ is approximately lower bounded by

	$\displaystyle I\left(\mathbf{x}_{t},\mathbf{y}_{t}\right)$	(24)
$\displaystyle=$	$\displaystyle H\left(\mathbf{y}_{t}\right)+\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\mathbb{E}_{p(y_{t})}\left\{\log{q(\mathbf{y}_{t}\|\mathbf{x}_{t})}\right\}\right\}$
$\displaystyle=$	$\displaystyle H\left(\mathbf{y}_{t}\right)+\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\mathbb{E}_{p(y_{t})}\left\{\log{p(\mathbf{y}_{t}\|\mathbf{x}_{t})}\right\}\right\}$
	$\displaystyle+\mathbb{E}_{p(y_{t})}\left\{\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log{q(\mathbf{y}_{t}\|\mathbf{x}_{t})}\right\}-\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log{p(\mathbf{y}_{t}\|\mathbf{x}_{t})}\right\}\right\}$
$\displaystyle=$	$\displaystyle H\left(\mathbf{y}_{t}\right)+\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\mathbb{E}_{p(y_{t})}\left\{\log{p(\mathbf{y}_{t}\|\mathbf{x}_{t})}\right\}\right\}$
	$\displaystyle+\mathbb{E}_{p(x_{t})}\left\{D_{\mathrm{KL}}(q(y_{t}\|\mathbf{x}_{t})\\|p(y_{t}\|\mathbf{x}_{t}))\right\}$
$\displaystyle\geq$	$\displaystyle H\left(\mathbf{y}_{t}\right)+\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\mathbb{E}_{p(y_{t})}\left\{\log{p(\mathbf{y}_{t}\|\mathbf{x}_{t})}\right\}\right\}$
$\displaystyle\approx$	$\displaystyle H\left(\mathbf{y}_{t}\right)+\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log{p(y_{t}\|\mathbf{x}_{t})}\right\}.$

Here, $p(y_{t}|\mathbf{x}_{t})$ is not the real output probability. Instead, it is the value of the nominal output probability $p(\bar{y}_{t}|\mathbf{x}_{t})$ evaluated at the real measurement data $y_{t}$ . Combining (23) and (24), we can replace the unconstrained optimization problem in (22) with the minimization of its approximate upper bound:

	$\displaystyle q_{\text{info}}(x_{t}\|y_{t})$	(25)
$\displaystyle=$	$\displaystyle\operatorname*{arg\,min}_{q(x_{t}\|y_{t})}\Big{\{}-\gamma\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log p(y_{t}\|\mathbf{x}_{t})\right\}$
	$\displaystyle\quad\quad\quad\quad+D_{\text{KL}}\left(q(x_{t}\|y_{t})\\|p_{c}(x_{t}\|y_{1:t-1})\right)\Big{\}}$
$\displaystyle=$	$\displaystyle\operatorname*{arg\,min}_{q(x_{t}\|y_{t})}\left\{\mathbb{E}_{q(x_{t}\|y_{t})}\left\{\log\left(\frac{q(\mathbf{x}_{t}\|y_{t})}{p_{c}(\mathbf{x}_{t}\|y_{1:t-1})\cdot p(y_{t}\|\mathbf{x}_{t})^{\gamma}}\right)\right\}\right\}$
$\displaystyle=$	$\displaystyle\frac{p_{c}(x_{t}\|y_{1:t-1})p(y_{t}\|x_{t})^{\gamma}}{\int p_{c}(x_{t}\|y_{1:t-1})p(y_{t}\|x_{t})^{\gamma}\,\mathrm{d}x_{t}}.$

Here, the entropy of the measurement $H(\mathbf{y}_{t})$ is omitted because it is constant with respect to the optimization variable in (22). From (25), it can be seen that by setting $\gamma=\beta/(\beta+1)$ , the solution to the information bottleneck problem coincides with the update step of convolutional Bayesian filtering approximated by the exponential density rescaling technique. This relationship offers a new perspective for understanding our framework. In more detail, the variable $\gamma$ , which enters through the Lagrange multiplier $1-\gamma$ , balances the trade-off between reconstructing information about the measurement model and compressing the representation of the measurement data. As $\beta$ increases, the compression bottleneck becomes less restrictive. When $\beta\to\infty$ , convolutional Bayesian filtering simplifies to standard Bayesian filtering, and $\gamma\to 1$ , indicating that information is reconstructed without a bottleneck.
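The closed-form minimizer in (25) can be illustrated on a finite state grid. In this sketch, the prior vector stands in for $p_{c}(x_{t}|y_{1:t-1})$ and the likelihood vector for $p(y_{t}|x_{t})$ evaluated at a fixed measurement; both are randomly generated placeholders (assumptions), and the tempered posterior is compared against random candidate distributions.

```python
import numpy as np

def tempered_posterior(prior, likelihood, gamma):
    """Closed-form solution of (25): prior * likelihood^gamma, normalised."""
    unnormalised = prior * likelihood ** gamma
    return unnormalised / unnormalised.sum()

def objective(q, prior, likelihood, gamma):
    """Objective of (25): -gamma * E_q[log likelihood] + D_KL(q || prior)."""
    return float(-gamma * np.sum(q * np.log(likelihood)) + np.sum(q * np.log(q / prior)))

rng = np.random.default_rng(1)
prior = rng.dirichlet(np.ones(6))                # placeholder for p_c(x_t | y_{1:t-1})
likelihood = rng.uniform(0.05, 1.0, size=6)      # placeholder for p(y_t | x_t) on the grid
beta = 3.0
gamma = beta / (beta + 1.0)

q_star = tempered_posterior(prior, likelihood, gamma)
best = objective(q_star, prior, likelihood, gamma)
others = [objective(rng.dirichlet(np.ones(6)), prior, likelihood, gamma) for _ in range(1000)]
print(best, min(others))
```

The objective at `q_star` equals the negative log of the normalizing constant in (25), and no randomly drawn candidate should attain a smaller value.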

5 Simulations

In this section, we evaluate our proposed framework across three benchmark systems to demonstrate its applicability to classic filtering algorithms in addressing model mismatches. We conduct $N=100$ Monte Carlo experiments with $M=40$ time steps for each simulation. In each experiment, the chosen evaluation metric is the root mean square error (RMSE), which is defined as

\mathrm{RMSE}=\sqrt{\frac{1}{M}\sum_{i=1}^{M}\|x_{i}-\hat{x}_{i}\|^{2}}.

Here, $x_{i}$ and $\hat{x}_{i}$ denote the real and estimated states at the $i$ -th step. In our simulations, this metric is averaged over the 100 experiments for a fair performance evaluation.
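The metric above can be computed as in the following sketch, which assumes the true and estimated trajectories are stored as arrays of shape $(M,n)$ ; the function name `rmse` is introduced here for illustration.

```python
import numpy as np

def rmse(x_true, x_est):
    """Root mean square error over M time steps for (M, n) state arrays."""
    x_true = np.asarray(x_true, dtype=float)
    x_est = np.asarray(x_est, dtype=float)
    sq_norms = np.sum((x_true - x_est) ** 2, axis=1)   # squared Euclidean error per step
    return float(np.sqrt(np.mean(sq_norms)))

# Two-step toy trajectory: errors of norm 5 and 0.
value = rmse([[0.0, 0.0], [0.0, 0.0]], [[3.0, 4.0], [0.0, 0.0]])
```

Averaging over Monte Carlo runs then amounts to taking the mean of the per-run values returned by `rmse`.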

5.1 Linear Wiener Velocity Model

The Wiener velocity model is a well-known standard environment in the field of target tracking, where the velocity is modeled as the Wiener process [42]. The state $\mathbf{x}=\begin{bmatrix}p_{x},\,p_{y},\,v_{x},\,v_{y}\end{bmatrix}^{\top}$ consists of system positions $p_{x},p_{y}$ and system velocities $v_{x},v_{y}$ . The Wiener velocity model is described by

	$\displaystyle\bar{\mathbf{x}}_{t+1}$	$\displaystyle=\begin{bmatrix}1&0&0.1&0\\ 0&1&0&0.1\\ 0&0&1&0\\ 0&0&0&1\end{bmatrix}\mathbf{x}_{t}+\bar{\mathbf{\xi}}_{t},$
	$\displaystyle\bar{\mathbf{y}}_{t}$	$\displaystyle=\begin{bmatrix}1&0&0&0\\ 0&1&0&0\end{bmatrix}\mathbf{x}_{t}+\bar{\mathbf{\zeta}}_{t}.$

Here, the virtual process noise $\bar{\xi}_{t}$ is modelled by $\bar{\xi}_{t}\sim\mathcal{N}(0,Q)$ with covariance $Q=\mathbb{I}_{4\times 4}$ , and virtual measurement noise satisfies $\bar{\zeta}_{t}\sim\mathcal{N}(0,R)$ with $R=\mathbb{I}_{2\times 2}$ . Additionally, the initial state $\mathbf{x}_{0}$ satisfies $\mathbf{x}_{0}\sim\mathcal{N}\left(\begin{bmatrix}0&0&1&1\end{bmatrix}^{\top},\,\mathbb{I}_{4\times 4}\right)$ .

To evaluate the effectiveness of our designed filter, we consider two different cases for model mismatch, which is a common setting in existing works [14, 30]:

• Case A: Transition Model Mismatch: In the real system, the process noise is a mixture of Gaussian noises, while the measurement noise is Gaussian:

	$\displaystyle\xi_{t}$	$\displaystyle\sim 0.9\cdot\mathcal{N}(0,Q)+0.1\cdot\mathcal{N}(0,100Q),$
	$\displaystyle\zeta_{t}$	$\displaystyle\sim\mathcal{N}(0,R).$

• Case B: Measurement Model Mismatch: The process noise is Gaussian, while the measurement noise is a mixture of Gaussian noises:

	$\displaystyle\xi_{t}$	$\displaystyle\sim\mathcal{N}(0,Q),$
	$\displaystyle\zeta_{t}$	$\displaystyle\sim 0.9\cdot\mathcal{N}(0,R)+0.1\cdot\mathcal{N}(0,1000R).$
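The contaminated noises in Cases A and B can be generated as in the sketch below. The helper name `sample_gaussian_mixture` and the seed are assumptions; the mixture weights and scale factors are those stated in the two cases.

```python
import numpy as np

def sample_gaussian_mixture(rng, cov, outlier_scale, outlier_prob, size):
    """Draw zero-mean samples from (1 - p) * N(0, cov) + p * N(0, outlier_scale * cov).

    Scaling a N(0, cov) draw by sqrt(outlier_scale) yields a N(0, outlier_scale * cov) draw.
    Returns the samples and a boolean mask marking the outlier component.
    """
    dim = cov.shape[0]
    base = rng.multivariate_normal(np.zeros(dim), cov, size=size)
    is_outlier = rng.random(size) < outlier_prob
    scale = np.where(is_outlier, np.sqrt(outlier_scale), 1.0)
    return base * scale[:, None], is_outlier

rng = np.random.default_rng(0)
Q, R = np.eye(4), np.eye(2)
# Case A: process noise 0.9 * N(0, Q) + 0.1 * N(0, 100 Q)
xi, xi_outliers = sample_gaussian_mixture(rng, Q, 100.0, 0.1, size=40)
# Case B: measurement noise 0.9 * N(0, R) + 0.1 * N(0, 1000 R)
zeta, zeta_outliers = sample_gaussian_mixture(rng, R, 1000.0, 0.1, size=40)
```

Each call returns one noise sample per time step, matching the $M=40$ steps used in the simulations.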

Our proposed convolutional Bayesian filtering framework is applied to the KF through Corollary 1, yielding the ConvKF. We compare ConvKF, under various values of the parameters $\alpha$ and $\beta$ defined in Corollary 1, with the standard KF and the Huber KF. Note that the Huber KF is a widely used robust method that replaces the quadratic loss in the optimization formulation of the KF with the Huber loss [14]. In Fig. 4, a box plot of the RMSE shows that ConvKF outperforms the standard KF across a broad range of parameters in both cases A and B. Specifically, in case A, varying the exponential coefficient $\alpha$ from 0.005 to 0.05 leaves the RMSE of ConvKF almost unchanged. In contrast, in case B, varying the exponential coefficient $\beta$ from 0.005 to 0.05 leads to a slight increase in the RMSE of ConvKF.

5.2 Sequence Forecasting System

In this subsection, we consider a popular nonlinear system used for sequence forecasting [35]. The state space model is given by

	$\displaystyle\bar{\mathbf{x}}_{t+1}$	$\displaystyle=\mathbf{x}_{t}+\kappa_{1}\cdot\begin{bmatrix}-1&0\\ 0.1&-1\end{bmatrix}\mathbf{x}_{t}+\kappa_{2}\cdot\cos(\mathbf{x}_{t})+\bar{\xi}_{t},$
	$\displaystyle\bar{\mathbf{y}}_{t}$	$\displaystyle=\mathbf{x}_{t}+\sin(\mathbf{x}_{t})+\bar{\zeta}_{t},$

where $\mathbf{x}_{0}\sim\mathcal{N}\left(0,\,\mathbb{I}_{2\times 2}\right)$ . Both the constants $\kappa_{1}$ and $\kappa_{2}$ are set to $0.1$ . We assume $Q=R=\mathbb{I}_{2\times 2}$ for virtual process noise $\bar{\xi}_{t}\sim\mathcal{N}\left(0,\,Q\right)$ and virtual measurement noise $\bar{\zeta}_{t}\sim\mathcal{N}\left(0,\,R\right)$ , respectively. We construct convolutional approaches for EKF and UKF by considering quadratic form distance metrics and an exponential distribution threshold variable, similar to Corollary 1. These approaches are named the Convolutional EKF (ConvEKF) and Convolutional UKF (ConvUKF), respectively. Similar to the discussion in Subsection 5.1, we compare our methods with the standard UKF [16], standard EKF [33], Huber UKF [7], and Iterated EKF (IEKF) [5]. The IEKF is a variant of the EKF that enhances linear approximation to nonlinear systems through iterative updates, thereby improving filter performance. The Huber UKF is a robust version of the UKF, which replaces the quadratic loss in the optimization of the update step with Huber loss. Our comparisons consider the following two cases:

• Case A: Transition Model Mismatch: In the real system, the process noise is a mixture of Gaussian noises, while the measurement noise is Gaussian. Specifically, we have

	$\displaystyle\xi_{t}$	$\displaystyle\sim 0.9\cdot\mathcal{N}(0,Q)+0.1\cdot\mathcal{N}(0,100Q),$
	$\displaystyle\zeta_{t}$	$\displaystyle\sim\mathcal{N}(0,R).$

• Case B: Measurement Model Mismatch: The process noise is Gaussian, while the measurement noise is a mixture of Gaussian noises. This is represented as

	$\displaystyle\xi_{t}$	$\displaystyle\sim\mathcal{N}(0,Q),$
	$\displaystyle\zeta_{t}$	$\displaystyle\sim 0.9\cdot\mathcal{N}(0,R)+0.1\cdot\mathcal{N}(0,1000R).$

As demonstrated in Fig. 5, ConvEKF outperforms the other methods in both case A and case B over a wide range of parameters. Additionally, ConvUKF also shows improvements over the standard UKF, particularly in situations with measurement outliers. Notably, the Huber UKF fails in scenarios with transition model mismatch, possibly because it is designed to enhance robustness by considering the post-prediction prior in the update step, rather than directly incorporating robustness into the prediction step.

5.3 Isothermal Gas-phase Reactor Model

We perform simulation on a commonly used isothermal gas-phase reactor model for state estimation [32]. This model describes the reversible chemical reaction $2A_{r}\rightleftharpoons B_{r}$ . Initially, the reactor is charged with certain amounts of $A_{r}$ and $B_{r}$ , but the exact composition of the original mixture remains uncertain. The state $\mathbf{x}$ includes the partial pressures, i.e., $\mathbf{x}=\left[\begin{matrix}P_{A}&P_{B}\end{matrix}\right]^{\top}$ . The discrete-time version of the gas-phase reactor model with Euler method is

	$\displaystyle\bar{\mathbf{x}}_{t+1}$	$\displaystyle=\begin{pmatrix}P_{A,t}+(-2k_{1}\cdot P_{A,t}^{2}+2k_{2}\cdot P_{B,t})\cdot\mathrm{d}t\\ P_{B,t}+(k_{1}\cdot P_{A,t}^{2}-k_{2}\cdot P_{B,t})\cdot\mathrm{d}t\\ \end{pmatrix}+\bar{\xi}_{t},$
	$\displaystyle\bar{\mathbf{y}}_{t}$	$\displaystyle=P_{A,t}+P_{B,t}+\bar{\zeta}_{t},$

where ${\mathbf{x}}_{0}=[0.1,\,4.5]^{\top}$ , $k_{1}=0.16$ , $k_{2}=0.0064$ , and $\mathrm{d}t=0.1$ . The virtual process noise satisfies $\bar{\xi}_{t}\sim\mathrm{Laplace}\left(0,\,Q\right)$ with $Q=10^{-4}\mathbb{I}_{2\times 2}$ , and the virtual measurement noise satisfies $\bar{\zeta}_{t}\sim\mathrm{Laplace}\left(0,\,R\right)$ with $R=\mathbb{I}_{1\times 1}$ . For the subsequent verification, we set up two simulations similar to those in Section 5.1:

• Case A, Transition model mismatch: The measurement noise obeys Laplace distribution while the process noise is a mixture of Laplace noise, i.e.,

	$\displaystyle\xi_{t}$	$\displaystyle\sim 0.9\cdot\mathrm{Laplace}(0,Q)+0.1\cdot\mathrm{Laplace}(0,1000Q),$
	$\displaystyle\zeta_{t}$	$\displaystyle\sim\mathrm{Laplace}(0,R).$

• Case B, Measurement model mismatch: The process noise obeys Laplace distribution while the measurement noise is a mixture of Laplace noise, i.e.,

	$\displaystyle\xi_{t}$	$\displaystyle\sim\mathrm{Laplace}(0,Q),$
	$\displaystyle\zeta_{t}$	$\displaystyle\sim 0.9\cdot\mathrm{Laplace}(0,R)+0.1\cdot\mathrm{Laplace}(0,1000R).$
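The noise-free part of the reactor model above can be written as a short sketch. The function names `reactor_step` and `reactor_measurement` are introduced here for illustration; the constants are the values of $k_{1}$ , $k_{2}$ and $\mathrm{d}t$ given in the text, and the noise terms are deliberately left out.

```python
import numpy as np

def reactor_step(x, k1=0.16, k2=0.0064, dt=0.1):
    """One Euler step of the isothermal gas-phase reactor 2A <-> B, without noise."""
    p_a, p_b = x
    return np.array([
        p_a + (-2.0 * k1 * p_a ** 2 + 2.0 * k2 * p_b) * dt,
        p_b + (k1 * p_a ** 2 - k2 * p_b) * dt,
    ])

def reactor_measurement(x):
    """Total pressure P_A + P_B, the noise-free measurement."""
    return x[0] + x[1]

x0 = np.array([0.1, 4.5])
x1 = reactor_step(x0)
y1 = reactor_measurement(x1)
```

Note that the Euler step leaves $P_{A}+2P_{B}$ unchanged, reflecting the stoichiometry of the reaction $2A_{r}\rightleftharpoons B_{r}$ .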

We apply our proposed convolutional Bayesian filtering framework to PF, approximated using exponential density rescaling, and refer to it as ConvPF, as shown in Algorithm 1. Our method is compared with standard PF, auxiliary PF (APF) [29], and student-t PF (STPF) [38]. Note that APF and STPF are two widely used robust algorithms. APF introduces an auxiliary variable to select particles based on both their weights and the likelihood of the current observation prior to the actual resampling step. This method focuses computational resources on more promising particles, enhancing the filter’s performance, particularly in scenarios with tailed observation densities. On the other hand, STPF employs the Student’s t distribution, which has heavier tails, making it more capable of handling extreme values or deviations from normal assumptions.

As depicted in Fig. 6, our approach yields the minimum estimation error under both transition model mismatch (Case A) and measurement model mismatch (Case B). Although APF exhibits a marginal enhancement in RMSE over the standard PF, the improvement is not significant. STPF shows varied performance; in Case A, the STPF’s median RMSE is marginally better than that of PF, yet its overall variance and average RMSE are notably higher. In Case B, the STPF does offer an improvement compared to the PF. However, our ConvPF method, with tuning parameters $\alpha$ and $\beta$ , consistently outperforms the other methods.

6 Conclusion and Discussion

This paper extends the definition of conditional probability and introduces a convolutional Bayesian filtering framework by transforming transition and output probabilities into convolutional forms, broadening the scope of Bayesian filtering. We demonstrate that convolutional Bayesian filtering possesses analytical forms of convolution operation in systems with Gaussian noises. For non-Gaussian cases, the transition and output probabilities can be effectively approximated by scaling them into fractional powers, when employing the relative entropy as the distance measure. This leads to an enhanced version of the Kalman filter, which achieves robustness through simple modifications to the noise covariance matrix, while still preserving the conjugate nature of Gaussian distributions. The practical efficacy of convolutional Bayesian filtering is demonstrated through its application to various common filtering algorithms, including the Kalman filter, extended Kalman filter, unscented Kalman filter, and particle filter.

In this paper, our primary focus is the generalization of Bayesian filtering theory to a convolutional form. Bayesian filtering undeniably forms the foundation of optimal filtering theory for discrete-time systems, highlighting the significance and applicability of our extension. Nevertheless, it is also crucial to acknowledge the distinctive aspects of filtering theory for continuous-time systems. In these systems, the conditional density function of states typically derives from numerical solutions of Kushner’s or Duncan-Mortensen-Zakai’s equations [41, 1], rather than Bayes’ law. A notable advancement in this domain is the Yau-Yau method [39, 40], which is rigorously proven to converge to a global solution (a type of convergence otherwise only seen in particle filters in discrete-time systems) and can be pre-computed offline, facilitating real-time applications [10, 27]. While we do not explore how to apply our approach to continuous-time systems in this paper, such an extension is a compelling future research avenue.

References

[1] Richard Edgar Mortensen. Optimal control of continuous-time stochastic systems. University of California, Berkeley, 1966.
[2] Gabriel Agamennoni, Juan I Nieto, and Eduardo M Nebot. Approximate inference in state-space models with heavy-tailed noise. IEEE Transactions on Signal Processing, 60(10):5024–5037, 2012.
[3] Alexander A Alemi. Variational predictive information bottleneck. In Symposium on Advances in Approximate Bayesian Inference, pages 1–6. PMLR, 2020.
[4] Brian DO Anderson and John B Moore. Optimal filtering. Courier Corporation, 2012.
[5] Bradley M Bell and Frederick W Cathey. The iterated Kalman filter update as a Gauss-Newton method. IEEE Transactions on Automatic Control, 38(2):294–297, 1993.
[6] William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity, and learning. Neural computation, 13(11):2409–2463, 2001.
[7] Zhu Bing, Chang Lubin, Jiangning Xu, et al. Huber-based adaptive unscented Kalman filter with non-Gaussian measurement noise. Circuits, Systems, and Signal Processing, 37(12):1–21, 2018.
[8] Wenhan Cao, Chang Liu, Zhiqian Lan, Yingxi Piao, and Shengbo Eben Li. Generalized moving horizon estimation for nonlinear systems with robustness to measurement outliers. In 2023 American Control Conference (ACC), pages 1614–1621. IEEE, 2023.
[9] Badong Chen, Xi Liu, Haiquan Zhao, and Jose C Principe. Maximum correntropy Kalman filter. Automatica, 76:70–77, 2017.
[10] Xiuqiong Chen, Ji Shi, and Stephen S-T Yau. Real-time solution of time-varying Yau filtering problems via direct method and Gaussian approximation. IEEE Transactions on Automatic Control, 64(4):1648–1654, 2018.
[11] Zhe Chen et al. Bayesian filtering: From Kalman filters to particle filters, and beyond. Statistics, 182(1):1–69, 2003.
[12] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
[13] Karl J Friston, William Penny, Christophe Phillips, S Kiebel, G Hinton, and John Ashburner. Classical and Bayesian inference in neuroimaging: theory. NeuroImage, 16(2):465–483, 2002.
[14] Peter J Huber. Robust statistics, volume 523. John Wiley & Sons, 2004.
[15] Andrew H Jazwinski. Stochastic processes and filtering theory. Courier Corporation, 2007.
[16] Simon J Julier and Jeffrey K Uhlmann. New extension of the Kalman filter to nonlinear systems. In Signal Processing, Sensor Fusion, and Target Recognition VI, volume 3068, pages 182–193. SPIE, 1997.
[17] Simon J Julier and Jeffrey K Uhlmann. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401–422, 2004.
[18] R.E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[19] Rudolph E Kalman and Richard S Bucy. New results in linear filtering and prediction theory. 1961.
[20] Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. An optimization-centric view on Bayes’ rule: Reviewing and generalizing variational inference. Journal of Machine Learning Research, 23(132):1–109, 2022.
[21] A Kolmogorov. Interpolation and extrapolation of stationary random sequences. Izvestiya Rossiiskoi Akademii Nauk. Seriya Matematicheskaya, 5:3, 1941.
[22] Rahul Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[23] Sang Gyu Kwak and Jong Hae Kim. Central limit theorem: the cornerstone of modern statistics. Korean journal of anesthesiology, 70(2):144–156, 2017.
[24] Shengbo Eben Li. Reinforcement learning for sequential decision and optimal control. Springer, 2023.
[25] Chang Liu, Shengbo Eben Li, Diange Yang, and J. Karl Hedrick. Distributed Bayesian Filter Using Measurement Dissemination for Multiple Unmanned Ground Vehicles With Dynamically Changing Interaction Topologies. Journal of Dynamic Systems, Measurement, and Control, 140(3):030903, 11 2017.
[26] Jun S Liu and Rong Chen. Sequential Monte Carlo methods for dynamic systems. Journal of the American statistical association, pages 1032–1044, 1998.
[27] Xue Luo and Stephen S-T Yau. Complete real time solution of the general nonlinear filtering problem without memory. IEEE Transactions on Automatic Control, 58(10):2563–2578, 2013.
[28] Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 2018.
[29] Michael K Pitt and Neil Shephard. Filtering via simulation: Auxiliary particle filters. Journal of the American statistical association, 94(446):590–599, 1999.
[30] Michael Roth, Emre Özkan, and Fredrik Gustafsson. A Student’s t filter for heavy tailed process and measurement noise. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5770–5774. IEEE, 2013.
[31] Simo Sarkka and Aapo Nummenmaa. Recursive noise adaptive Kalman filtering by variational Bayesian approximations. IEEE Transactions on Automatic control, 54(3):596–600, 2009.
[32] Julian D Schiller, Simon Muntwiler, Johannes Köhler, Melanie N Zeilinger, and Matthias A Müller. A Lyapunov function for robust stability of moving horizon estimation. IEEE Transactions on Automatic Control, 2023.
[33] Gerald L Smith, Stanley F Schmidt, and Leonard A McGee. Application of statistical filter theory to the optimal estimation of position and velocity on board a circumlunar vehicle, volume 135. National Aeronautics and Space Administration, 1962.
[34] Yangtianze Tao, Jiayi Kang, and Stephen Shing-Toung Yau. Maximum Correntropy Ensemble Kalman Filter. In 2023 62nd IEEE Conference on Decision and Control (CDC), pages 8659–8664. IEEE, 2023.
[35] Yangtianze Tao and Stephen Shing-Toung Yau. Outlier-Robust Iterative Extended Kalman Filtering. IEEE Signal Processing Letters, 2023.
[36] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
[37] Norbert Wiener. Extrapolation, interpolation, and smoothing of stationary time series: with engineering applications. The MIT press, 1949.
[38] Dingjie Xu, Chen Shen, and Feng Shen. A robust particle filtering algorithm with non-Gaussian measurement noise using student-t distribution. IEEE Signal Processing Letters, 21(1):30–34, 2013.
[39] Shing-Tung Yau and Stephen S-T Yau. Real time solution of nonlinear filtering problem without memory I. Mathematical Research Letters, 7(6):671–693, 2000.
[40] Shing-Tung Yau and Stephen S-T Yau. Real time solution of the nonlinear filtering problem without memory II. SIAM Journal on Control and Optimization, 47(1):163–195, 2008.
[41] Moshe Zakai. On the optimal filtering of diffusion processes. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 11(3):230–243, 1969.
[42] Zhengxin Zhang, Xiaosheng Si, Changhua Hu, and Yaguo Lei. Degradation data analysis and remaining useful life estimation: A review on Wiener-process-based methods. European Journal of Operational Research, 271(3):775–796, 2018.

$$
\begin{aligned}
p_{c}(y \mid x) &= \frac{\int_{z}\left(1-F(d(y,z))\right)p(z \mid x)\,\mathrm{d}z}{\int_{y}\int_{z}\left(1-F(d(y,z))\right)p(z \mid x)\,\mathrm{d}z\,\mathrm{d}y} \\
&\propto \int_{z}\left(1-F(d(y,z))\right)p(z \mid x)\,\mathrm{d}z.
\end{aligned}
\tag{2}
$$
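Equation (2) can be checked numerically in a case where the integral has a closed form. The sketch below is illustrative only and makes three assumptions that are not fixed by the equation itself: $p(z \mid x)$ is the scalar Gaussian $\mathcal{N}(z; x, \sigma^2)$, the distance is $d(y, z) = (y - z)^2$, and the threshold in the inequality condition is exponentially distributed with rate $\lambda$, so that $1 - F(d) = e^{-\lambda d}$. Under these assumptions the kernel is an unnormalized Gaussian with variance $1/(2\lambda)$, and the integral is an ordinary convolution yielding $\mathcal{N}(y; x, \sigma^2 + 1/(2\lambda))$.

```python
import numpy as np

def convolutional_density(y_grid, x, sigma, lam):
    """Evaluate p_c(y | x) of Eq. (2) on a uniform grid by a Riemann sum.

    Assumed setup (illustrative): p(z | x) = N(z; x, sigma^2),
    d(y, z) = (y - z)^2, and 1 - F(d) = exp(-lam * d).
    """
    z = y_grid                       # integrate over the same grid
    dz = z[1] - z[0]
    p_z = np.exp(-0.5 * ((z - x) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    # kernel[i, j] = 1 - F(d(y_i, z_j))
    kernel = np.exp(-lam * (y_grid[:, None] - z[None, :]) ** 2)
    unnormalized = kernel @ p_z * dz             # numerator of Eq. (2)
    return unnormalized / (unnormalized.sum() * dz)  # divide by the double integral

y_grid = np.linspace(-15.0, 15.0, 1201)
x, sigma, lam = 1.0, 1.0, 0.25

p_c = convolutional_density(y_grid, x, sigma, lam)

# Closed form under the same assumptions: variance is inflated by 1 / (2 * lam).
var_closed = sigma**2 + 1.0 / (2.0 * lam)
p_closed = np.exp(-0.5 * (y_grid - x) ** 2 / var_closed) / np.sqrt(2.0 * np.pi * var_closed)

max_err = np.max(np.abs(p_c - p_closed))
```

In this setting the convolutional density keeps the mean of $p(z \mid x)$ and only widens it, which is consistent with the covariance-only modification of the Kalman filter described in the paper.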


$$
\begin{align}
p(x_{t} \mid y_{1:t-1}) &= \int p(x_{t} \mid x_{t-1})\,p(x_{t-1} \mid y_{1:t-1})\,\mathrm{d}x_{t-1}, \tag{11a} \\
p(x_{t} \mid y_{1:t}) &= \frac{p(x_{t} \mid y_{1:t-1})\,p(y_{t} \mid x_{t})}{\int p(x_{t} \mid y_{1:t-1})\,p(y_{t} \mid x_{t})\,\mathrm{d}x_{t}}. \tag{11b}
\end{align}
$$
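As a minimal sketch of how the prediction step (11a) and the update step (11b) alternate, the recursion can be evaluated on a uniform grid for a scalar state. The model below (a linear Gaussian system with coefficient `a`, process variance `q`, and measurement variance `r`) and all function names are assumptions chosen so that the result can be compared against the scalar Kalman filter; the recursion itself does not depend on them.

```python
import numpy as np

def gauss(u, var):
    """Zero-mean scalar Gaussian density with variance var, evaluated at u."""
    return np.exp(-0.5 * u**2 / var) / np.sqrt(2.0 * np.pi * var)

def bayes_filter_step(prior, transition, likelihood, dx):
    """One step of Eq. (11) on a uniform grid with spacing dx.

    prior:      p(x_{t-1} | y_{1:t-1}) on the grid, shape (n,)
    transition: transition[i, j] = p(x_t = grid[i] | x_{t-1} = grid[j]), shape (n, n)
    likelihood: p(y_t | x_t = grid[i]) for the observed y_t, shape (n,)
    """
    predicted = transition @ prior * dx                    # Eq. (11a)
    unnormalized = predicted * likelihood                  # numerator of Eq. (11b)
    posterior = unnormalized / (unnormalized.sum() * dx)   # denominator of Eq. (11b)
    return predicted, posterior

# Assumed model: x_t = a * x_{t-1} + w, w ~ N(0, q);  y_t = x_t + v, v ~ N(0, r).
grid = np.linspace(-10.0, 10.0, 801)
dx = grid[1] - grid[0]
a, q, r = 0.9, 0.25, 1.0

prior = gauss(grid, 1.0)                                   # x_{t-1} ~ N(0, 1)
transition = gauss(grid[:, None] - a * grid[None, :], q)
y_t = 2.0
likelihood = gauss(y_t - grid, r)

predicted, posterior = bayes_filter_step(prior, transition, likelihood, dx)
post_mean = np.sum(grid * posterior) * dx
post_var = np.sum((grid - post_mean) ** 2 * posterior) * dx

# Scalar Kalman filter reference for the same model.
P_pred = a**2 * 1.0 + q
gain = P_pred / (P_pred + r)
kf_mean, kf_var = gain * y_t, (1.0 - gain) * P_pred
```

The grid posterior moments agree with the Kalman filter values for this model. The convolutional recursion has the same structure, with the transition matrix and likelihood vector replaced by their convolutional counterparts.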

$$
\begin{aligned}
p_{c}(x_{t} \mid y_{1:t}) &:= p(x_{t} \mid y_{1:t},\, d_{x}(\mathbf{x}_{i},\bar{\mathbf{x}}_{i})\leq\mathbf{r}_{x},\, d_{y}(\mathbf{y}_{i},\bar{\mathbf{y}}_{i})\leq\mathbf{r}_{y}), \\
p_{c}(x_{t} \mid y_{1:t-1}) &:= p(x_{t} \mid y_{1:t-1},\, d_{x}(\mathbf{x}_{i},\bar{\mathbf{x}}_{i})\leq\mathbf{r}_{x},\, d_{y}(\mathbf{y}_{j},\bar{\mathbf{y}}_{j})\leq\mathbf{r}_{y}), \\
p_{c}(x_{t} \mid x_{t-1}) &:= p(x_{t} \mid x_{t-1},\, d_{x}(\mathbf{x}_{t},\bar{\mathbf{x}}_{t})\leq\mathbf{r}_{x}), \\
p_{c}(y_{t} \mid x_{t}) &:= p(y_{t} \mid x_{t},\, d_{y}(\mathbf{y}_{t},\bar{\mathbf{y}}_{t})\leq\mathbf{r}_{y}), \\
i &= 1,2,\dots,t,\; j=1,2,\dots,t-1.
\end{aligned}
\tag{12}
$$


$$
\begin{align}
p_{c}(x_{t} \mid y_{1:t-1}) &= \int p_{c}(x_{t} \mid x_{t-1})\,p_{c}(x_{t-1} \mid y_{1:t-1})\,\mathrm{d}x_{t-1}, \tag{13a} \\
p_{c}(x_{t} \mid y_{1:t}) &= \frac{p_{c}(x_{t} \mid y_{1:t-1})\,p_{c}(y_{t} \mid x_{t})}{\int p_{c}(x_{t} \mid y_{1:t-1})\,p_{c}(y_{t} \mid x_{t})\,\mathrm{d}x_{t}}. \tag{13b}
\end{align}
$$


$$
p_{c}(x_{t} \mid x_{t-1}) = \frac{\int_{\bar{x}_{t}}\left(1-F_{x}(d_{x}(x_{t},\bar{x}_{t}))\right)p(\bar{x}_{t} \mid x_{t-1})\,\mathrm{d}\bar{x}_{t}}{\int_{x_{t}}\int_{\bar{x}_{t}}\left(1-F_{x}(d_{x}(x_{t},\bar{x}_{t}))\right)p(\bar{x}_{t} \mid x_{t-1})\,\mathrm{d}\bar{x}_{t}\,\mathrm{d}x_{t}},
\tag{14a}
$$
and the convolutional output probability is
$$
p_{c}(y_{t} \mid x_{t}) = \frac{\int_{\bar{y}_{t}}\left(1-F_{y}(d_{y}(y_{t},\bar{y}_{t}))\right)p(\bar{y}_{t} \mid x_{t})\,\mathrm{d}\bar{y}_{t}}{\int_{y_{t}}\int_{\bar{y}_{t}}\left(1-F_{y}(d_{y}(y_{t},\bar{y}_{t}))\right)p(\bar{y}_{t} \mid x_{t})\,\mathrm{d}\bar{y}_{t}\,\mathrm{d}y_{t}}.
\tag{14b}
$$