Automatic Regularization for Linear MMSE Filters

Daniel Gomes de Pinho Zanco, Leszek Szczecinski, Jacob Benesty. INRS–Institut National de la Recherche Scientifique, Montreal, QC, H5A-1K6, Canada.

Abstract

In this work, we consider the problem of regularization in the design of minimum mean square error (MMSE) linear filters. Exploiting the relationship with statistical machine learning methods and adopting a Bayesian approach, we find the regularization parameter from the observed signals in a simple and automatic manner. The proposed approach is illustrated in system identification and beamforming examples, where the automatic regularization is shown to yield near-optimal results.

keywords:

MMSE filter, regularization, Bayesian approach, system identification, beamforming.

1 Introduction

Minimum mean square error (MMSE) linear filters are ubiquitous in many signal processing applications such as channel equalization [1, Ch. 5.4], system identification [2], antenna beamforming [1, Ch. 6.5], and many others.

The two main classes of MMSE filters are (i) the error minimization, where the linear filter is designed to approximate the desired signal with the smallest average squared error, and (ii) interference suppression, where the objective is to minimize the interference energy while maintaining the energy of the desired signal.

The equations solved to obtain the MMSE filters rely on the implicit or explicit inversion of the covariance matrix of the input signal. To avoid numerical problems and to guarantee the uniqueness of the solution, the equations must be regularized, as is most often done by adding a positive regularization parameter to the diagonal elements of the covariance matrix.

Determining the regularization parameter is frequently regarded as a challenge for practitioners and, depending on the signal-to-noise ratio (SNR) or the type of problem, it is often handcrafted for each specific problem. This attitude is changing and, recently, regularization has received in-depth attention in the context of system identification [3].

On the other hand, this issue is rather well known in the contexts of machine learning and regression analysis, where methods such as cross-validation [4, 5] or expectation maximization (EM) [6, 18.1.3] are often used to find parameters which are not of direct interest, but affect the solutions (known as hyperparameters).

However, despite regularization being crucial to finding MMSE filters, the signal processing literature, in general, does not use the simple and general solutions from the area of machine learning. The main reason, we believe, is that they are not offered in closed form and, in general, may require searching over the entire space of solutions and solving the regularized equations multiple times. We show that, in practice, the solution can be found very efficiently via fixed-point iteration and does not entail any significant complexity increase if we exploit the eigenvalue decomposition of the covariance matrix.

This paper is organized as follows. We start with the general problem formulation in Sec. 2 and, in Sec. 3, we reformulate it using the probabilistic framework, which allows us to apply maximum likelihood (ML) estimation to the parameters defining the model and obtain the optimal regularization parameter. Section 4 discusses automatic regularization in the interference-suppression problem. In Sec. 5, to illustrate the operation of the proposed method, we apply it to system identification (as an example of error minimization) and to beamforming (as an example of interference suppression) to show how the automatically regularized MMSE filters compare to other methods proposed in the literature and to an “oracle” solution. The latter relies on ex-ante knowledge of the best regularization parameter, which is obtained by a grid search that maximizes the performance criterion of interest; this is possible in simulations, where all the signals involved are known.

The examples indicate that the regularization parameter, which we find, automatically adjusts to the changes in operational conditions (such as the SNR) and to the problem structure.

Our main conclusion is that, by adopting the machine learning approach, the automatic regularization is so simple that it deserves to be a go-to solution in the signal processing context.

2 MMSE problem formulation

We consider the linear filtering of the input signal $\boldsymbol{x}(t)\in\mathbb{C}^{M}$ using the weights/filter $\boldsymbol{w}$ aiming at the approximation of the desired signal $d(t)\in\mathbb{C}$ . There are two categories of this problem with respect to how the filter $\boldsymbol{w}$ is found, which are described below.

•

The error-minimization problem, where we know the desired signal $d(t)$ , the filtering error is given by

	$\displaystyle e(t)$	$\displaystyle=d(t)-\boldsymbol{w}^{\mathrm{H}}\boldsymbol{x}(t),\quad t=0,1,\ldots,N-1,$		(1)

and the MMSE problem consists in solving

	$\displaystyle\hat{\boldsymbol{w}}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\boldsymbol{w}}\left\{\mathds{E}\left[\|d(t)-\boldsymbol{w}^{\mathrm{H}}\boldsymbol{x}(t)\|^{2}\right]+\alpha\|\boldsymbol{w}\|_{2}^{2}\right\}$		(2)
		$\displaystyle=(\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}\overline{\boldsymbol{r}}_{\boldsymbol{x}d},$		(3)

where $\mathds{E}[\cdot]$ denotes mathematical expectation taken with respect to all random variables, $\alpha\geq 0$ is a regularization parameter, $\|\cdot\|_{2}$ is the Euclidean norm, $\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}=\mathds{E}\left[\boldsymbol{x}(t)\boldsymbol{x}^{\mathrm{H}}(t)\right]$ , I is the identity matrix, and $\overline{\boldsymbol{r}}_{\boldsymbol{x}d}=\mathds{E}\left[\boldsymbol{x}(t)d^{*}(t)\right]$ ; we use $(\cdot)^{\mathrm{H}}$ to denote the conjugate-transpose operation and $(\cdot)^{*}$ to denote complex conjugation.

•

The interference suppression problem, where we assume that the signal $\boldsymbol{x}(t)$ has the form:

	$\displaystyle\boldsymbol{x}(t)$	$\displaystyle=d(t)\boldsymbol{a}+\boldsymbol{z}(t)\in\mathbb{C}^{M},$		(4)

with $\boldsymbol{z}(t)$ being the interference, and $\boldsymbol{a}$ the response generated by the desired signal $d(t)$ , where $\|\boldsymbol{a}\|^{2}=M$ . The goal is then to minimize the (energy of) interference in the filtered output $\boldsymbol{w}^{\mathrm{H}}\boldsymbol{x}(t)$ , i.e.,

	$\displaystyle\hat{\boldsymbol{w}}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\boldsymbol{w}}\left\{\mathds{E}\left[|\boldsymbol{w}^{\mathrm{H}}\boldsymbol{x}(t)|^{2}\right]+\alpha\|\boldsymbol{w}\|^{2}\right\}\quad{\textnormal{s.\ t.}}\quad\boldsymbol{w}^{\mathrm{H}}\boldsymbol{a}=1,$		(5)

while maintaining the energy of the desired signal, as enforced by the constraint $\boldsymbol{w}^{\mathrm{H}}\boldsymbol{a}=1$ . The problem (5) is known to be solved by [7, Sec. 2.8]

	$\displaystyle\hat{\boldsymbol{w}}$	$\displaystyle=\frac{\big(\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}}\big)^{-1}\boldsymbol{a}}{\boldsymbol{a}^{\mathrm{H}}\big(\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}}\big)^{-1}\boldsymbol{a}}.$		(6)

Numerous applications of these two problems have been presented in the literature. For example, the error minimization problem (3) is found in system identification, equalization [7, Ch. 2], interference cancellation [8, Ch. 8], and many others. The interference suppression problem (5) is popular in beamforming [9] and spectral estimation [10].

Note that (3) is a regularized version of the Wiener equation [7, Ch. 2.4] and (6) is the regularized version of the linearly-constrained minimum variance (LCMV) filter [7, Ch. 2.8]. However, in textbook formulations, the problems (2) or (5) are defined with $\alpha=0$ , i.e., without regularization. The latter is added in (3) and (6) by practitioners [7, Ch. 8.10], [2, Sec. 4], [11, Sec. 2.B] with the aim of improving the conditioning of the matrix $\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}}$ , which must be inverted, at least implicitly (the explicit inversion of the matrix in (3) may be avoided by solving the linear equations $(\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})\hat{\boldsymbol{w}}=\overline{\boldsymbol{r}}_{\boldsymbol{x}d}$ ).

The main reason why regularization is required comes from the fact that, in practice, we do not have access to $\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}$ or $\overline{\boldsymbol{r}}_{\boldsymbol{x}d}$ . Rather, they are estimated from the data using time-averaging,

	$\displaystyle\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}$	$\displaystyle\approx{\textnormal{{R}}}_{\boldsymbol{x}}=\frac{1}{N}\sum_{t=0}^{N-1}\boldsymbol{x}(t)\boldsymbol{x}^{\mathrm{H}}(t),$		(7)
	$\displaystyle\overline{\boldsymbol{r}}_{\boldsymbol{x}d}$	$\displaystyle\approx\boldsymbol{r}_{\boldsymbol{x}d}=\frac{1}{N}\sum_{t=0}^{N-1}\boldsymbol{x}(t)d^{*}(t).$		(8)

Then, the regularization term, $\alpha{\textnormal{{I}}}$ , is a practical solution to deal with imperfect estimates (7)-(8), and/or with the numerical errors involved in solving (3). The parameter $\alpha$ has to be “appropriately chosen” and will depend on all the elements of the model (1). In particular, since the importance of the estimation errors in (7)-(8) decreases with $N$ , we expect that the value of $\alpha$ also decreases with $N$ .
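The estimates (7)-(8) and the empirical counterpart of (3) can be sketched numerically as follows. This is a minimal illustration in Python/NumPy rather than a reference implementation: the function names `empirical_moments` and `regularized_wiener`, the synthetic data, and the value of `alpha` are assumptions made only for this example. In line with the remark on implicit inversion, the linear system is solved instead of forming the inverse.

```python
import numpy as np

def empirical_moments(X, d):
    """X is (M, N) with columns x(t); d is (N,) with entries d(t)."""
    N = X.shape[1]
    R = X @ X.conj().T / N      # (7): (1/N) sum_t x(t) x(t)^H
    r = X @ d.conj() / N        # (8): (1/N) sum_t x(t) d(t)^*
    return R, r

def regularized_wiener(R, r, alpha):
    """Empirical counterpart of (3): solve (R + alpha I) w = r."""
    return np.linalg.solve(R + alpha * np.eye(R.shape[0]), r)

# Synthetic data following (1); every value below is illustrative.
rng = np.random.default_rng(0)
M, N, alpha = 4, 50, 1e-2
w_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
e = 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
d = w_true.conj() @ X + e       # d(t) = w^H x(t) + e(t)

R, r = empirical_moments(X, d)
w_hat = regularized_wiener(R, r, alpha)
```

Here `alpha` is fixed by hand, which is exactly the choice that the rest of the paper seeks to automate.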

2.1 Known regularization solutions in signal processing

Because regularization is an important practical element in the definition of linear filters, the problem has been addressed in the literature, particularly in the context of the minimum variance distortionless response (MVDR) formulation; the two most representative solutions are described below.

2.1.1 Ledoit-Wolf matrix shrinkage

The Ledoit-Wolf matrix shrinkage method [12] assumes the following relationship between the true and empirical covariance matrix

	$\displaystyle\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}$	$\displaystyle\approx\beta{\textnormal{{R}}}_{\boldsymbol{x}}+\eta{\textnormal{{I}}},$		(9)

and, by minimizing the squared Frobenius norm of the approximation error:

	$\displaystyle\hat{\eta},\hat{\beta}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\eta,\beta}\mathds{E}\left[\|\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}-\beta{\textnormal{{R}}}_{\boldsymbol{x}}-\eta{\textnormal{{I}}}\|^{2}_{\mathrm{F}}\right],$		(10)

finds the shrinkage parameters as [11, Eqs. (32)-(33)]

	$\displaystyle\hat{\eta}$	$\displaystyle=\min\left[1,\frac{\hat{\rho}}{\|{\textnormal{{R}}}_{\boldsymbol{x}}-\hat{\nu}{\textnormal{{I}}}\|^{2}_{\mathrm{F}}}\right]\hat{\nu},$		(11)
	$\displaystyle\hat{\beta}$	$\displaystyle=1-\frac{\hat{\eta}}{\hat{\nu}},$		(12)

where

	$\displaystyle\hat{\rho}$	$\displaystyle=\frac{1}{N^{2}}\sum_{t=0}^{N-1}\|\boldsymbol{x}(t)\|^{4}-\frac{1}{N}\|{\textnormal{{R}}}_{\boldsymbol{x}}\|^{2}_{\mathrm{F}},$		(13)
	$\displaystyle\hat{\nu}$	$\displaystyle=\frac{1}{M}\textrm{Tr}({\textnormal{{R}}}_{\boldsymbol{x}}),$		(14)

with $\textrm{Tr}(\cdot)$ being the trace of a square matrix.

By factoring out $\hat{\beta}$ in (9), the shrinkage parameters can then be converted back into a regularization parameter:

	$\displaystyle\alpha_{\textrm{LW}}$	$\displaystyle=\frac{\hat{\eta}}{\hat{\beta}}.$		(15)

This method has been used to find regularization in the interference suppression problem (6), e.g., in [11]. On the other hand, we are not aware of its application to the error-minimization problem (3), most likely because the latter depends not only on the noisy covariance matrix ${\textnormal{{R}}}_{\boldsymbol{x}}$ but, also, explicitly requires a noisy cross-correlation vector $\boldsymbol{r}_{\boldsymbol{x}d}$ .

In that regard, the interference suppression problem uses noisy ${\textnormal{{R}}}_{\boldsymbol{x}}$ and error-free $\boldsymbol{a}$ , and, therefore, appears to be affected only by errors in the former. As we will see, such an interpretation is misleading, and the regularization depends also on $\boldsymbol{a}$ . (Note that, in some works, e.g., [9], the problem is formulated assuming that $\boldsymbol{a}$ is also corrupted by errors. We do not use such a model, as assuming that $\boldsymbol{a}$ is perfectly known allows us to emphasize the fact that the regularization depends not only on the noise but also on the deterministic elements of the model.)
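As an illustration of (11)-(15), the following sketch computes the shrinkage parameters and the resulting $\alpha_{\textrm{LW}}$ from a data matrix. The function name `ledoit_wolf_alpha`, the synthetic data with unequal per-sensor scales, and the convention of returning an infinite value when $\hat{\beta}=0$ are assumptions of this example and are not taken from [11] or [12].

```python
import numpy as np

def ledoit_wolf_alpha(X):
    """Return (alpha_LW, eta, beta) following (11)-(15); X is (M, N), columns x(t)."""
    M, N = X.shape
    R = X @ X.conj().T / N                                    # (7)
    nu = np.trace(R).real / M                                 # (14)
    norms4 = np.sum(np.abs(X) ** 2, axis=0) ** 2              # ||x(t)||^4 for each t
    rho = norms4.sum() / N**2 - np.linalg.norm(R, "fro") ** 2 / N   # (13)
    dist = np.linalg.norm(R - nu * np.eye(M), "fro") ** 2
    eta = min(1.0, rho / dist) * nu                           # (11)
    beta = 1.0 - eta / nu                                     # (12)
    # Assumption of this sketch: full shrinkage (beta = 0) is mapped to alpha = inf.
    alpha = eta / beta if beta > 0 else np.inf                # (15)
    return alpha, eta, beta

# Illustrative data: independent sensors with clearly different powers.
rng = np.random.default_rng(1)
M, N = 4, 100
scales = np.array([3.0, 2.0, 1.0, 0.5])[:, None]
X = scales * (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N)))
R = X @ X.conj().T / N
alpha_lw, eta, beta = ledoit_wolf_alpha(X)
```

By construction of (12), the shrunk matrix $\hat{\beta}{\textnormal{{R}}}_{\boldsymbol{x}}+\hat{\eta}{\textnormal{{I}}}$ has the same trace as ${\textnormal{{R}}}_{\boldsymbol{x}}$ , which offers a quick sanity check of an implementation.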

2.1.2 Hoerl, Kennard, and Baldwin regularization

Some regularization strategies are derived by exploiting the fact that the Wiener equations (3) can be obtained from the regularized ordinary least squares (OLS) problem:

	$\displaystyle\hat{\boldsymbol{w}}(\alpha)$	$\displaystyle=\mathop{\mathrm{argmin}}_{\boldsymbol{w}}\left[\frac{1}{N}\|\boldsymbol{d}-{\textnormal{{X}}}^{\mathrm{H}}\boldsymbol{w}\|^{2}+\alpha\|\boldsymbol{w}\|^{2}\right],$		(16)

where $\boldsymbol{d}=[d^{*}(0),\ldots,d^{*}(N-1)]^{\mathsf{T}}$ and ${\textnormal{{X}}}=[\boldsymbol{x}(0),\ldots,\boldsymbol{x}(N-1)]\in\mathbb{C}^{M\times N}$ , so that the $t$ -th row of ${\textnormal{{X}}}^{\mathrm{H}}$ is $\boldsymbol{x}^{\mathrm{H}}(t)$ ; $(\cdot)^{\mathsf{T}}$ is the transposition operator.

The method proposed by Hoerl, Kennard, and Baldwin (HKB) in [13, Eq. (2.2)], finds the regularization in two steps. First, (16) is solved for $\alpha=0$ and, next, the regularization parameter is calculated as

	$\displaystyle\alpha_{\textrm{HKB}}$	$\displaystyle=\frac{\tilde{\sigma}_{e}^{2}(0)}{N\tilde{\sigma}_{w}^{2}(0)},$		(17)

where

	$\displaystyle\tilde{\sigma}_{e}^{2}(\alpha)$	$\displaystyle=\frac{1}{N}\|\boldsymbol{d}-{\textnormal{{X}}}^{\mathrm{H}}\hat{\boldsymbol{w}}(\alpha)\|^{2},$		(18)
	$\displaystyle\tilde{\sigma}_{w}^{2}(\alpha)$	$\displaystyle=\frac{1}{\gamma}\|\hat{\boldsymbol{w}}(\alpha)\|^{2},$		(19)

and $\gamma\in(0,M]$ is the number of degrees of freedom of the solution. In the error-minimization problem, we set $\gamma=M$ , while in the interference suppression problem, due to a linear constraint on $\boldsymbol{w}$ , we set $\gamma=M-1$ .

The HKB regularization was studied in the beamforming context [11], but we immediately see that it cannot be applied for $N<M$ . This is because then the rank of X is smaller than $M$ , so there is an infinite number of $\hat{\boldsymbol{w}}(0)$ that solve (16), each of which yields $\tilde{\sigma}_{e}^{2}(0)=0$ . Thus, for $N<M$ , (17) produces $\alpha_{\textrm{HKB}}=0$ .
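The two-step HKB recipe (16)-(19), and its degeneracy for $N<M$ , can be sketched as follows. The function name `hkb_alpha`, the use of a minimum-norm least-squares solver for the unregularized step, and the synthetic data are assumptions of this illustration.

```python
import numpy as np

def hkb_alpha(X, d, gamma=None):
    """HKB regularization (17); X is (M, N) with columns x(t), d is (N,)."""
    M, N = X.shape
    Xh = X.conj().T                       # rows are x(t)^H
    dv = d.conj()                         # the vector d used in (16)
    # Step 1: solve (16) with alpha = 0 (minimum-norm solution when N < M).
    w0 = np.linalg.lstsq(Xh, dv, rcond=None)[0]
    gamma = M if gamma is None else gamma
    # Step 2: plug the residual and weight energies into (17).
    sigma_e2 = np.linalg.norm(dv - Xh @ w0) ** 2 / N      # (18)
    sigma_w2 = np.linalg.norm(w0) ** 2 / gamma            # (19)
    return sigma_e2 / (N * sigma_w2)

def toy_data(M, N, rng):
    """Hypothetical data generator following (1)."""
    w = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
    e = 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
    return X, w.conj() @ X + e

rng = np.random.default_rng(2)
alpha_tall = hkb_alpha(*toy_data(4, 50, rng))    # N > M: strictly positive
alpha_short = hkb_alpha(*toy_data(4, 3, rng))    # N < M: residual vanishes
```

In the second call the unregularized fit is exact up to rounding, so the returned value is numerically zero, which is the failure mode described above.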

3 Bayesian formulation and inference of regularization parameter

To obtain the Bayesian formulation of the problem, we rewrite (1) in vector form:

	$\displaystyle\boldsymbol{d}$	$\displaystyle={\textnormal{{X}}}^{\mathrm{H}}\boldsymbol{w}+\boldsymbol{e},$		(20)

where $\boldsymbol{d}$ , X are already defined in (16), and $\boldsymbol{e}=[e^{*}(0),\ldots,e^{*}(N-1)]^{\mathsf{T}}$ .

Assuming that $e(t)$ are independent, identically distributed (i.i.d.) zero-mean Gaussian variables with variance $v_{e}$ , we have

	$\displaystyle f(\boldsymbol{d}|{\textnormal{{X}}},\boldsymbol{w})$	$\displaystyle=\mathcal{N}(\boldsymbol{d};{\textnormal{{X}}}^{\mathrm{H}}\boldsymbol{w},v_{e}{\textnormal{{I}}}),$		(21)

where

	$\displaystyle\mathcal{N}(\boldsymbol{y};\boldsymbol{m},{\textnormal{{V}}})$	$\displaystyle=\frac{1}{\textrm{det}(\pi{\textnormal{{V}}})}\exp\left[-(\boldsymbol{y}-\boldsymbol{m})^{\mathrm{H}}{\textnormal{{V}}}^{-1}(\boldsymbol{y}-\boldsymbol{m})\right]$		(22)

denotes the probability density function (PDF) of a circular, complex Gaussian with mean $\boldsymbol{m}$ and covariance matrix V.

The Bayesian approach models the parameter $\boldsymbol{w}$ as a random vector with posterior distribution given by

	$\displaystyle f(\boldsymbol{w}|\boldsymbol{d},{\textnormal{{X}}})$	$\displaystyle\propto f(\boldsymbol{d}|{\textnormal{{X}}},\boldsymbol{w})f(\boldsymbol{w}).$		(23)

Then, assuming the elements of $\boldsymbol{w}$ to be i.i.d. zero-mean Gaussian random variables with variance $v_{w}$ , i.e.,

	$\displaystyle f(\boldsymbol{w})$	$\displaystyle=\mathcal{N}(\boldsymbol{w};\boldsymbol{0},v_{w}{\textnormal{{I}}}),$		(24)

it is simple to see that, using (21) and (24), the posterior distribution (23) is given by

	$\displaystyle f(\boldsymbol{w}|\boldsymbol{d},{\textnormal{{X}}})$	$\displaystyle=\mathcal{N}(\boldsymbol{w};\hat{\boldsymbol{w}},{\textnormal{{R}}}_{\boldsymbol{w}}),$		(25)

where

$\displaystyle{\textnormal{{R}}}_{\boldsymbol{w}}$	$\displaystyle=\frac{v_{e}}{N}({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1},$	(26)
$\displaystyle\hat{\boldsymbol{w}}$	$\displaystyle=({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}\boldsymbol{r}_{\boldsymbol{x}d},$	(27)
$\displaystyle\alpha$	$\displaystyle=\frac{v_{e}}{Nv_{w}}.$	(28)

Of course, (27) being the mean of the posterior, it is also the maximum a posteriori (MAP) estimate, i.e., $\hat{\boldsymbol{w}}=\mathop{\mathrm{argmax}}_{\boldsymbol{w}}f(\boldsymbol{w}|\boldsymbol{d},{\textnormal{{X}}})$ , and it coincides with the solution of the Wiener equation (3) obtained from the empirical moments given in (7)-(8).

This modeling approach is well-known in signal processing textbooks. For example, [3, Ch. 4] or [1, Part VII - Summary and Notes] note the equivalence between the MAP estimation of $\boldsymbol{w}$ and the Wiener (least-squares) solution. On the other hand, the signal processing literature does not exploit this model to its full extent and does not find the parameters $\boldsymbol{v}=[v_{w},v_{e}]$ even if it would give us the immediate advantage of defining the regularization parameter $\alpha$ via (28). An additional advantage is that, knowing $\boldsymbol{v}$ , we can find the posterior variance ${\textnormal{{R}}}_{\boldsymbol{w}}$ which allows us to assess the uncertainty of the estimation: remember, the diagonal elements of ${\textnormal{{R}}}_{\boldsymbol{w}}$ are the posterior variances of the estimates $\hat{\boldsymbol{w}}$ .
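The posterior moments (26)-(28) can be checked against the standard Gaussian conjugacy formulas written directly in terms of ${\textnormal{{X}}}$ , $v_{e}$ , and $v_{w}$ . The sketch below does this on random data; the variable names and the chosen values of $v_{e}$ and $v_{w}$ are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, v_e, v_w = 3, 20, 0.5, 2.0
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
d = rng.standard_normal(N) + 1j * rng.standard_normal(N)

R = X @ X.conj().T / N                      # (7)
r = X @ d.conj() / N                        # (8)
alpha = v_e / (N * v_w)                     # (28)
B = R + alpha * np.eye(M)
R_w = (v_e / N) * np.linalg.inv(B)          # (26), posterior covariance
w_hat = np.linalg.solve(B, r)               # (27), posterior mean / MAP

# Same quantities from the likelihood (21) and the prior (24):
# posterior precision = X X^H / v_e + I / v_w.
R_w_direct = np.linalg.inv(X @ X.conj().T / v_e + np.eye(M) / v_w)
w_direct = R_w_direct @ (X @ d.conj()) / v_e

# Posterior standard deviation of each filter coefficient.
post_std = np.sqrt(np.diag(R_w).real)
```

The vector `post_std` is the per-coefficient uncertainty mentioned above; it is available only once $v_{e}$ and $v_{w}$ are known or estimated.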

3.1 Inference

We will infer the parameters $\boldsymbol{v}$ using the ML approach:

	$\displaystyle\alpha_{{\textnormal{ML}}},\hat{v}_{e}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\alpha,v_{e}}L(\alpha,v_{e}),$		(29)
	$\displaystyle L(\alpha,v_{e})$	$\displaystyle=-\log f(\boldsymbol{d}|{\textnormal{{X}}},\boldsymbol{v}),$		(30)

where, instead of $v_{w}$ and $v_{e}$ , we parameterized the variables using $\alpha=v_{e}/(Nv_{w})$ , which does not affect the optimality of the ML solution and focuses directly on the regularization parameter $\alpha$ we are interested in (of course, the ML estimates $\hat{v}_{e}$ and $\hat{v}_{w}$ can be obtained as well).

We marginalize over $\boldsymbol{w}$ to obtain

	$\displaystyle f(\boldsymbol{d}|{\textnormal{{X}}},\boldsymbol{v})$	$\displaystyle=\int f(\boldsymbol{d}|{\textnormal{{X}}},\boldsymbol{w},\boldsymbol{v})f(\boldsymbol{w}|\boldsymbol{v})\,\mathrm{d}\boldsymbol{w},$		(31)

with the distributions under integration being those shown in (21) and (24); the conditioning on $\boldsymbol{v}$ merely makes explicit their dependence on the parameters $\boldsymbol{v}$ . Since all the variables are Gaussian, it is rather easy to show that

	$\displaystyle f(\boldsymbol{d}|{\textnormal{{X}}},\boldsymbol{v})$	$\displaystyle=\frac{\textrm{det}[({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}]\alpha^{M}}{\pi^{N}v_{e}^{N}}\exp\left[\frac{N}{v_{e}}\big(\boldsymbol{r}_{\boldsymbol{x}d}^{\mathrm{H}}\hat{\boldsymbol{w}}-\tilde{\sigma}_{d}^{2}\big)\right],$		(32)

where $\tilde{\sigma}_{d}^{2}=\|\boldsymbol{d}\|^{2}/N$ is the estimate of the second moment of $d(t)$ , and, from (3), $\boldsymbol{r}_{\boldsymbol{x}d}^{\mathrm{H}}\hat{\boldsymbol{w}}$ is real.

Thus,

	$\displaystyle L(\alpha,v_{e})$	$\displaystyle=-\log\textrm{det}\left[({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}\right]-M\log\alpha+N\log v_{e}+\frac{N}{v_{e}}(\tilde{\sigma}_{d}^{2}-\boldsymbol{r}_{\boldsymbol{x}d}^{\mathrm{H}}\hat{\boldsymbol{w}})+N\log\pi,$		(33)

which, for a given $\alpha$ , is uniquely minimized by $\hat{v}_{e}$ satisfying

	$\displaystyle\frac{\mathrm{d}}{\mathrm{d}v_{e}}L(\alpha,\hat{v}_{e})$	$\displaystyle=\frac{N}{\hat{v}_{e}}\left[1-\frac{1}{\hat{v}_{e}}(\tilde{\sigma}_{d}^{2}-\boldsymbol{r}_{\boldsymbol{x}d}^{\mathrm{H}}\hat{\boldsymbol{w}})\right]=0,$		(34)
	$\displaystyle\hat{v}_{e}$	$\displaystyle=\tilde{\sigma}_{d}^{2}-\boldsymbol{r}_{\boldsymbol{x}d}^{\mathrm{H}}\hat{\boldsymbol{w}}.$		(35)

Then, (29) is reduced to

$\displaystyle\alpha_{{\textnormal{ML}}}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\alpha}L(\alpha),$	(36)
$\displaystyle L(\alpha)$	$\displaystyle=L(\alpha,\hat{v}_{e})$	(37)
	$\displaystyle=-\log\textrm{det}\left[({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}\right]-M\log\alpha+N\log(\tilde{\sigma}_{d}^{2}-\boldsymbol{r}_{\boldsymbol{x}d}^{\mathrm{H}}\hat{\boldsymbol{w}})+{\textnormal{Const.}}$	(38)
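The closed form (33) can be cross-checked against a direct evaluation of the marginal density: under (21) and (24), $\boldsymbol{d}$ is a zero-mean circular Gaussian vector with covariance $v_{e}{\textnormal{{I}}}+v_{w}{\textnormal{{X}}}^{\mathrm{H}}{\textnormal{{X}}}$ . The following sketch compares the two expressions and evaluates (35); the function names `nll_closed_form` and `nll_direct` and all numerical values are assumptions of this example.

```python
import numpy as np

def nll_closed_form(R, r, sigma_d2, N, alpha, v_e):
    """Negative log-likelihood L(alpha, v_e) as written in (33)."""
    M = R.shape[0]
    B = R + alpha * np.eye(M)
    w_hat = np.linalg.solve(B, r)                          # (27)
    quad = sigma_d2 - np.real(r.conj() @ w_hat)
    logdet_B = np.linalg.slogdet(B)[1]                     # = -log det[(R + alpha I)^{-1}]
    return (logdet_B - M * np.log(alpha) + N * np.log(v_e)
            + N / v_e * quad + N * np.log(np.pi))

def nll_direct(X, d, alpha, v_e):
    """-log of the zero-mean complex Gaussian pdf (22) of the vector d."""
    M, N = X.shape
    v_w = v_e / (N * alpha)                                # inverts (28)
    C = v_e * np.eye(N) + v_w * (X.conj().T @ X)
    dv = d.conj()
    quad = np.real(dv.conj() @ np.linalg.solve(C, dv))
    return N * np.log(np.pi) + np.linalg.slogdet(C)[1] + quad

rng = np.random.default_rng(4)
M, N, alpha, v_e = 3, 12, 0.3, 0.7
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
d = rng.standard_normal(N) + 1j * rng.standard_normal(N)
R = X @ X.conj().T / N
r = X @ d.conj() / N
sigma_d2 = np.linalg.norm(d) ** 2 / N

closed = nll_closed_form(R, r, sigma_d2, N, alpha, v_e)
direct = nll_direct(X, d, alpha, v_e)

w_hat = np.linalg.solve(R + alpha * np.eye(M), r)
v_e_hat = sigma_d2 - np.real(r.conj() @ w_hat)             # (35)
```

For a fixed `alpha`, evaluating `nll_closed_form` at `v_e_hat` and at nearby values gives a quick numerical confirmation that (35) is the minimizer.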

Using the eigenvalue decomposition, ${\textnormal{{R}}}_{\boldsymbol{x}}={\textnormal{{Q}}}{\textnormal{diag}}(\boldsymbol{\lambda}){\textnormal{{Q}}}^{\mathrm{H}}$ , where ${\textnormal{diag}}(\boldsymbol{\lambda})$ is a diagonal matrix with diagonal elements taken from the vector $\boldsymbol{\lambda}=[\lambda_{1},\ldots,\lambda_{M}]^{\mathsf{T}}$ , $\lambda_{m}$ being the eigenvalues of ${\textnormal{{R}}}_{\boldsymbol{x}}$ , and the columns of ${\textnormal{{Q}}}\in\mathbb{C}^{M\times M}$ are the corresponding eigenvectors, we obtain

	$\displaystyle\hat{\boldsymbol{w}}(\alpha)$	$\displaystyle={\textnormal{{Q}}}{\textnormal{diag}}^{-1}(\boldsymbol{\lambda}+\alpha)\boldsymbol{z}_{\boldsymbol{x}d},$		(39)
	$\displaystyle\boldsymbol{z}_{\boldsymbol{x}d}$	$\displaystyle={\textnormal{{Q}}}^{\mathrm{H}}\boldsymbol{r}_{\boldsymbol{x}d}=[z_{\boldsymbol{x}d,1},\ldots,z_{\boldsymbol{x}d,M}]^{\mathsf{T}},$		(40)

so (38) may be written as

	$\displaystyle L(\alpha)$	$\displaystyle=\sum_{m=1}^{M}\log\frac{\alpha+\lambda_{m}}{\alpha}+N\log\left(\tilde{\sigma}_{d}^{2}-\sum_{m=1}^{M}\frac{|z_{\boldsymbol{x}d,m}|^{2}}{\alpha+\lambda_{m}}\right)+{\textnormal{Const}},$		(41)

and, now, we easily find its derivative:

	$\displaystyle L^{\prime}(\alpha)$	$\displaystyle=N\frac{f(\alpha)}{g(\alpha)}-\frac{\gamma(\alpha)}{\alpha},$		(42)

where

	$\displaystyle f(\alpha)$	$\displaystyle=\sum_{m=1}^{M}\frac{|z_{\boldsymbol{x}d,m}|^{2}}{(\lambda_{m}+\alpha)^{2}}=\gamma(\alpha)\tilde{\sigma}_{w}^{2}(\alpha),$		(43)
	$\displaystyle g(\alpha)$	$\displaystyle=\tilde{\sigma}_{d}^{2}-\boldsymbol{r}_{\boldsymbol{x}d}^{\mathrm{H}}\hat{\boldsymbol{w}}(\alpha)=\tilde{\sigma}_{e}^{2}(\alpha)+\alpha\gamma(\alpha)\tilde{\sigma}_{w}^{2}(\alpha),$		(44)

in which $\tilde{\sigma}_{e}^{2}(\alpha)$ and $\tilde{\sigma}_{w}^{2}(\alpha)$ are given in (18) and (19), respectively, and the latter uses

	$\displaystyle\gamma\equiv\gamma(\alpha)$	$\displaystyle=\sum_{m=1}^{M}\frac{\lambda_{m}}{\lambda_{m}+\alpha},$		(45)

also known as the effective number of parameters [14, Sec. 7.6]. Note that $\gamma(\alpha)\in[0,M]$ and, for $\alpha=0$ , if no eigenvalues are zero, we can use $\gamma=\gamma(\alpha)=M$ , as we did in (19).
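The spectral expressions (41)-(45) lend themselves to a compact implementation in which the eigenvalue decomposition is computed once and every evaluation of $L(\alpha)$ or $L^{\prime}(\alpha)$ costs $O(M)$ operations. The sketch below is one possible arrangement; the function names, the synthetic data, and the test value of $\alpha$ are assumptions of this example, and squared magnitudes are used for the entries of $\boldsymbol{z}_{\boldsymbol{x}d}$ because the data are complex.

```python
import numpy as np

def spectrum(R, r):
    """Eigenvalues of R and squared magnitudes of z = Q^H r, see (40)."""
    lam, Q = np.linalg.eigh(R)
    return lam, np.abs(Q.conj().T @ r) ** 2

def f_g_gamma(lam, z2, sigma_d2, alpha):
    f = np.sum(z2 / (lam + alpha) ** 2)           # (43)
    g = sigma_d2 - np.sum(z2 / (lam + alpha))     # (44)
    gamma = np.sum(lam / (lam + alpha))           # (45)
    return f, g, gamma

def L(lam, z2, sigma_d2, N, alpha):
    """Objective (41) without the additive constant."""
    _, g, _ = f_g_gamma(lam, z2, sigma_d2, alpha)
    return np.sum(np.log((lam + alpha) / alpha)) + N * np.log(g)

def L_prime(lam, z2, sigma_d2, N, alpha):
    """Derivative (42)."""
    f, g, gamma = f_g_gamma(lam, z2, sigma_d2, alpha)
    return N * f / g - gamma / alpha

# Illustrative data following (1).
rng = np.random.default_rng(5)
M, N, alpha = 5, 100, 0.05
w_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
d = w_true.conj() @ X + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
R, r = X @ X.conj().T / N, X @ d.conj() / N
sigma_d2 = np.linalg.norm(d) ** 2 / N

lam, z2 = spectrum(R, r)
f, g, gamma = f_g_gamma(lam, z2, sigma_d2, alpha)

# Quantities computed directly from the data, for comparison with (43)-(44).
w_hat = np.linalg.solve(R + alpha * np.eye(M), r)
w_energy = np.linalg.norm(w_hat) ** 2
sigma_e2 = np.linalg.norm(d.conj() - X.conj().T @ w_hat) ** 2 / N
```

Comparing `f` with `w_energy`, and `g` with `sigma_e2 + alpha * w_energy`, verifies the second equalities in (43) and (44) numerically.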

As already noted in [15], solving $L^{\prime}(\alpha)=0$ amounts to finding the real roots of the polynomial of degree not larger than $2M-1$ , whose properties are described in the following:

Proposition 1 (Roots of $L^{\prime}(\alpha)$ )

1.

$\lim_{\alpha\rightarrow\infty}L^{\prime}(\alpha)=0$ , i.e., $\alpha=\infty$ is a root of $L^{\prime}(\alpha)$ .

2.

The odd-numbered roots (the first, the third, etc.) of $L^{\prime}(\alpha)$ are minima of $L(\alpha)$ .

3.

$L^{\prime}(\alpha)$ has an even number of roots if and only if

	$\displaystyle N\|\boldsymbol{r}_{\boldsymbol{x}d}\|^{2}>\tilde{\sigma}_{d}^{2}{\textnormal{Tr}}({\textnormal{{R}}}_{\boldsymbol{x}}).$		(46)

Proof: See Appendix A.

Some comments are in order.

•

We should appreciate the possibility of absence of finite roots of $L^{\prime}(\alpha)$ . Note that, if the only root is $\alpha=\infty$ (we consider only the real roots, which are the meaningful solutions), then it is also the first root, which means that $L(\alpha)$ is minimized for $\alpha=\infty$ , in which case $\hat{\boldsymbol{w}}(\alpha)=\boldsymbol{0}$ . The fact that such a solution may be optimal is not at all obvious when formulating the filtering problem. As we will see empirically, it is indeed the case in some scenarios.
•

Since $\tilde{\sigma}_{d}^{2}$ , $\boldsymbol{r}_{\boldsymbol{x}d}$ , and ${\textnormal{{R}}}_{\boldsymbol{x}}$ are empirical means, which, for large $N$ , tend to their corresponding expected values, (46) is likely to be satisfied for sufficiently large $N$ , because the factor $N$ then dominates the left-hand side (l.h.s.) of (46). In other words, by increasing $N$ , we obtain an even number of roots; then $\alpha=\infty$ is a local maximum of $L(\alpha)$ and thus $\alpha_{{\textnormal{ML}}}$ is finite.

Finding the roots may be done exploiting the polynomial structure of $L^{\prime}(\alpha)$ but, in practice, this is feasible only for moderate $M$ , e.g., in MVDR receivers applied in arrays composed of dozens of antennas. For large $M$ , e.g., $M>100$ , typical in system identification and/or equalization, the roots may be found, e.g., via grid search [15]. However, not all of these methods are very practical, which may explain why they did not receive much attention in the literature – in fact, they were not reused as a go-to-solution by the authors of [15], e.g., in [11].

Our goal is thus to propose a simple approach to solve $L^{\prime}(\alpha)=0$ , which, after reorganizing (42), is equivalent to solving

	$\displaystyle\alpha$	$\displaystyle=\gamma(\alpha)\frac{g(\alpha)}{Nf(\alpha)},$		(47)

which we do via a fixed-point iteration:

$\displaystyle\alpha^{(i+1)}$	$\displaystyle=\gamma\big(\alpha^{(i)}\big)\frac{g\big(\alpha^{(i)}\big)}{Nf\big(\alpha^{(i)}\big)}$	(48)
	$\displaystyle=\frac{\tilde{\sigma}_{e}^{2}\big(\alpha^{(i)}\big)}{N\tilde{\sigma}_{w}^{2}\big(\alpha^{(i)}\big)}+\frac{\alpha^{(i)}}{N}\gamma\big(\alpha^{(i)}\big),\quad i=0,\ldots,I-1,$	(49)
$\displaystyle\alpha_{{\textnormal{ML}}}$	$\displaystyle=\alpha^{(I)},$	(50)

where $I$ is a predefined number of iterations, and initialization $\alpha^{(0)}>0$ must be defined.
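A minimal sketch of the fixed-point iteration (48), using the spectral quantities (43)-(45), is given below. The function name `fixed_point_alpha`, the default initialization, the number of iterations, and the synthetic data are assumptions of this example; since convergence is not proven, the sketch simply runs a fixed number of iterations and leaves any stopping rule to the user.

```python
import numpy as np

def fixed_point_alpha(R, r, sigma_d2, N, alpha0=1.0, n_iter=50):
    """Iterate alpha <- gamma(alpha) g(alpha) / (N f(alpha)), see (48)."""
    lam, Q = np.linalg.eigh(R)                       # computed once
    z2 = np.abs(Q.conj().T @ r) ** 2
    alpha = alpha0
    for _ in range(n_iter):
        f = np.sum(z2 / (lam + alpha) ** 2)          # (43)
        g = sigma_d2 - np.sum(z2 / (lam + alpha))    # (44)
        gamma = np.sum(lam / (lam + alpha))          # (45)
        alpha = gamma * g / (N * f)
    return alpha

# Illustrative, well-conditioned data following (1) with N much larger than M.
rng = np.random.default_rng(6)
M, N = 5, 200
w_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
d = w_true.conj() @ X + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
R, r = X @ X.conj().T / N, X @ d.conj() / N
sigma_d2 = np.linalg.norm(d) ** 2 / N

# Condition (46): a finite minimum of L(alpha) is expected when it holds.
condition_46 = N * np.linalg.norm(r) ** 2 > sigma_d2 * np.trace(R).real
alpha_ml = fixed_point_alpha(R, r, sigma_d2, N)
w_ml = np.linalg.solve(R + alpha_ml * np.eye(M), r)
```

Each iteration costs $O(M)$ operations once the eigenvalue decomposition is available, so the overall cost is dominated by the decomposition itself.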

Note that:

•

The convergence of the fixed-point iteration (48) is not proven, but, in numerical examples, it always converged to a minimum of $L(\alpha)$ when (46) was satisfied (i.e., when there are finite minima of $L(\alpha)$ ).

•

With the initialization $\alpha^{(0)}=0$ , the first iteration of (49) yields

	$\displaystyle\alpha^{(1)}$	$\displaystyle=\frac{\tilde{\sigma}_{e}^{2}(0)}{N\tilde{\sigma}_{w}^{2}(0)},$		(51)

which is exactly the HKB method shown in (17). We can thus say that our solution generalizes the HKB method, enhancing it with an iterative refinement, and removing the initialization with a non-regularized solution, i.e., $\alpha^{(0)}=0$ , which may be problematic in general, since it cannot be solved meaningfully for $N<M$ .

•

The fixed-point iteration (49) is not a unique way to solve the problem iteratively. For example, using (44) in (47), and isolating $\alpha$ , we obtain a new fixed-point equation:

	$\displaystyle\alpha^{(i+1)}$	$\displaystyle=\frac{\tilde{\sigma}_{e}^{2}\big(\alpha^{(i)}\big)}{\big[N-\gamma\big(\alpha^{(i)}\big)\big]\tilde{\sigma}_{w}^{2}\big(\alpha^{(i)}\big)},$		(52)

which is known as Gull-MacKay iteration [6, Ch. 18.1.4]; see [16, v1 App. A] for an alternative derivation.

Our experience shows that Gull-MacKay converges faster than (48). However, it should be applied with care for $N<M$ , because in this case, we do not have a guarantee that $N-\gamma(\alpha)$ is positive [as seen in (45)].

•

The iterative solutions (49) and (52) use $\tilde{\sigma}_{w}^{2}(\alpha)$ and $\tilde{\sigma}_{e}^{2}(\alpha)$ which may be calculated from the eigenvalue decomposition shown in (41)-(45); this reduces the complexity significantly.
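The Gull-MacKay iteration (52) can be written entirely in terms of the spectral quantities, using $\|\hat{\boldsymbol{w}}(\alpha)\|^{2}=f(\alpha)$ from (43) and $\tilde{\sigma}_{e}^{2}(\alpha)=g(\alpha)-\alpha f(\alpha)$ from (44). The sketch below follows this route; the function names, the initialization, the iteration count, and the synthetic data are assumptions of this example, and no safeguard is included for the case $N-\gamma(\alpha)\leq 0$ .

```python
import numpy as np

def spectral_terms(lam, z2, sigma_d2, alpha):
    f = np.sum(z2 / (lam + alpha) ** 2)            # (43), equals ||w_hat(alpha)||^2
    g = sigma_d2 - np.sum(z2 / (lam + alpha))      # (44)
    gamma = np.sum(lam / (lam + alpha))            # (45)
    return f, g, gamma

def gull_mackay_alpha(R, r, sigma_d2, N, alpha0=1.0, n_iter=50):
    """Iterate (52) with sigma_e^2 = g - alpha f and sigma_w^2 = f / gamma."""
    lam, Q = np.linalg.eigh(R)
    z2 = np.abs(Q.conj().T @ r) ** 2
    alpha = alpha0
    for _ in range(n_iter):
        f, g, gamma = spectral_terms(lam, z2, sigma_d2, alpha)
        sigma_e2 = g - alpha * f
        alpha = gamma * sigma_e2 / ((N - gamma) * f)
    return alpha, lam, z2

# Illustrative data following (1).
rng = np.random.default_rng(7)
M, N = 5, 200
w_true = rng.standard_normal(M) + 1j * rng.standard_normal(M)
X = rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))
d = w_true.conj() @ X + 0.1 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
R, r = X @ X.conj().T / N, X @ d.conj() / N
sigma_d2 = np.linalg.norm(d) ** 2 / N

alpha_gm, lam, z2 = gull_mackay_alpha(R, r, sigma_d2, N)
f, g, gamma = spectral_terms(lam, z2, sigma_d2, alpha_gm)

# Residual energy (18) computed directly from the data, for comparison.
w_hat = np.linalg.solve(R + alpha_gm * np.eye(M), r)
sigma_e2_direct = np.linalg.norm(d.conj() - X.conj().T @ w_hat) ** 2 / N
```

At a fixed point of (52), the value `alpha_gm` also satisfies (47), because (52) is an algebraic rearrangement of the same equation.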

4 Automatic regularization in the interference-suppression problem

Having solved the problem of automatic regularization of the Wiener equations (2) in the error-minimization problem, we turn our attention to the interference-suppression problem (5) and we reformulate it to take advantage of the development we already made in Sec. 3. To this end, we need to remove the constraint in (5), which is done by expressing $\boldsymbol{w}$ as

	$\displaystyle\boldsymbol{w}$	$\displaystyle=\frac{1}{M}\boldsymbol{a}-{\textnormal{{A}}}\boldsymbol{u},$		(53)

where

	$\displaystyle{\textnormal{{A}}}$	$\displaystyle={\textnormal{{I}}}-\frac{1}{M}\boldsymbol{a}\boldsymbol{a}^{\mathrm{H}}={\textnormal{{A}}}^{\mathrm{H}}$		(54)

is the projection matrix; indeed, it is easy to see that $\boldsymbol{a}^{\mathrm{H}}{\textnormal{{A}}}=\boldsymbol{0}$ , and thus, for any $\boldsymbol{u}\in\mathbb{C}^{M}$ , $\boldsymbol{a}^{\mathrm{H}}\boldsymbol{w}=1$ .

Note that this approach, with a slightly different definition of (53), was also used in [15, 11].
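The properties of the parameterization (53)-(54) are easy to confirm numerically. In the sketch below, the steering vector with unit-modulus entries (so that $\|\boldsymbol{a}\|^{2}=M$ ) and the random vector $\boldsymbol{u}$ are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
M = 6
theta = rng.uniform(0.0, 2.0 * np.pi, M)
a = np.exp(1j * theta)                          # unit-modulus entries: ||a||^2 = M
A = np.eye(M) - np.outer(a, a.conj()) / M       # (54)

u = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # arbitrary u
w = a / M - A @ u                               # (53)

constraint = w.conj() @ a                       # w^H a, should equal 1
leak = a.conj() @ A                             # a^H A, should vanish
```

Because the constraint holds for every $\boldsymbol{u}$ , the minimization over $\boldsymbol{u}$ can proceed without any explicit constraint.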

We may thus reformulate (5) as

$\displaystyle\hat{\boldsymbol{w}}$	$\displaystyle=\frac{1}{M}\boldsymbol{a}-{\textnormal{{A}}}\hat{\boldsymbol{u}},$	(55)
$\displaystyle\hat{\boldsymbol{u}}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\boldsymbol{u}}\left\{\mathds{E}\Big[\|(\boldsymbol{a}/M-{\textnormal{{A}}}\boldsymbol{u})^{\mathrm{H}}\boldsymbol{x}(t)\|^{2}\Big]+\alpha\|\boldsymbol{a}/M-{\textnormal{{A}}}\boldsymbol{u}\|^{2}\right\},$	(56)
	$\displaystyle=\mathop{\mathrm{argmin}}_{\boldsymbol{u}}\left\{\mathds{E}\Big[\|\tilde{d}(t)-\boldsymbol{u}^{\mathrm{H}}\tilde{\boldsymbol{x}}(t)\|^{2}\Big]+\alpha\boldsymbol{u}^{\mathrm{H}}{\textnormal{{A}}}\boldsymbol{u}\right\},$	(57)

where we removed the constant terms from (57), and we used

	$\displaystyle\tilde{\boldsymbol{x}}(t)$	$\displaystyle={\textnormal{{A}}}\boldsymbol{x}(t),$		(58)
	$\displaystyle\tilde{d}(t)$	$\displaystyle=\frac{1}{M}\boldsymbol{a}^{\mathrm{H}}\boldsymbol{x}(t).$		(59)

Proposition 2

The optimization in (57) is equivalent to

	$\displaystyle\hat{\boldsymbol{u}}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\boldsymbol{u}}\left\{\mathds{E}\Big[|\tilde{d}(t)-\boldsymbol{u}^{\mathrm{H}}\tilde{\boldsymbol{x}}(t)|^{2}\Big]+\alpha\|\boldsymbol{u}\|^{2}\right\}.$		(60)

Proof: We can always write $\boldsymbol{u}=\boldsymbol{u}_{\parallel}+\boldsymbol{u}_{\perp}$ , where $\boldsymbol{u}_{\parallel}=\beta\boldsymbol{a}$ is the term collinear with $\boldsymbol{a}$ and $\boldsymbol{u}_{\perp}$ is the term orthogonal to $\boldsymbol{a}$ , i.e., $\boldsymbol{a}^{\mathrm{H}}\boldsymbol{u}_{\perp}=0$ . Then, from $\boldsymbol{u}_{\parallel}^{\mathrm{H}}{\textnormal{{A}}}=\boldsymbol{0}$ , we see that $\boldsymbol{u}^{\mathrm{H}}\tilde{\boldsymbol{x}}(t)=\boldsymbol{u}_{\perp}^{\mathrm{H}}\tilde{\boldsymbol{x}}(t)$ and $\boldsymbol{u}^{\mathrm{H}}{\textnormal{{A}}}\boldsymbol{u}=\|\boldsymbol{u}_{\perp}\|^{2}$ , which means that the cost function under minimization in (57) is insensitive to adding a term collinear with $\boldsymbol{a}$ to any $\boldsymbol{u}$ , i.e., to replacing $\boldsymbol{u}$ with $\boldsymbol{u}+\beta\boldsymbol{a}$ . In particular, we may remove the term collinear with $\boldsymbol{a}$ from $\hat{\boldsymbol{u}}$ by adding a penalty term $\|\boldsymbol{u}_{\parallel}\|^{2}$ , i.e.,

	$\displaystyle\hat{\boldsymbol{u}}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\boldsymbol{u}}\left\{\mathds{E}\Big[|\tilde{d}(t)-\boldsymbol{u}^{\mathrm{H}}\tilde{\boldsymbol{x}}(t)|^{2}\Big]+\alpha(\boldsymbol{u}^{\mathrm{H}}{\textnormal{{A}}}\boldsymbol{u}+\|\boldsymbol{u}_{\parallel}\|^{2})\right\},$		(61)

and, because $\boldsymbol{u}^{\mathrm{H}}{\textnormal{{A}}}\boldsymbol{u}+\|\boldsymbol{u}_{\parallel}\|^{2}=\|\boldsymbol{u}\|^{2}$ , (61) is the same as (60).

The goal of Proposition 2 was to obtain (60) which has the same form as error-minimization problem (2). Thus, we can reuse the equations of the latter, i.e.,

$\displaystyle\hat{\boldsymbol{u}}$	$\displaystyle=({\textnormal{{R}}}_{\tilde{\boldsymbol{x}}}+\alpha{\textnormal{% {I}}})^{-1}\boldsymbol{r}_{\tilde{\boldsymbol{x}}\tilde{d}},$	(62)
$\displaystyle{\textnormal{{R}}}_{\tilde{\boldsymbol{x}}}$	$\displaystyle=\frac{1}{N}\sum_{t=0}^{N-1}\tilde{\boldsymbol{x}}(t)\tilde{% \boldsymbol{x}}^{\mathrm{H}}(t)={\textnormal{{A}}}{\textnormal{{R}}}_{% \boldsymbol{x}}{\textnormal{{A}}},$	(63)
$\displaystyle\boldsymbol{r}_{\tilde{\boldsymbol{x}}\tilde{d}}$	$\displaystyle=\frac{1}{N}\sum_{t=0}^{N-1}\tilde{\boldsymbol{x}}(t)\tilde{d}^{*% }(t)=\frac{1}{M}{\textnormal{{A}}}{\textnormal{{R}}}_{\boldsymbol{x}}% \boldsymbol{a},$	(64)

and we can also apply the iterative solution (49) to find the regularization parameter, that is,

$\displaystyle\alpha^{(i+1)}$	$\displaystyle=\gamma\big(\alpha^{(i)}\big)\frac{\frac{1}{N}\big\|\tilde{\boldsymbol{d}}-\tilde{\textnormal{{X}}}^{\mathrm{H}}\hat{\boldsymbol{u}}\big(\alpha^{(i)}\big)\big\|^{2}}{N\big\|\hat{\boldsymbol{u}}\big(\alpha^{(i)}\big)\big\|^{2}}+\frac{\gamma\big(\alpha^{(i)}\big)}{N}\alpha^{(i)}.$	(65)

Since we removed the terms collinear with $\boldsymbol{a}$ , we have $\hat{\boldsymbol{u}}^{\mathrm{H}}\boldsymbol{a}=0$ , and

$\displaystyle\|\hat{\boldsymbol{u}}^{(i)}\|^{2}$	$\displaystyle=\|\hat{\boldsymbol{w}}^{(i)}\|^{2}-\frac{1}{M},$	(66)

and, from Appendix B, we have

$\displaystyle\gamma^{(i)}$	$\displaystyle=M-\alpha^{(i)}{\textnormal{Tr}}[({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha^{(i)}{\textnormal{{I}}})^{-1}]-\boldsymbol{a}^{\mathrm{H}}{\textnormal{{R}}}_{\boldsymbol{x}}({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha^{(i)}{\textnormal{{I}}})^{-1}\hat{\boldsymbol{w}}^{(i)},$	(67)

which may be integrated in the fixed point iteration.

For example, the Gull-MacKay iteration (52) becomes

	$\displaystyle\hat{\boldsymbol{w}}^{(i)}$	$\displaystyle=\frac{({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha^{(i)}{\textnormal{{I}}})^{-1}\boldsymbol{a}}{\boldsymbol{a}^{\mathrm{H}}({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha^{(i)}{\textnormal{{I}}})^{-1}\boldsymbol{a}},$		(68)
	$\displaystyle\alpha^{(i+1)}$	$\displaystyle=\frac{(\hat{\boldsymbol{w}}^{(i)})^{\mathrm{H}}{\textnormal{{R}}}_{\boldsymbol{x}}\hat{\boldsymbol{w}}^{(i)}}{\left(\|\hat{\boldsymbol{w}}^{(i)}\|^{2}-\displaystyle\frac{1}{M}\right)\left(\displaystyle\frac{N}{\gamma^{(i)}}-1\right)}.$		(69)
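As an illustration, the iteration (67)-(69) may be sketched in Python/NumPy as follows. This is only a minimal sketch under stated assumptions: the function names `mvdr_filter` and `gull_mackay_mvdr`, the data layout (snapshots stored as columns of `X`), and the numerical threshold used to detect the matched-filter case are our own choices and are not prescribed by the derivation above.

```python
import numpy as np

def mvdr_filter(R, a, alpha):
    """Regularized MVDR filter (68) for covariance R and steering vector a."""
    v = np.linalg.solve(R + alpha * np.eye(len(a)), a)
    return v / (a.conj() @ v)

def gull_mackay_mvdr(X, a, alpha0=0.5, n_iter=5):
    """Run n_iter steps of (67)-(69); X is M x N with snapshots x(t) as columns."""
    M, N = X.shape
    R = X @ X.conj().T / N                          # empirical covariance R_x
    alpha = alpha0
    for _ in range(n_iter):
        w = mvdr_filter(R, a, alpha)                # (68)
        P_inv = np.linalg.inv(R + alpha * np.eye(M))
        # gamma from (67)
        gamma = M - alpha * np.trace(P_inv).real - (a.conj() @ R @ P_inv @ w).real
        u_norm2 = np.linalg.norm(w) ** 2 - 1.0 / M  # (66)
        if u_norm2 <= 1e-12:
            # Assumed threshold: w is numerically the matched filter a/M,
            # which corresponds to alpha = infinity.
            return np.inf, a / M
        energy = (w.conj() @ R @ w).real            # output energy w^H R_x w
        alpha = energy / (u_norm2 * (N / gamma - 1.0))  # (69)
    return alpha, mvdr_filter(R, a, alpha)
```

The function returns the last value of the regularization parameter together with the filter recomputed for that value, so the distortionless constraint $\hat{\boldsymbol{w}}^{\mathrm{H}}\boldsymbol{a}=1$ holds for the returned pair.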

5 Numerical examples

5.1 Error-minimization problem: system identification

We consider the problem of identification of an acoustic impulse response, where $x(t)$ is an AR(1) process, i.e., $x(t)=ax(t-1)+u(t)$ and $u(t)$ is generated from a zero-mean unit-variance white Gaussian noise; we use $a=0.9$ . The impulse response $\boldsymbol{h}=[h(0),\ldots,h(M-1)]^{\mathsf{T}}$ with length $M=600$ , shown in Fig. 1, is calculated using software [17] for a room of dimensions $(5,4,6)$ m, the source in position $(2,3.5,2)$ m, the receiver in position $(2,1.5,1)$ m, a sampling rate of $8$ kHz, and a reverberation time of $225$ ms. The desired output is obtained as $d(t)=\boldsymbol{h}^{\mathsf{T}}\boldsymbol{x}(t)+e(t)$ , with $e(t)$ being a zero-mean Gaussian noise with variance $v^{*}_{e}$ , and

$\displaystyle\boldsymbol{x}(t)$	$\displaystyle=[x(t),x(t-1),\dots,x(t-M+1)]^{\mathsf{T}}.$	(70)

We define the SNR as

$\displaystyle\mathrm{SNR}$	$\displaystyle=10\log_{10}\left(\frac{\mathds{E}[|\boldsymbol{h}^{\mathsf{T}}\boldsymbol{x}(t)|^{2}]}{v^{*}_{e}}\right)~{}[{\textnormal{dB}}].$	(71)

Although we use real-valued variables here, it is easy to see that the formulas for finding $\alpha$ , derived in Sec. 3, remain the same.

The quality of the estimate $\hat{\boldsymbol{w}}\equiv\hat{\boldsymbol{w}}(\alpha)$ will be assessed through the misalignment (a relative estimation error) of the impulse response:

$\displaystyle\mathsf{m}(\alpha)$	$\displaystyle=20\log_{10}\left(\frac{\|\hat{\boldsymbol{w}}-\boldsymbol{h}\|_{2}}{\|\boldsymbol{h}\|_{2}}\right)~{}[{\textnormal{dB}}].$	(72)

A simple worst-case reference is obtained by setting $\alpha=\infty$ , for which $\hat{\boldsymbol{w}}=\boldsymbol{0}$ , and thus we have $\mathsf{m}(\infty)=0~{}{\textnormal{dB}}$ . The best-case reference is obtained with the “oracle”-given regularization parameter and its corresponding misalignment:

	$\displaystyle\hat{\alpha}$	$\displaystyle=\mathop{\mathrm{argmin}}_{\alpha}\mathsf{m}(\alpha),$		(73)
	$\displaystyle\hat{\mathsf{m}}$	$\displaystyle=\mathsf{m}(\hat{\alpha}).$		(74)
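To make the evaluation procedure concrete, the following Python/NumPy sketch generates AR(1) data as in (70), computes the regularized estimate, and approximates the oracle (73)-(74) by a grid search. It is a sketch under stated assumptions: the helper names, the short toy impulse response (used in place of the room response obtained with [17]), the small values of $N$ and $M$, the absence of a burn-in period for the AR(1) process, and the logarithmic grid of $\alpha$ are all our own illustrative choices.

```python
import numpy as np

def ar1_regressors(N, M, a=0.9, rng=None):
    """Return X (N x M) whose row t is [x(t), x(t-1), ..., x(t-M+1)], cf. (70)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(N + M - 1)
    x = np.zeros(N + M - 1)
    x[0] = u[0]                                  # no burn-in (simplification)
    for t in range(1, len(x)):
        x[t] = a * x[t - 1] + u[t]               # AR(1) recursion
    return np.array([x[t:t + M][::-1] for t in range(N)])

def ridge_filter(X, d, alpha):
    """Regularized estimate (R_x + alpha I)^{-1} r_xd from real-valued data."""
    N, M = X.shape
    R = X.T @ X / N                              # empirical covariance
    r = X.T @ d / N                              # empirical cross-correlation
    return np.linalg.solve(R + alpha * np.eye(M), r)

def misalignment_db(w_hat, h):
    """Misalignment (72) in dB."""
    return 20 * np.log10(np.linalg.norm(w_hat - h) / np.linalg.norm(h))

def oracle_alpha(X, d, h, alphas):
    """Grid approximation of (73)-(74): best alpha and its misalignment."""
    m = np.array([misalignment_db(ridge_filter(X, d, al), h) for al in alphas])
    i = int(np.argmin(m))
    return alphas[i], m[i]

# Toy experiment (hypothetical sizes and impulse response).
rng = np.random.default_rng(0)
N, M, snr_db = 400, 16, 20.0
h = rng.standard_normal(M) * np.exp(-0.2 * np.arange(M))
X = ar1_regressors(N, M, rng=rng)
y = X @ h
v_e = np.mean(y ** 2) / 10 ** (snr_db / 10)      # noise variance for the target SNR (71)
d = y + np.sqrt(v_e) * rng.standard_normal(N)
alpha_hat, m_hat = oracle_alpha(X, d, h, np.logspace(-4, 2, 61))
```

Note that the oracle requires the true response $\boldsymbol{h}$ and is therefore available only in simulations; it serves as a reference against which any data-driven choice of $\alpha$ can be compared.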

Fig. 2 illustrates the convergence of fixed point iterations (49) and (52): it shows the evolution of $\alpha^{(i)}$ with the starting point $\alpha^{(0)}=0.5$ , chosen to be far from the oracle-given $\hat{\alpha}$ . We evaluate various realizations of the data with $N=1000$ and $M=600$ , and note that, for practical purposes, the Gull-MacKay iteration (52) may be declared converged after $I=5$ iterations, while the fixed-point iteration (49) is slower, requiring approximately twice as many iterations.

All the results we show in the following are thus based on the Gull-MacKay iteration, with $I=5$ and $\alpha^{(0)}=0.5$ . We verified that, in all displayed cases, the condition (46) was not violated.⁵

⁵ This is because we decided to use $N>M$ , which is a practical approach to system identification. However, for smaller $N$ , the condition (46) may be violated.

The results, shown in Fig. 3(a)(c), are consistent with intuition: increasing $N$ and the SNR decreases the estimation error when either the oracle or the fixed-point (Gull-MacKay) regularization is used. In fact, the difference between the regularization parameter $\alpha^{(I)}$ and the oracle-given value $\hat{\alpha}$ is rather small, making the iterative estimation (52) an attractive tool for the choice of $\alpha$ .

Moreover, we observe that (i) the HKB and the Ledoit-Wolf regularization methods may yield worse performance than $\mathsf{m}(\infty)=0$ dB, which is the trivial performance limit. This is well understood for $N<M$ , because then $\alpha_{{\textnormal{HKB}}}=0$ , i.e., the solution is not regularized; see our comments at the end of Sec. 2.1.2. Furthermore, for low SNR, the HKB regularization requires a substantial number of samples (approx. $N>1600$ ) to merely attain $\mathsf{m}(\alpha)=0~{}{\textnormal{dB}}$ ; (ii) the Ledoit-Wolf regularization does not adapt to the data, e.g., for large SNR, it fails to outperform the non-regularized ( $\alpha=0$ ) solution. This is not entirely surprising because the Ledoit-Wolf method does not take into account the cross-correlation $\boldsymbol{r}_{\boldsymbol{x}d}$ .

5.2 Interference suppression problem: beamforming

We consider the antenna-processing scenario, in which the signal $\boldsymbol{x}(t)$ (4) is defined as

$\displaystyle\boldsymbol{x}(t)$	$\displaystyle=\sum_{k=1}^{K}d_{k}(t)\boldsymbol{a}(\phi_{k})+\boldsymbol{e}(t),$	(75)

where $\boldsymbol{e}(t)$ is a zero-mean, circular complex Gaussian noise with covariance matrix $\mathds{E}[\boldsymbol{e}(t)\boldsymbol{e}^{\mathrm{H}}(t)]={\textnormal{{I}}}$ ; $d_{k}(t)$ are zero-mean, i.i.d. Gaussian variables modeling the signals, each with power $\sigma_{k}^{2}=\mathds{E}[|d_{k}(t)|^{2}]$ ; and the steering vector for the angle $\phi$ is defined as

$\displaystyle\boldsymbol{a}(\phi)$	$\displaystyle=[1,\mathrm{e}^{-j\pi\cos(\phi)},\mathrm{e}^{-j2\pi\cos(\phi)},\dots,\mathrm{e}^{-j(M-1)\pi\cos(\phi)}]^{\mathsf{T}},$	(76)

that is, we assume that $\boldsymbol{x}(t)$ is acquired at a linear antenna array with $M$ elements spaced at half-wavelength [1, Ch. 6.5].

The true covariance matrix is thus calculated as

$\displaystyle\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}$	$\displaystyle=\sum_{k=1}^{K}\sigma_{k}^{2}\boldsymbol{a}(\phi_{k})\boldsymbol{a}(\phi_{k})^{\mathrm{H}}+{\textnormal{{I}}}.$	(77)

In the beamforming problem, our goal is to suppress the interference signals $d_{l}(t),l\neq k$ using the filter $\hat{\boldsymbol{w}}_{k}$ found through (6), where we know the steering vector of the signal of interest $\boldsymbol{a}=\boldsymbol{a}(\phi_{k}),k\in\{1,\dots,K\}$ . The quality of interference suppression is measured by the signal-to-interference-plus-noise ratio (SINR) at the output of the filter, calculated as

$\displaystyle\mathsf{SINR}_{k}$	$\displaystyle=\frac{\mathds{E}[|d_{k}(t)|^{2}]}{\mathds{E}[|\hat{\boldsymbol{w}}_{k}^{\mathrm{H}}\boldsymbol{x}(t)|^{2}]-\mathds{E}[|d_{k}(t)|^{2}]}=\frac{\sigma^{2}_{k}}{\hat{\boldsymbol{w}}_{k}^{\mathrm{H}}\overline{{\textnormal{{R}}}}_{\boldsymbol{x}}\hat{\boldsymbol{w}}_{k}-\sigma^{2}_{k}}.$	(78)

In this example, we use $K=3$ , $[\sigma^{2}_{1},\sigma^{2}_{2},\sigma^{2}_{3}]=[20,10,5]{\textnormal{dB}}$ , and $[\phi_{1},\phi_{2},\phi_{3}]=[0.2\pi,0.3\pi,0.6\pi]$ .
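The quantities (76)-(78) are straightforward to evaluate numerically. The Python/NumPy sketch below shows one possible way to do so; the function names and the array size $M=10$ are assumptions made for illustration only (the value of $M$ is not fixed by the text above), while the angles and powers are those of the example.

```python
import numpy as np

def steering_vector(phi, M):
    """Steering vector (76) of a half-wavelength-spaced linear array."""
    return np.exp(-1j * np.pi * np.arange(M) * np.cos(phi))

def true_covariance(phis, powers, M):
    """True covariance matrix (77): signal terms plus unit-variance white noise."""
    R = np.eye(M, dtype=complex)
    for phi, p in zip(phis, powers):
        a = steering_vector(phi, M)
        R += p * np.outer(a, a.conj())
    return R

def sinr(w, R_true, power_k):
    """Output SINR (78) of a filter w that satisfies w^H a(phi_k) = 1."""
    out = (w.conj() @ R_true @ w).real           # total output power
    return power_k / (out - power_k)

# Scenario of the example; M = 10 is a hypothetical choice.
M = 10
phis = np.array([0.2, 0.3, 0.6]) * np.pi
powers = 10 ** (np.array([20.0, 10.0, 5.0]) / 10)    # dB -> linear scale
R_true = true_covariance(phis, powers, M)

k = 2                                            # target index (0-based), i.e., phi_3
a = steering_vector(phis[k], M)
v = np.linalg.solve(R_true, a)
w_ideal = v / (a.conj() @ v)                     # MVDR filter from the true covariance
sinr_ideal_db = 10 * np.log10(sinr(w_ideal, R_true, powers[k]))
```

Here `w_ideal` uses the true covariance matrix and thus gives a reference SINR that filters designed from the empirical covariance ${\textnormal{{R}}}_{\boldsymbol{x}}$ cannot exceed.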

We show in Fig. 4 the empirical frequency of violating condition (46) obtained from 10000 data realizations. In these cases, $L^{\prime}(\alpha)$ often has no finite roots, i.e., $\alpha_{{\textnormal{ML}}}=\infty$ , $\hat{\boldsymbol{u}}=\boldsymbol{0}$ and $\hat{\boldsymbol{w}}=\boldsymbol{a}/M$ . In other words, there are cases where the optimal solution $\hat{\boldsymbol{w}}$ is a matched filter.

To understand why this may happen, we recall that the matched filter is optimal in the presence of white Gaussian noise. This clarifies why the probability of obtaining such a solution is larger for a high-energy target signal (e.g., $k=1$ ): this is when the interference is weak and may, indeed, “appear like” white noise, especially for small $N$ . On the other hand, for weak signals (e.g., $k=3$ ), the interference (e.g., from the signal $k=1$ ) is strong and will emerge from the empirical covariance matrix ${\textnormal{{R}}}_{\boldsymbol{x}}$ , even for relatively small $N$ .

The empirical evaluation of the number of roots of $L^{\prime}(\alpha)$ , shown in Fig. 5 for large and small numbers of samples $N$ , leads to the following observations: (i) for large $N$ , the vast majority of cases produced a unique and finite root $\alpha_{{\textnormal{ML}}}$ , which was obtained here through the Gull-MacKay iteration (69) (since there are two roots, the first one is the minimum; see Proposition 1b); (ii) for small $N$ , the cases where $\alpha_{{\textnormal{ML}}}=\infty$ (when there is one root) or where there are multiple finite roots occur relatively frequently, especially for strong target signals $k\in\{1,2\}$ ; (iii) in the presence of multiple minima, the matched filter solution $\alpha=\infty$ can be competitive with $\alpha_{{\textnormal{ML}}}$ , i.e., $L(\alpha_{{\textnormal{ML}}})\approx L(\infty)$ .

To handle the multiple-roots situation, without explicitly identifying them all (which may be numerically tedious), we propose a two-step approach: First, we find the root $\alpha_{{\textnormal{ML}}}$ using the Gull-MacKay iteration (69). Next, we verify if $L(\alpha_{{\textnormal{ML}}})>L(\infty)$ , in which case we make a replacement $\alpha_{{\textnormal{ML}}}\leftarrow\infty$ , otherwise we keep $\alpha_{{\textnormal{ML}}}$ unchanged. In fact, this heuristic is easy to implement because, from (41), we have $L(\infty)=N\log\tilde{\sigma}_{d}^{2}$ .
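The second step of this heuristic amounts to a single comparison, as the following Python sketch illustrates. The function name `select_alpha` is hypothetical, and the objective $L(\alpha)$ of (41) is not reproduced here: it is assumed to be supplied by the user as a callable `objective`, with $\tilde{\sigma}_{d}^{2}$ passed as `sigma_d2`.

```python
import numpy as np

def select_alpha(alpha_ml, objective, sigma_d2, N):
    """Keep alpha_ml unless the matched-filter solution (alpha = inf) has a lower objective.

    objective : callable returning L(alpha); assumed to implement (41).
    sigma_d2  : the quantity sigma_d^2 (tilde) appearing in L(inf) = N log(sigma_d2).
    """
    L_inf = N * np.log(sigma_d2)
    if objective(alpha_ml) > L_inf:
        return np.inf          # use the matched filter w = a / M
    return alpha_ml
```

When the function returns `np.inf`, the filter is simply $\hat{\boldsymbol{w}}=\boldsymbol{a}/M$ , so no matrix inversion is needed in that case.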

In Fig. 6, we show $\mathsf{SINR}_{k},k=1,2,3$ as a function of $N$ , for different regularization methods.

Similarly, the values of the regularization parameter are shown in Fig. 7. In this case, the thick line corresponds to the median of the regularization parameter, as it gracefully deals with the cases when $\alpha_{{\textnormal{ML}}}=\infty$ .

We observe that (i) the proposed estimation method is very close to the oracle solutions and clearly outperforms the other methods, especially for $N>M$ and for the strong target signal $k=1$ ; (ii) in many cases, for relatively small $N$ and high target signal power ( $k=1$ ), the optimal regularization is $\alpha_{{\textnormal{ML}}}=\infty$ , which means that the optimal solution is a matched filter, see (68); (iii) as in Sec. 5.1, the HKB regularization approaches the optimal solution only for sufficiently large $N$ ; and (iv) the Ledoit-Wolf regularization parameter is independent of the steering vector $\boldsymbol{a}$ (see Fig. 7), which affects its performance; this illustrates well the idea that, in the MVDR problem, the regularization should take into account the steering vector and not only the covariance matrix ${\textnormal{{R}}}_{\boldsymbol{x}}$ .

6 Conclusions

In this work, we presented a method, adopted from the area of statistical machine learning, to find the regularization parameter in two main classes of linear MMSE filters applied in (i) the error-minimization and (ii) the interference suppression problems. Using a probabilistic formulation, we estimate the parameters of the model from the ML principle, where the regularization parameter is found using a few steps of the fixed-point iteration. We also provide data-dependent conditions for the existence of the finite ML solution and show heuristics which deal well with multiple ML solutions.

Numerical examples indicate that the simple iterative solution we show is remarkably close to the optimal regularization parameter.

We compare the proposed solution with other methods known in the literature. We show that the HKB method [13] may be seen as a simplified version of our approach and that the Ledoit-Wolf shrinkage [12] fails to appropriately choose the regularization, which is due to its explicit independence from the desired signal.

Acknowledgments

This work was supported in part by the Fonds de recherche du Québec (FRQ) - Nature et technologies under the Doctoral research scholarships B2X 2024-2025 program, file number 342496, recipient Daniel Gomes de Pinho Zanco.

Appendix A Proof of Proposition 1

Considering (42), we note that $f(\alpha)$ and $g(\alpha)$ shown in (43) and (44) are bounded and positive, therefore, their ratio is also bounded and positive.

Since $\lim_{\alpha\rightarrow 0}\gamma(\alpha)/\alpha=\infty$ , for a sufficiently small $\alpha$ , we have $L^{\prime}(\alpha)<0$ (i.e., $\exists\alpha^{*},\forall\alpha<\alpha^{*},L^{\prime}(\alpha)<0$ ). We also have that

$\displaystyle\lim_{\alpha\rightarrow\infty}L^{\prime}(\alpha)$	$\displaystyle=\lim_{\alpha\rightarrow\infty}\frac{Nf(\alpha)\alpha-g(\alpha)% \gamma(\alpha)}{\alpha g(\alpha)}$	(79)
	$\displaystyle=\lim_{\alpha\rightarrow\infty}\frac{N\alpha\displaystyle\sum_{m=% 1}^{M}\frac{z_{\boldsymbol{x}d,m}^{2}}{(\lambda_{m}+\alpha)^{2}}-\left[\tilde{% \sigma}_{d}^{2}-\displaystyle\sum_{m=1}^{M}\frac{z_{\boldsymbol{x}d,m}^{2}}{% \lambda_{m}+\alpha}\right]\left[\displaystyle\sum_{m=1}^{M}\frac{\lambda_{m}}{% \lambda_{m}+\alpha}\right]}{\alpha\left[\tilde{\sigma}_{d}^{2}-\displaystyle% \sum_{m=1}^{M}\frac{z_{\boldsymbol{x}d,m}^{2}}{\lambda_{m}+\alpha}\right]},$
	$\displaystyle=\frac{\displaystyle\lim_{\alpha\rightarrow\infty}N\displaystyle% \sum_{m=1}^{M}\frac{z_{\boldsymbol{x}d,m}^{2}}{(\lambda_{m}+\alpha)^{2}}-% \displaystyle\frac{1}{\alpha}\left[\tilde{\sigma}_{d}^{2}-\displaystyle\sum_{m% =1}^{M}\frac{z_{\boldsymbol{x}d,m}^{2}}{\lambda_{m}+\alpha}\right]\left[% \displaystyle\sum_{m=1}^{M}\frac{\lambda_{m}}{\lambda_{m}+\alpha}\right]}{% \tilde{\sigma}_{d}^{2}-\displaystyle\lim_{\alpha\rightarrow\infty}% \displaystyle\sum_{m=1}^{M}\frac{z_{\boldsymbol{x}d,m}^{2}}{\lambda_{m}+\alpha% }},$
	$\displaystyle=0,$

thus $\infty$ is a root of $L^{\prime}(\alpha)$ .

Then, from the intermediate value theorem, $L^{\prime}(\alpha)$ has at least one finite root if $L^{\prime}(\alpha)>0$ for a sufficiently large $\alpha$ , i.e., when

\displaystyle N\frac{\alpha f(\alpha)}{\gamma(\alpha)g(\alpha)}>1.

(80)

By taking the limit as $\alpha$ tends to $\infty$ on both sides, we can evaluate whether $L^{\prime}(\alpha)$ is positive for sufficiently large $\alpha$ (i.e., whether it approaches zero from above), which is the case if

$\displaystyle N\lim_{\alpha\rightarrow\infty}\frac{\alpha f(\alpha)}{g(\alpha)% \gamma(\alpha)}$	$\displaystyle>1$	(81)
$\displaystyle N\lim_{\alpha\rightarrow\infty}\frac{\alpha f(\alpha)}{\gamma(% \alpha)}$	$\displaystyle>\lim_{\alpha\rightarrow\infty}g(\alpha)$	(82)
$\displaystyle N\lim_{\alpha\rightarrow\infty}\frac{\alpha\displaystyle\sum_{m=% 1}^{M}\frac{z_{\boldsymbol{x}d,m}^{2}}{(\lambda_{m}+\alpha)^{2}}}{% \displaystyle\sum_{m=1}^{M}\frac{\lambda_{m}}{\lambda_{m}+\alpha}}$	$\displaystyle>\tilde{\sigma}_{d}^{2}$	(83)
$\displaystyle N\lim_{\alpha\rightarrow\infty}\frac{\displaystyle\sum_{m=1}^{M}% \frac{z_{\boldsymbol{x}d,m}^{2}}{(\frac{\lambda_{m}}{\alpha}+1)^{2}}}{% \displaystyle\sum_{m=1}^{M}\frac{\lambda_{m}}{\frac{\lambda_{m}}{\alpha}+1}}$	$\displaystyle>\tilde{\sigma}_{d}^{2}$	(84)
$\displaystyle N\frac{\displaystyle\sum_{m=1}^{M}z_{\boldsymbol{x}d,m}^{2}}{% \displaystyle\sum_{m=1}^{M}\lambda_{m}}$	$\displaystyle>\tilde{\sigma}_{d}^{2}$	(85)
$\displaystyle N\|\boldsymbol{r}_{\boldsymbol{x}d}\|^{2}$	$\displaystyle>\tilde{\sigma}_{d}^{2}\sum_{m=1}^{M}\lambda_{m},$	(86)

where (86) is the same as (46).

When (86) is true, $L^{\prime}(\alpha)$ changes sign at least once, and thus $L^{\prime}(\alpha)$ has at least 2 roots (one at $\infty$ and the other at the sign change). If there are 3 roots, then (86) cannot be true, since $L^{\prime}(0)<0$ and $L^{\prime}(\infty)=0$ , and three roots would require two sign changes. These observations can be extended to an arbitrary number of roots. In fact, (86) can only be true if the number of roots is even, and, since $\infty$ is always a root, the condition also tells us if there is at least one finite root.

This finishes the proof.
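Since $\sum_{m}\lambda_{m}={\textnormal{Tr}}[{\textnormal{{R}}}_{\boldsymbol{x}}]$ , condition (86) can be checked directly from the data without an eigendecomposition. A minimal Python/NumPy sketch follows; the function name is hypothetical, and we assume that $\tilde{\sigma}_{d}^{2}$ is the empirical power of the desired signal, $\frac{1}{N}\sum_{t}|d(t)|^{2}$ , which is consistent with $\lim_{\alpha\rightarrow\infty}g(\alpha)=\tilde{\sigma}_{d}^{2}$ used in (82)-(83).

```python
import numpy as np

def has_finite_root(X, d):
    """Check condition (86): N ||r_xd||^2 > sigma_d^2 * Tr(R_x).

    X : M x N array whose columns are the snapshots x(t).
    d : length-N array with the desired signal d(t).
    """
    M, N = X.shape
    R = X @ X.conj().T / N                   # empirical covariance R_x
    r = X @ d.conj() / N                     # empirical cross-correlation r_xd
    sigma_d2 = np.mean(np.abs(d) ** 2)       # assumed definition of sigma_d^2 (tilde)
    return N * np.linalg.norm(r) ** 2 > sigma_d2 * np.trace(R).real
```

For instance, if $d(t)$ is empirically uncorrelated with $\boldsymbol{x}(t)$ , then $\boldsymbol{r}_{\boldsymbol{x}d}=\boldsymbol{0}$ and the condition fails, so that no finite root is guaranteed.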

Appendix B Derivation of $\gamma$ in MVDR filter

To calculate $\gamma={\textnormal{Tr}}[{\textnormal{{I}}}-\alpha({\textnormal{{A}}}{% \textnormal{{R}}}_{\boldsymbol{x}}{\textnormal{{A}}}+\alpha{\textnormal{{I}}})% ^{-1}]$ , we find

$\displaystyle{\textnormal{{I}}}-\alpha({\textnormal{{A}}}{\textnormal{{R}}}_{% \boldsymbol{x}}{\textnormal{{A}}}+\alpha{\textnormal{{I}}})^{-1}$	$\displaystyle={\textnormal{{A}}}(\alpha{\textnormal{{R}}}_{\boldsymbol{x}}^{-1% }+{\textnormal{{I}}}-\boldsymbol{a}\boldsymbol{a}^{\mathrm{H}}/M)^{-1}{% \textnormal{{A}}},$	(87)
	$\displaystyle={\textnormal{{A}}}[({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{% \textnormal{{I}}}){\textnormal{{R}}}_{\boldsymbol{x}}^{-1}-\boldsymbol{a}% \boldsymbol{a}^{\mathrm{H}}/M]^{-1}{\textnormal{{A}}},$	(88)
	$\displaystyle={\textnormal{{A}}}\left[{\textnormal{{R}}}_{\boldsymbol{x}}({% \textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}+\frac{{% \textnormal{{R}}}_{\boldsymbol{x}}({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{% \textnormal{{I}}})^{-1}\boldsymbol{a}\boldsymbol{a}^{\mathrm{H}}{\textnormal{{% R}}}_{\boldsymbol{x}}({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{% I}}})^{-1}}{M-\boldsymbol{a}^{\mathrm{H}}{\textnormal{{R}}}_{\boldsymbol{x}}({% \textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}\boldsymbol{a% }}\right]{\textnormal{{A}}},$	(89)
	$\displaystyle={\textnormal{{A}}}\left[{\textnormal{{S}}}+\frac{{\textnormal{{S% }}}\boldsymbol{a}\boldsymbol{a}^{\mathrm{H}}{\textnormal{{S}}}}{M-\boldsymbol{% a}^{\mathrm{H}}{\textnormal{{S}}}\boldsymbol{a}}\right]{\textnormal{{A}}},$	(90)

where ${\textnormal{{S}}}={\textnormal{{R}}}_{\boldsymbol{x}}({\textnormal{{R}}}_{% \boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}={\textnormal{{I}}}-\alpha({% \textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}$ .

Next, using the fact that ${\textnormal{{A}}}{\textnormal{{A}}}={\textnormal{{A}}}$ ,

$\displaystyle\gamma$	$\displaystyle={\textnormal{Tr}}({\textnormal{{S}}}{\textnormal{{A}}})+\frac{{% \textnormal{Tr}}({\textnormal{{S}}}\boldsymbol{a}\boldsymbol{a}^{\mathrm{H}}{% \textnormal{{S}}}{\textnormal{{A}}})}{M-\boldsymbol{a}^{\mathrm{H}}{% \textnormal{{S}}}\boldsymbol{a}},$	(91)
	$\displaystyle={\textnormal{Tr}}({\textnormal{{S}}})-\frac{1}{M}\boldsymbol{a}^% {\mathrm{H}}{\textnormal{{S}}}\boldsymbol{a}+\frac{\boldsymbol{a}^{\mathrm{H}}% {\textnormal{{S}}}{\textnormal{{S}}}\boldsymbol{a}-\frac{1}{M}\|\boldsymbol{a}^% {\mathrm{H}}{\textnormal{{S}}}\boldsymbol{a}\|^{2}}{M-\boldsymbol{a}^{\mathrm{H% }}{\textnormal{{S}}}\boldsymbol{a}},$	(92)
	$\displaystyle={\textnormal{Tr}}({\textnormal{{S}}})-\frac{\boldsymbol{a}^{% \mathrm{H}}{\textnormal{{S}}}({\textnormal{{I}}}-{\textnormal{{S}}})% \boldsymbol{a}}{\boldsymbol{a}^{\mathrm{H}}({\textnormal{{I}}}-{\textnormal{{S% }}})\boldsymbol{a}},$	(93)
	$\displaystyle={\textnormal{Tr}}({\textnormal{{S}}})-\boldsymbol{a}^{\mathrm{H}% }{\textnormal{{S}}}\hat{\boldsymbol{w}},$	(94)
	$\displaystyle=M-\alpha{\textnormal{Tr}}[({\textnormal{{R}}}_{\boldsymbol{x}}+% \alpha{\textnormal{{I}}})^{-1}]-\boldsymbol{a}^{\mathrm{H}}{\textnormal{{R}}}_% {\boldsymbol{x}}({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})% ^{-1}\hat{\boldsymbol{w}}.$	(95)
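The closed form (95) can be cross-checked numerically against the definition of $\gamma$ given at the beginning of this appendix. The Python/NumPy sketch below does this for a random covariance matrix; the function names, the sizes, and the random test data are our own choices, and we take ${\textnormal{{A}}}={\textnormal{{I}}}-\boldsymbol{a}\boldsymbol{a}^{\mathrm{H}}/M$ as it appears in (87).

```python
import numpy as np

def gamma_direct(R, a, alpha):
    """gamma = Tr[I - alpha (A R A + alpha I)^{-1}] with A = I - a a^H / M."""
    M = len(a)
    I = np.eye(M)
    A = I - np.outer(a, a.conj()) / M
    return np.trace(I - alpha * np.linalg.inv(A @ R @ A + alpha * I)).real

def gamma_closed_form(R, a, alpha):
    """Closed form (95), using the MVDR filter w of (68)."""
    M = len(a)
    P_inv = np.linalg.inv(R + alpha * np.eye(M))
    w = P_inv @ a / (a.conj() @ P_inv @ a)
    return M - alpha * np.trace(P_inv).real - (a.conj() @ R @ P_inv @ w).real

# Random test case (hypothetical sizes).
rng = np.random.default_rng(0)
M, N, alpha = 6, 40, 0.3
X = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
R = X @ X.conj().T / N
a = np.exp(-1j * np.pi * np.arange(M) * np.cos(0.3 * np.pi))   # satisfies ||a||^2 = M
gap = abs(gamma_direct(R, a, alpha) - gamma_closed_form(R, a, alpha))
```

The closed form avoids forming ${\textnormal{{A}}}{\textnormal{{R}}}_{\boldsymbol{x}}{\textnormal{{A}}}$ and reuses the inverse $({\textnormal{{R}}}_{\boldsymbol{x}}+\alpha{\textnormal{{I}}})^{-1}$ that is already needed to compute $\hat{\boldsymbol{w}}$ .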

References

[1] A. H. Sayed, Adaptive Filters. Hoboken, New Jersey: John Wiley & Sons, 2008.
[2] L.-M. Dogariu, J. Benesty, C. Paleologu, and S. Ciochină, “An insightful overview of the Wiener filter for system identification,” Applied Sciences, vol. 11, no. 17, 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/17/7774
[3] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, and L. Ljung, Regularized System Identification. Springer, 2022.
[4] D. M. Allen, “Mean square error of prediction as a criterion for selecting variables,” Technometrics, vol. 13, no. 3, pp. 469–475, 1971. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/00401706.1971.10488811
[5] G. H. Golub, M. Heath, and G. Wahba, “Generalized cross-validation as a method for choosing a good ridge parameter,” Technometrics, vol. 21, no. 2, pp. 215–223, 1979. [Online]. Available: http://www.jstor.org/stable/1268518
[6] D. Barber, Bayesian Reasoning and Machine Learning. New York: Cambridge University Press, 2012.
[7] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, 2002.
[8] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, May 2005.
[9] J. Li, P. Stoica, and Z. Wang, “On robust Capon beamforming and diagonal loading,” IEEE Transactions on Signal Processing, vol. 51, no. 7, pp. 1702–1715, 2003.
[10] J. Li and P. Stoica, “An adaptive filtering approach to spectral estimation and SAR imaging,” IEEE Transactions on Signal Processing, vol. 44, no. 6, pp. 1469–1484, 1996.
[11] L. Du, J. Li, and P. Stoica, “Fully automatic computation of diagonal loading levels for robust adaptive beamforming,” IEEE Transactions on Aerospace and Electronic Systems, vol. 46, no. 1, pp. 449–458, 2010.
[12] O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0047259X03000964
[13] A. E. Hoerl, R. W. Kennard, and K. F. Baldwin, “Ridge regression: Some simulations,” Communications in Statistics, vol. 4, no. 2, pp. 105–123, 1975. [Online]. Available: https://doi.org/10.1080/03610927508827232
[14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer Series in Statistics, 2009.
[15] Y. Selén, R. Abrahamsson, and P. Stoica, “Automatic robust adaptive beamforming via ridge regression,” Signal Processing, vol. 88, no. 1, pp. 33–49, 2008. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0165168407002460
[16] D. Gomes de Pinho Zanco, L. Szczecinski, and J. Benesty. (2023) Automatic regularization for linear MMSE filters. [Online]. Available: https://arxiv.longhoe.net/pdf/2312.06560
[17] N. Werner, “audiolabs/rir-generator: Version 0.2.0,” 2023.
