Automatic Regularization for Linear MMSE Filters
Abstract
In this work, we consider the problem of regularization in the design of minimum mean square error (MMSE) linear filters. Exploiting the relationship with statistical machine learning methods and adopting a Bayesian approach, we find the regularization parameter from the observed signals in a simple and automatic manner. The proposed approach is illustrated in system identification and beamforming examples, where the automatic regularization is shown to yield near-optimal results.
keywords:
MMSE filter, regularization, Bayesian approach, system identification, beamforming.
1 Introduction
Minimum mean square error (MMSE) linear filters are ubiquitous in many signal processing applications such as channel equalization [1, Ch. 5.4], system identification [2], antenna beamforming [1, Ch. 6.5], and many others.
The two main classes of MMSE filters are (i) error minimization, where the linear filter is designed to approximate the desired signal with the smallest average squared error, and (ii) interference suppression, where the objective is to minimize the interference energy while maintaining the energy of the desired signal.
The equations solved to obtain the MMSE filters rely on the implicit or explicit inversion of the covariance matrix of the input signal. To avoid numerical problems and to guarantee the uniqueness of the solution, the equations must be regularized, as is most often done by adding a positive regularization parameter to the diagonal elements of the covariance matrix.
Determining the regularization parameter is frequently regarded as a challenge for practitioners and, depending on the signal-to-noise ratio (SNR) or the type of problem, it is often handcrafted for each specific application. This attitude is changing and, recently, regularization has received in-depth attention in the context of system identification [3].
On the other hand, this issue is rather well known in the contexts of machine learning and regression analysis, where methods such as cross-validation [4, 5] or expectation maximization (EM) [6, 18.1.3] are often used to find parameters which are not of direct interest, but affect the solutions (known as hyperparameters).
However, despite regularization being crucial to finding MMSE filters, the signal processing literature, in general, does not use the simple and general solutions from the area of machine learning. The main reason, we believe, is that they are not offered in closed form and, in general, may require searching over the entire space of solutions and solving the regularized equations multiple times. We show that, in practice, the solution can be found very efficiently via fixed-point iteration and does not entail any significant increase in complexity if we exploit the eigenvalue decomposition of the covariance matrix.
This paper is organized as follows. We start with the general problem formulation in Sec. 2 and, in Sec. 3, we reformulate it using the probabilistic framework, which allows us to apply maximum likelihood (ML) estimation to the parameters defining the model and obtain the optimal regularization parameter. Section 4 discusses automatic regularization in the interference-suppression problem. In Sec. 5, to illustrate the operation of the proposed method, we apply it to system identification (as an example of error minimization) and to beamforming (as an example of interference suppression) to show how the automatically regularized MMSE filters compare to other methods proposed in the literature and to an “oracle” solution. The oracle relies on ex-ante knowledge of the best regularization parameter, which is obtained by a grid search that maximizes the performance criterion of interest; this is possible in simulations, where we know all the signals involved.
The examples indicate that the regularization parameter we find automatically adjusts to changes in the operational conditions (such as the SNR) and to the problem structure.
Our main conclusion is that, with the machine learning approach, automatic regularization becomes so simple that it deserves to be a go-to solution in the signal processing context.
2 MMSE problem formulation
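We consider the linear filtering of the input signal $\mathbf{x}_n$ using the weights/filter $\mathbf{w}$, aiming at the approximation of the desired signal $y_n$. There are two categories of this problem with respect to how the filter is found, which are described below.
• The error-minimization problem, where we know the desired signal $y_n$. The filtering error is given by
$$e_n = y_n - \mathbf{w}^{\mathsf{H}} \mathbf{x}_n, \qquad (1)$$
and the MMSE problem consists in solving
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \; \mathbb{E}\big[|e_n|^2\big] + \alpha \|\mathbf{w}\|^2 \qquad (2)$$
$$\hphantom{\hat{\mathbf{w}}} = (\mathbf{R} + \alpha \mathbf{I})^{-1} \mathbf{p}, \qquad (3)$$
where $\mathbb{E}[\cdot]$ denotes mathematical expectation taken with respect to all random variables, $\alpha$ is a regularization parameter, $\|\cdot\|$ is the Euclidean norm, $\mathbf{R} = \mathbb{E}[\mathbf{x}_n \mathbf{x}_n^{\mathsf{H}}]$, $\mathbf{I}$ is the identity matrix, and $\mathbf{p} = \mathbb{E}[\mathbf{x}_n y_n^{*}]$; we use $(\cdot)^{\mathsf{H}}$ to denote the conjugate-transpose operation, and $(\cdot)^{*}$ denotes complex conjugation.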
• The interference suppression problem, where we assume that the signal has the form:
(4) with being the interference, and the response generated by the desired signal , where . The goal is then to minimize the (energy of) interference in the filtered output , i.e.,
(5) while maintaining the energy of the desired signal, as enforced by the constraint . The problem (5) is known to be solved by [7, Sec. 2.8]
(6)
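To make the closed form (6) concrete, the following NumPy sketch computes a diagonally loaded LCMV filter. It assumes the standard textbook form, in which the inverse of the loaded covariance matrix is applied to the response vector and the result is normalized so that the unit-gain constraint holds. The names `lcmv_weights`, `R`, `h`, and `alpha` are illustrative choices and do not come from the original text.

```python
import numpy as np

def lcmv_weights(R, h, alpha):
    """Diagonally loaded LCMV/MVDR weights under the constraint w^H h = 1.

    R     : (M, M) Hermitian covariance matrix (or its estimate)
    h     : (M,) response (steering) vector of the desired signal
    alpha : non-negative regularization (diagonal loading) parameter
    """
    M = R.shape[0]
    # Solve (R + alpha I) z = h instead of forming the inverse explicitly.
    z = np.linalg.solve(R + alpha * np.eye(M), h)
    # Normalize so that the distortionless constraint w^H h = 1 holds.
    return z / np.vdot(h, z)

# Small synthetic example (all values are arbitrary).
rng = np.random.default_rng(0)
M = 4
G = rng.standard_normal((M, M)) + 1j * rng.standard_normal((M, M))
R = G @ G.conj().T / M                      # random Hermitian PSD matrix
h = np.exp(1j * np.pi * 0.3 * np.arange(M))
w = lcmv_weights(R, h, alpha=0.1)
```

As the loading grows, the filter tends to the matched filter `h / ||h||^2`, which is why an infinite regularization parameter is a meaningful solution in this problem.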
Numerous applications of these two problems have been presented in the literature. For example, the error minimization problem (3) is found in system identification, equalization [7, Ch. 2], interference cancellation [8, Ch. 8], and many others. The interference suppression problem (5) is popular in beamforming [9] and spectral estimation [10].
Note that (3) is a regularized version of the Wiener equation [7, Ch. 2.4] and (6) is the regularized version of the linearly-constrained minimum variance (LCMV) filter [7, Ch. 2.8]. However, in textbook formulations, the problems (2) or (5) are defined with $\alpha = 0$, i.e., without regularization. The latter is added in (3) and (6) by practitioners [7, Ch. 8.10], [2, Sec. 4], [11, Sec. 2.B] with the aim of improving the conditioning of the matrix that must be inverted, at least implicitly (the explicit inversion of the matrix in (3) may be avoided by solving the corresponding system of linear equations).
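The main reason why regularization is required comes from the fact that, in practice, we do not have access to the covariance matrix $\mathbf{R}$ or the cross-correlation vector $\mathbf{p}$. Rather, they are estimated from $N$ observations of the input $\mathbf{x}_n$ and the desired signal $y_n$ using time-averaging,
$$\hat{\mathbf{R}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n \mathbf{x}_n^{\mathsf{H}}, \qquad (7)$$
$$\hat{\mathbf{p}} = \frac{1}{N}\sum_{n=1}^{N} \mathbf{x}_n y_n^{*}. \qquad (8)$$
Then, the regularization term, $\alpha \mathbf{I}$, is a practical solution to deal with imperfect estimates (7)-(8), and/or with the numerical errors involved in solving (3). The regularization parameter $\alpha$ has to be “appropriately chosen” and will depend on all the elements of the model (1). In particular, since the importance of the estimation errors in (7)-(8) decreases with $N$, we expect that the value of $\alpha$ also decreases with $N$.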
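As an illustration of the time-averaged estimates and of the regularized Wiener solution, a minimal NumPy sketch is given below. It assumes the conventions $\mathbf{R} = \mathbb{E}[\mathbf{x}_n \mathbf{x}_n^{\mathsf{H}}]$ and $\mathbf{p} = \mathbb{E}[\mathbf{x}_n y_n^{*}]$, stores the input vectors as rows of an array, and solves the linear system rather than inverting the matrix. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def empirical_moments(X, y):
    """Time-averaged estimates of R = E[x x^H] and p = E[x y^*].

    X : (N, M) array whose n-th row holds the input vector x_n
    y : (N,) array of desired-signal samples y_n
    """
    N = X.shape[0]
    R_hat = X.T @ X.conj() / N      # (1/N) sum_n x_n x_n^H
    p_hat = X.T @ y.conj() / N      # (1/N) sum_n x_n y_n^*
    return R_hat, p_hat

def wiener_regularized(R_hat, p_hat, alpha):
    """Solve (R_hat + alpha I) w = p_hat without forming a matrix inverse."""
    M = R_hat.shape[0]
    return np.linalg.solve(R_hat + alpha * np.eye(M), p_hat)

# Noiseless synthetic example: y_n = w^H x_n.
rng = np.random.default_rng(1)
N, M = 50, 3
X = rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))
w_true = np.array([1.0 + 0.5j, -0.3j, 2.0])
y = X @ w_true.conj()
R_hat, p_hat = empirical_moments(X, y)
w0 = wiener_regularized(R_hat, p_hat, 0.0)   # unregularized solution
w1 = wiener_regularized(R_hat, p_hat, 1.0)   # regularized, shrunk toward zero
```

In this noiseless setting the unregularized solution recovers `w_true`, while a positive `alpha` shrinks the norm of the estimate.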
2.1 Known regularization solutions in signal processing
Since regularization is recognized as an important practical element in the definition of linear filters, this problem has been addressed in the literature, particularly in the context of the minimum variance distortionless response (MVDR) formulation; the two most representative solutions are shown below.
2.1.1 Ledoit-Wolf matrix shrinkage
The Ledoit-Wolf matrix shrinkage method [12] assumes the following relationship between the true and empirical covariance matrix
(9) |
and, by minimizing the squared Frobenius norm of the approximation error:
(10) |
finds the shrinkage parameters as [11, Eqs. (32)-(33)]
(11) | ||||
(12) |
where
(13) | ||||
(14) |
with being the trace of a square matrix.
By factorizing , the shrinkage parameters can then be converted back into a regularization parameter:
(15) |
This method has been used to find the regularization in the interference suppression problem (6), e.g., in [11]. On the other hand, we are not aware of its application to the error-minimization problem (3), most likely because the latter depends not only on the noisy covariance matrix but also explicitly requires a noisy cross-correlation vector.
In that regard, the interference suppression problem uses a noisy covariance matrix and an error-free response vector and, therefore, appears to be affected only by errors in the former. As we will see, such an interpretation is misleading, and the regularization also depends on the response vector. (Note that, in some works, e.g., [9], the problem is formulated assuming that the response vector is also corrupted by errors. We do not use such a model, as assuming that it is perfectly known allows us to emphasize the fact that the regularization depends not only on the noise but also on the deterministic elements of the model.)
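The shrinkage-to-loading conversion can be sketched in NumPy as follows. This is our reading of the Ledoit-Wolf estimator with a scaled-identity target, in which the shrunk covariance is `alpha0 * I + beta0 * R_hat`. The exact expressions in (11)-(15) should be checked against [11] and [12], and the function and variable names are illustrative.

```python
import numpy as np

def ledoit_wolf_loading(X):
    """Diagonal loading implied by Ledoit-Wolf shrinkage toward a scaled identity.

    X : (N, M) array whose n-th row holds the (zero-mean) input vector x_n
    Returns (alpha0, beta0, alpha): the shrunk covariance is
    alpha0 * I + beta0 * R_hat, and alpha = alpha0 / beta0 is the
    equivalent regularization parameter.
    """
    N, M = X.shape
    R_hat = X.T @ X.conj() / N
    nu = np.trace(R_hat).real / M                       # average eigenvalue
    d2 = np.linalg.norm(R_hat - nu * np.eye(M), "fro") ** 2
    # Sample estimate of the squared Frobenius error of R_hat.
    rho = (np.sum(np.linalg.norm(X, axis=1) ** 4) / N**2
           - np.linalg.norm(R_hat, "fro") ** 2 / N)
    alpha0 = min(nu * rho / d2, nu)
    beta0 = 1.0 - alpha0 / nu
    alpha = alpha0 / beta0 if beta0 > 0 else np.inf
    return alpha0, beta0, alpha

# Synthetic example with few samples, where shrinkage is substantial.
rng = np.random.default_rng(2)
X_demo = rng.standard_normal((20, 5))
alpha0, beta0, alpha = ledoit_wolf_loading(X_demo)
```

Note that only the input samples enter the computation; neither the desired signal nor the response vector is used.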
2.1.2 Hoerl, Kennard, and Baldwin regularization
Some regularization strategies are derived by exploiting the fact that the Wiener equations (3) can be obtained from the regularized ordinary least squares (OLS) problem:
(16) |
where and ; is a transposition operator.
The method proposed by Hoerl, Kennard, and Baldwin (HKB) in [13, Eq. (2.2)] finds the regularization in two steps. First, (16) is solved without regularization and, next, the regularization parameter is calculated as
(17) |
where
(18) | ||||
(19) |
and is the number of degrees of freedom of the solution. In the error-minimization problem, we set , while in the interference suppression problem, due to a linear constraint on , we set .
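The two HKB steps can be sketched as follows for real-valued data. The sketch uses the classical HKB ridge estimator, with the number of degrees of freedom set to the number of coefficients, and it assumes that the covariance estimate is normalized by the number of samples, hence the division by `N`. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def hkb_alpha(X, y):
    """Hoerl-Kennard-Baldwin style regularization from an unregularized fit.

    X : (N, M) real regressor matrix with N > M and full column rank
    y : (N,) real observations
    """
    N, M = X.shape
    # Step 1: ordinary least squares, i.e., no regularization.
    w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Step 2: residual noise variance and average energy per coefficient.
    sigma2_e = np.sum((y - X @ w_ols) ** 2) / (N - M)
    sigma2_w = np.sum(w_ols ** 2) / M
    # Division by N matches a covariance estimate normalized as X^T X / N.
    return sigma2_e / (N * sigma2_w)

# The same noise realization at two noise levels.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5))
w = np.ones(5)
z = rng.standard_normal(200)
alpha_lo = hkb_alpha(X, X @ w + 0.1 * z)
alpha_hi = hkb_alpha(X, X @ w + 1.0 * z)
```

The first step requires more samples than coefficients, which is the limitation of this method discussed later in the text.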
3 Bayesian formulation and inference of regularization parameter
To obtain the Bayesian formulation of the problem, we rewrite (1) in vector form:
(20) |
where , X are already defined in (16), and .
Assuming that are independent, identically distributed (i.i.d.) zero-mean Gaussian variables with variance , we have
(21) |
where
(22) |
denotes the probability density function (PDF) of a circular, complex Gaussian with mean and covariance matrix V.
The Bayesian approach models the parameter as a random vector with posterior distribution given by
(23) |
Then, assuming the elements to be i.i.d. zero-mean, Gaussian random variables with variance , i.e.,
(24) |
it is simple to see that, using (21) and (24), the posterior distribution (23) is given by
(25) |
where
(26) | ||||
(27) | ||||
(28) |
Of course, (27) being the mean of the posterior, it is also the maximum a posteriori (MAP) estimate, i.e., and is the same as the solution of the Wiener equation (2) obtained from empirical moments given in (7)-(8).
This modeling approach is well known in signal processing textbooks. For example, [3, Ch. 4] or [1, Part VII - Summary and Notes] note the equivalence between the MAP estimation of the filter and the Wiener (least-squares) solution. On the other hand, the signal processing literature does not exploit this model to its full extent and does not estimate the parameters of the model, even though doing so would immediately define the regularization parameter via (28). An additional advantage is that, knowing these parameters, we can find the posterior covariance matrix, which allows us to assess the uncertainty of the estimation: its diagonal elements are the posterior variances of the individual filter coefficients.
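The equivalence between the posterior mean and the regularized Wiener solution can be checked numerically. The sketch below treats the real-valued Gaussian linear model with an i.i.d. Gaussian prior on the weights. It assumes that the covariance estimate is normalized by the number of samples, so that the regularization parameter equals the noise variance divided by `N` times the prior variance. All names are illustrative.

```python
import numpy as np

def posterior(X, y, sigma2_e, sigma2_w):
    """Posterior mean and covariance of w for y = X w + e (real-valued case).

    Prior: w ~ N(0, sigma2_w I); noise: e ~ N(0, sigma2_e I).
    """
    N, M = X.shape
    precision = X.T @ X / sigma2_e + np.eye(M) / sigma2_w
    cov = np.linalg.inv(precision)
    mean = cov @ (X.T @ y) / sigma2_e
    return mean, cov

rng = np.random.default_rng(4)
N, M = 40, 4
X = rng.standard_normal((N, M))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.3 * rng.standard_normal(N)
sigma2_e, sigma2_w = 0.09, 1.0
mean, cov = posterior(X, y, sigma2_e, sigma2_w)

# Regularized Wiener solution with alpha = sigma2_e / (N * sigma2_w).
alpha = sigma2_e / (N * sigma2_w)
w_ridge = np.linalg.solve(X.T @ X / N + alpha * np.eye(M), X.T @ y / N)
```

The diagonal of `cov` gives the posterior variance of each coefficient, which is the uncertainty measure mentioned above.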
3.1 Inference
We will infer the parameters using the ML approach:
(29) | ||||
(30) |
where, instead of and , we parameterized the variables using , which does not affect the optimality of the ML solution and focuses directly on the regularization parameter we are interested in. (Of course, we can obtain the ML estimates and , too.)
We marginalize over to obtain
(31) |
with the distributions under integration being those shown in (21) and (24); the conditioning on merely makes explicit their dependence on the parameters . Since all the variables are Gaussian, it is rather easy to show that
(32) |
where is the estimate of the second moment of , and, from (3), is real.
Thus,
(33) | ||||
which, for a given , is uniquely minimized by satisfying
(34) | ||||
(35) |
Then, (29) is reduced to
(36) | ||||
(37) | ||||
(38) |
Using the eigenvalue decomposition, , where is a diagonal matrix with diagonal elements taken from the vector , being the eigenvalues of , and the columns of are the corresponding eigenvectors, we obtain
(39) | ||||
(40) |
so (38) may be written as
(41) |
and, now, we easily find its derivative:
(42) |
where
(43) | ||||
(44) |
in which and are given in (18) and (19), respectively, and the latter uses
(45) |
also known as the effective number of parameters [14, Sec. 7.6]. Note that and, for , if no eigenvalues are zero, we can use , as we did in (19).
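The role of the eigenvalue decomposition can be illustrated with a short NumPy sketch: once the eigenvalues and eigenvectors of the covariance estimate are available, both the regularized solution and the effective number of parameters are obtained for any value of the regularization parameter at negligible cost. The names `lam`, `U`, and the two helper functions are illustrative.

```python
import numpy as np

def effective_parameters(lam, alpha):
    """Effective number of parameters: sum_i lam_i / (lam_i + alpha)."""
    return np.sum(lam / (lam + alpha))

def wiener_from_eig(lam, U, p_hat, alpha):
    """(R_hat + alpha I)^{-1} p_hat computed from R_hat = U diag(lam) U^H."""
    return U @ ((U.conj().T @ p_hat) / (lam + alpha))

rng = np.random.default_rng(5)
N, M = 40, 6
X = rng.standard_normal((N, M))
y = rng.standard_normal(N)
R_hat = X.T @ X / N
p_hat = X.T @ y / N
lam, U = np.linalg.eigh(R_hat)          # computed once

gamma0 = effective_parameters(lam, 0.0) # equals M when no eigenvalue is zero
w_eig = wiener_from_eig(lam, U, p_hat, 0.3)
w_ref = np.linalg.solve(R_hat + 0.3 * np.eye(M), p_hat)
```

After the decomposition, each new value of the regularization parameter costs only two matrix-vector products and an element-wise division.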
As already noted in [15], solving amounts to finding the real roots of the polynomial of degree not larger than , whose properties are described in the following:
Proposition 1 (Roots of )
1. , i.e., is a root of .
2. The odd-numbered roots (the first, the third, etc.) of are minima of .
3. has an even number of roots if and only if
(46)
Proof: See Appendix A.
Some comments are in order.
• We should appreciate the possibility of the absence of finite roots of . Note that, if the only root (we consider only the real roots, which are the meaningful solutions) is , then it is also the first root, which means that is minimized for , in which case . The fact that such a solution may be optimal is not at all obvious when formulating the filtering problem. As we will see empirically, it is indeed the case in some scenarios.
• Since , , and are empirical means, which, for large , tend to their corresponding expected values, (46) is likely to be satisfied for sufficiently large , where the latter dominates the left-hand side (l.h.s.) of (46). In other words, by increasing , we will have an even number of roots and then is a local maximum of and thus is finite.
Finding the roots may be done by exploiting the polynomial structure of but, in practice, this is feasible only for moderate , e.g., in MVDR receivers applied in arrays composed of dozens of antennas. For large , e.g., , typical in system identification and/or equalization, the roots may be found, e.g., via grid search [15]. However, these methods are not always practical, which may explain why they did not receive much attention in the literature; in fact, they were not reused as a go-to solution by the authors of [15], e.g., in [11].
Our goal is thus to propose a simple approach to solve , which, after reorganizing (42), is equivalent to solving
(47) |
which we do via a fixed-point iteration:
(48) | ||||
(49) | ||||
(50) |
where is a predefined number of iterations, and initialization must be defined.
Note that:
- •
• With the initialization , the first iteration of (49) yields
(51) which is exactly the HKB method shown in (17). We can thus say that our solution generalizes the HKB method, enhancing it with an iterative refinement and removing the need to initialize with a non-regularized solution, i.e., , which may be problematic in general, since the latter cannot be solved meaningfully for .
- •
- •
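A fixed-point search of this kind can be sketched in a few lines of NumPy. The update below is the textbook Gull-MacKay style evidence update written for a covariance estimate normalized by the number of samples. It is meant as an illustration and may differ in detail from (49) and (52). The real-valued setting, the function name, the default initialization, and the synthetic data are all assumptions.

```python
import numpy as np

def auto_regularization(X, y, alpha0=1.0, n_iter=50):
    """Fixed-point search for the regularization parameter (real-valued sketch).

    X : (N, M) regressor matrix, y : (N,) observations.
    Returns the final alpha and the corresponding filter estimate.
    """
    N, M = X.shape
    R_hat = X.T @ X / N
    p_hat = X.T @ y / N
    lam, U = np.linalg.eigh(R_hat)            # computed once, reused below
    lam = np.clip(lam, 0.0, None)             # guard against round-off
    q = U.T @ p_hat
    alpha = alpha0
    for _ in range(n_iter):
        w = U @ (q / (lam + alpha))           # solution for the current alpha
        gamma = np.sum(lam / (lam + alpha))   # effective number of parameters
        sigma2_w = np.sum(w ** 2) / gamma
        sigma2_e = np.sum((y - X @ w) ** 2) / (N - gamma)
        alpha = sigma2_e / (N * sigma2_w)
    return alpha, U @ (q / (lam + alpha))

# Synthetic example: unit-variance weights and unit-variance noise.
rng = np.random.default_rng(6)
N, M = 100, 20
X = rng.standard_normal((N, M))
w_true = rng.standard_normal(M)
y = X @ w_true + rng.standard_normal(N)
alpha_hat, w_hat = auto_regularization(X, y, n_iter=200)
```

Starting this loop from a zero regularization parameter, when more samples than coefficients are available, makes the first pass coincide with the HKB-style two-step estimate.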
4 Automatic regularization in the interference-suppression problem
Having solved the problem of automatic regularization of the Wiener equations (2) in the error-minimization problem, we turn our attention to the interference-suppression problem (5) and we reformulate it to take advantage of the development we already made in Sec. 3. To this end, we need to remove the constraint in (5), which is done by expressing as
(53) |
where
A | (54) |
is the projection matrix; indeed, it is easy to see that , and thus, for any , .
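The way a projection matrix removes the constraint can be verified numerically. The sketch below assumes the usual parameterization in which the filter is the sum of a fixed component along the response vector and an arbitrary vector projected onto the orthogonal complement. The exact form of (53)-(54) may differ, and the names `h`, `u`, and `projector_orthogonal_to` are hypothetical.

```python
import numpy as np

def projector_orthogonal_to(h):
    """Projection matrix onto the subspace orthogonal to the vector h."""
    M = h.shape[0]
    return np.eye(M) - np.outer(h, h.conj()) / np.vdot(h, h).real

rng = np.random.default_rng(7)
M = 5
h = rng.standard_normal(M) + 1j * rng.standard_normal(M)
A = projector_orthogonal_to(h)

# Any unconstrained vector u yields a filter that satisfies w^H h = 1.
u = rng.standard_normal(M) + 1j * rng.standard_normal(M)
w = h / np.vdot(h, h).real + A @ u
gain = np.vdot(w, h)        # w^H h
```

Because the constraint holds for every `u`, the constrained problem becomes an unconstrained one in `u`.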
We may thus reformulate (5) as
(55) | ||||
(56) | ||||
(57) |
where we removed the constant terms from (57), and we used
(58) | ||||
(59) |
Proposition 2
The optimization in (57) is equivalent to
(60) |
Proof: We can always write , where is the term collinear with and is the term orthogonal to , i.e., . Then, from , we see that and , which means that the cost function under minimization in (57) is insensitive to adding a term collinear with to any , i.e., . In particular, we may remove the term collinear with from by adding a penalty term , i.e.,
(61) |
The goal of Proposition 2 was to obtain (60) which has the same form as error-minimization problem (2). Thus, we can reuse the equations of the latter, i.e.,
(62) | ||||
(63) | ||||
(64) |
and we can also apply the iterative solution (49) to find the regularization factor, that is
(65) |
Since we removed the terms collinear with , we have , and
(66) |
and, from Appendix B, we have
(67) |
which may be integrated in the fixed point iteration.
For example, the Gull-MacKay iteration (52) becomes
(68) | ||||
(69) |
5 Numerical examples
5.1 Error-minimization problem: system identification
We consider the problem of identification of an acoustic impulse response, where is an AR(1) process, i.e., and is generated from a zero-mean unit-variance white Gaussian noise; we use . The impulse response with length , shown in Fig. 1, is calculated using software [17] for a room of dimensions m, the source in position m, the receiver in position m, a sampling rate of kHz, and a reverberation time of ms. The desired output is obtained as , with being a zero-mean Gaussian noise with variance , and
(70) |
We define the SNR as
SNR | (71) |
![Fig. 1: Acoustic impulse response used in the system identification example](x1.png)
Although we use real variables, it is easy to see that the formulas to find , derived in Sec. 3, are the same.
The quality of the estimate will be assessed through the misalignment (a relative estimation error) of the impulse response:
(72) |
A simple worst-case metric is obtained by setting , for which , and thus we have . The best-case reference is obtained with the “oracle”-given regularization parameter and its corresponding misalignment:
(73) | ||||
(74) |
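The misalignment and the oracle reference can be sketched as follows. The sketch assumes that the misalignment is the squared norm of the estimation error relative to the squared norm of the true response, expressed in dB, and that the oracle is found by a grid search, which requires the true response and is therefore possible only in simulations. The function names, the grid, and the data are illustrative.

```python
import numpy as np

def misalignment_db(w_hat, w_true):
    """Relative estimation error ||w_hat - w_true||^2 / ||w_true||^2 in dB."""
    num = np.sum(np.abs(w_hat - w_true) ** 2)
    den = np.sum(np.abs(w_true) ** 2)
    return 10.0 * np.log10(num / den)

def oracle_alpha(R_hat, p_hat, w_true, grid):
    """Grid search for the alpha minimizing the misalignment (needs w_true)."""
    M = R_hat.shape[0]
    errors = [misalignment_db(np.linalg.solve(R_hat + a * np.eye(M), p_hat), w_true)
              for a in grid]
    best = int(np.argmin(errors))
    return grid[best], errors[best]

rng = np.random.default_rng(8)
N, M = 30, 10
X = rng.standard_normal((N, M))
w_true = rng.standard_normal(M)
y = X @ w_true + rng.standard_normal(N)
R_hat, p_hat = X.T @ X / N, X.T @ y / N
grid = np.logspace(-4, 1, 30)
alpha_best, err_best = oracle_alpha(R_hat, p_hat, w_true, grid)
```

An all-zero estimate gives a misalignment of 0 dB, which is the trivial limit against which the regularization methods are compared.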
![Fig. 2: Convergence of the fixed-point iterations (49) and (52)](x2.png)
Fig. 2 illustrates the convergence of fixed point iterations (49) and (52): it shows the evolution of with the starting point , chosen to be far from the oracle-given . We evaluate various realizations of the data with and , and note that, beyond , for practical purposes, convergence may be declared for Gull-MacKay, while the fixed-point iteration (49) is slower, requiring approximately twice as many iterations.
All the results we show in the following are thus based on the Gull-MacKay iteration, with and . We verified that, in all displayed cases, the condition (46) was not violated. (This is because we decided to use , which is a practical approach to system identification. However, for smaller , the condition (46) may be violated.)
The results, shown in Fig. 3(a)(c), are consistent with intuition: by increasing and SNR, we decrease the estimation error when the oracle and the fixed-point (Gull-MacKay) iteration regularization is used. In fact, the difference between the regularization parameter and the oracle-given value is rather small, making the iterative estimation (52) an attractive tool for the choice of .
Moreover, we observe that (i) the HKB and the Ledoit-Wolf regularization methods may yield worse performance than dB, which is the trivial performance limit. This is well understood for , because then , i.e., the solution is not regularized; see our comments at the end of Sec. 2.1.2. Moreover, for low SNR, the HKB regularization requires a substantial number of samples (approx. ) to merely attain , (ii) the Ledoit-Wolf regularization does not adapt to the data, e.g., for large SNR it fails to outperform the non-regularized () solution. This is not entirely surprising because the Ledoit-Wolf method does not take into account the cross-correlation .
![Refer to caption](x3.png)
![Refer to caption](x4.png)
![Refer to caption](x5.png)
![Refer to caption](x6.png)
5.2 Interference suppression problem: beamforming
We consider the antenna-processing scenario, in which the signal (4) is defined as
(75) |
where is a zero-mean, circular complex Gaussian noise with covariance matrix , and are zero-mean, unit-variance, i.i.d. Gaussian variables modeling signals, each with power , and the steering vector for the angle is defined as
(76) |
that is, we assume that is acquired at a linear antenna array with elements spaced at half-wavelength [1, Ch. 6.5].
The true covariance matrix is thus calculated as
(77) |
In the beamforming problem, our goal is to suppress the interference signals using the filter found through (6), where we know the steering vector of the signal of interest . The quality of interference suppression is measured by the signal-to-interference-plus-noise ratio (SINR) at the output of the filter, calculated as
(78) |
In this example, we use , , and .
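The beamforming setup can be sketched as follows. The steering vector assumes a half-wavelength uniform linear array with the phase growing with the sine of the angle. Sign and angle-reference conventions vary between texts, so this choice is an assumption. The SINR is computed as the output power of the signal of interest divided by the output power of interference plus noise. All names and numerical values are illustrative.

```python
import numpy as np

def steering_vector(theta, M):
    """Half-wavelength uniform linear array response for angle theta (radians)."""
    return np.exp(1j * np.pi * np.arange(M) * np.sin(theta))

def output_sinr(w, a0, power0, R_in):
    """SINR at the filter output.

    w      : (M,) filter
    a0     : (M,) steering vector of the signal of interest
    power0 : power of the signal of interest
    R_in   : (M, M) covariance matrix of interference plus noise
    """
    signal = power0 * np.abs(np.vdot(w, a0)) ** 2
    noise = np.vdot(w, R_in @ w).real
    return signal / noise

# White-noise-only check: the matched filter attains M * power0 / noise_var.
M = 8
a0 = steering_vector(0.2, M)
noise_var, power0 = 0.5, 2.0
w_mf = a0 / M                                  # matched filter, w^H a0 = 1
sinr_mf = output_sinr(w_mf, a0, power0, noise_var * np.eye(M))
```

With white noise only, the matched filter is optimal, which is consistent with the cases discussed below where the best regularization parameter is infinite.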
![Fig. 4: Empirical frequency of violating condition (46)](x7.png)
We show in Fig. 4 the empirical frequency of violating condition (46) obtained from 10000 data realizations. In these cases, often has no finite roots, i.e., , and . In other words, there are cases where the optimal solution is a matched filter.
To understand why this may happen, we recall that the matched filter is optimal in the presence of white Gaussian noise. This clarifies why the probability of obtaining such a solution is larger for high-energy target signal (e.g., ): this is when the interference is weak and may, indeed, “appear like” white noise, especially for small . On the other hand, for weak signals (e.g., ), the interference (e.g., from the signal ) is strong and will emerge from the empirical covariance matrix , even for relatively small .
![Refer to caption](x8.png)
![Refer to caption](x9.png)
The empirical evaluation of the number of roots of , shown in Fig. 5 for large and small numbers of samples , leads to the following observations: (i) for large , the vast majority of cases produced a unique and finite root , which was obtained here through the Gull-MacKay iteration (69) (since there are two roots, the first one is the minimum; see Proposition 1b); (ii) for small , frequent cases are those with (a single root) or with multiple finite roots; this occurs relatively frequently, especially for strong target signals ; (iii) in the presence of multiple minima, the matched filter solution can be competitive with , i.e., .
To handle the multiple-roots situation, without explicitly identifying them all (which may be numerically tedious), we propose a two-step approach: First, we find the root using the Gull-MacKay iteration (69). Next, we verify if , in which case we make a replacement , otherwise we keep unchanged. In fact, this heuristic is easy to implement because, from (41), we have .
In Fig. 6, we show as a function of , for different regularization methods.
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
Similarly, the values of the regularization parameter are shown in Fig. 7. In this case, the thick line corresponds to the median of the regularization parameter, as it gracefully deals with the cases when .
We observe that (i) the proposed estimation method is very close to the oracle solutions and clearly outperforms other methods, especially for and for strong target signal , (ii) in many cases, for relatively small and high target signal power (), the optimal regularization is , which means that the optimal solution is a matched filter, see (68), (iii) as in Sec. 5.1, the HKB regularization approaches the optimal solution only for sufficiently large , and (iv) the Ledoit-Wolf regularization parameter is independent of the steering vector (see Fig. 7), which affects its performance; this illustrates well the idea that, in the MVDR problem, the regularization should take into account the steering vector and not only the covariance matrix .
![Refer to caption](x13.png)
![Refer to caption](x14.png)
![Refer to caption](x15.png)
6 Conclusions
In this work, we presented a method, adopted from the area of statistical machine learning, to find the regularization parameter in two main classes of linear MMSE filters applied in (i) the error-minimization and (ii) the interference suppression problems. Using a probabilistic formulation, we estimate the parameters of the model from the ML principle, where the regularization parameter is found using a few steps of the fixed-point iteration. We also provide data-dependent conditions for the existence of the finite ML solution and show heuristics which deal well with multiple ML solutions.
Numerical examples indicate that the simple iterative solution we show is remarkably close to the optimal regularization parameter.
We compare the proposed solution with other methods known in the literature. We show that the HKB method [13] may be seen as a simplified version of our approach and that the Ledoit-Wolf shrinkage [12] fails to appropriately choose the regularization, which is due to its explicit independence from the desired signal.
Acknowledgments
This work was supported in part by the Fonds de recherche du Québec (FRQ) - Nature et technologies under the Doctoral research scholarships B2X 2024-2025 program, file number 342496, recipient Daniel Gomes de Pinho Zanco.
Appendix A Proof of Proposition 1
Considering (42), we note that and shown in (43) and (44) are bounded and positive, therefore, their ratio is also bounded and positive.
Since , for a sufficiently small , we have (i.e., ). We also have that
(79) | ||||
thus is a root of .
Then, from the intermediate value theorem, has at least one finite root if for a sufficiently large , i.e., when
(80) |
By taking the limit as tends to on both sides, we can evaluate if is decreasing, such that if
(81) | ||||
(82) | ||||
(83) | ||||
(84) | ||||
(85) | ||||
(86) |
When (86) is true, changes sign at least once, and thus has at least 2 roots (one at and the other at the sign change). If there are 3 roots, then (86) cannot be true, since and , and three roots would require two sign changes. These observations can be extended to an arbitrary number of roots. In fact, (86) can only be true if the number of roots is even, and, since is always a root, the condition also tells us if there is at least one finite root.
This finishes the proof.
Appendix B Derivation of in MVDR filter
To calculate , we find
(87) | ||||
(88) | ||||
(89) | ||||
(90) |
where .
Next, using the fact that ,
(91) | ||||
(92) | ||||
(93) | ||||
(94) | ||||
(95) |
References
- [1] A. H. Sayed, Adaptive Filters. Hoboken, New Jersey: John Wiley & Sons, 2008.
- [2] L.-M. Dogariu, J. Benesty, C. Paleologu, and S. Ciochină, “An insightful overview of the Wiener filter for system identification,” Applied Sciences, vol. 11, no. 17, 2021. [Online]. Available: https://www.mdpi.com/2076-3417/11/17/7774
- [3] G. Pillonetto, T. Chen, A. Chiuso, G. De Nicolao, and L. Ljung, Regularized System Identification. Springer, 2022.
- [4] D. M. Allen, “Mean square error of prediction as a criterion for selecting variables,” Technometrics, vol. 13, no. 3, pp. 469–475, 1971. [Online]. Available: https://www.tandfonline.com/doi/abs/10.1080/00401706.1971.10488811
- [5] G. H. Golub, M. Heath, and G. Wahba, “Generalized cross-validation as a method for choosing a good ridge parameter,” Technometrics, vol. 21, no. 2, pp. 215–223, 1979. [Online]. Available: http://www.jstor.org/stable/1268518
- [6] D. Barber, Bayesian Reasoning and Machine Learning. New York: Cambridge University Press, 2012.
- [7] S. Haykin, Adaptive Filter Theory, 4th ed. Prentice Hall, 2002.
- [8] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, May 2005.
- [9] J. Li, P. Stoica, and Z. Wang, “On robust Capon beamforming and diagonal loading,” IEEE Transactions on Signal Processing, vol. 51, no. 7, pp. 1702–1715, 2003.
- [10] J. Li and P. Stoica, “An adaptive filtering approach to spectral estimation and SAR imaging,” IEEE Transactions on Signal Processing, vol. 44, no. 6, pp. 1469–1484, 1996.
- [11] L. Du, J. Li, and P. Stoica, “Fully automatic computation of diagonal loading levels for robust adaptive beamforming,” IEEE Transactions on Aerospace and Electronic Systems, vol. 46, no. 1, pp. 449–458, 2010.
- [12] O. Ledoit and M. Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0047259X03000964
- [13] A. E. Hoerl, R. W. Kennard, and K. F. Baldwin, “Ridge regression: Some simulations,” Communications in Statistics, vol. 4, no. 2, pp. 105–123, 1975. [Online]. Available: https://doi.org/10.1080/03610927508827232
- [14] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer Series in Statistics, 2009.
- [15] Y. Selén, R. Abrahamsson, and P. Stoica, “Automatic robust adaptive beamforming via ridge regression,” Signal Processing, vol. 88, no. 1, pp. 33–49, 2008. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0165168407002460
- [16] D. Gomes de Pinho Zanco, L. Szczecinski, and J. Benesty. (2023) Automatic regularization for linear MMSE filters. [Online]. Available: https://arxiv.longhoe.net/pdf/2312.06560
- [17] N. Werner, “audiolabs/rir-generator: Version 0.2.0,” 2023.