1 Dipartimento di Fisica, Università di Pisa, and INFN Sezione di Pisa, Pisa I-56127, Italy
2 Università di Trento, Dipartimento di Fisica, I-38123 Povo, Trento, Italy
3 INFN, Trento Institute for Fundamental Physics and Applications, I-38123 Povo, Trento, Italy
4 Institute for Gravitational and Subatomic Physics (GRASP), Utrecht University, Princetonplein 1, 3584 CC, Utrecht, The Netherlands
5 Nikhef, Science Park 105, 1098 XG, Amsterdam, The Netherlands
6 Royal Holloway University of London, London, United Kingdom

Maximum Entropy Spectral Analysis: an application to gravitational waves data analysis

Alessandro Martini, Stefano Schmidt, Gregory Ashton, Walter Del Pozzo
Abstract

The Maximum Entropy Spectral Analysis (MESA) method, developed by Burg, offers a powerful tool for spectral estimation of a time series. It relies on Jaynes’ maximum entropy principle, allowing the spectrum of a stochastic process to be inferred using the coefficients of an autoregressive process AR($p$) of order $p$. A closed-form recursive solution provides estimates for both the autoregressive coefficients and the order $p$ of the process. We provide a ready-to-use implementation of this algorithm in a Python package called memspectrum, characterized through power spectral density (PSD) analysis on synthetic data with known PSD and comparisons of different criteria for stopping the recursion. Additionally, we compare the performance of our implementation with the ubiquitous Welch algorithm, using synthetic data generated from the GW150914 strain spectrum released by the LIGO-Virgo-KAGRA collaboration. Our findings indicate that Burg’s method provides PSD estimates with systematically lower variance and bias. This is particularly manifest in the case of a small ($\mathcal{O}(5000)$) number of data points, making Burg’s method most suitable to work in this regime. Since this is close to the typical length of analysed gravitational-wave data, improving the estimate of the PSD in this regime leads to more reliable posterior profiles for the system under study. We conclude our investigation by utilising MESA, and its particularly easy parametrisation where the only free parameter is the order $p$ of the AR process, to marginalise over the interferometers’ noise PSD in conjunction with inferring the parameters of GW150914.

1 Introduction

The problem of inferring the morphology and the defining parameters of deterministic signals superimposed on stochastic processes is one of the most widespread and interesting problems in several areas of human activity. Whenever some form of model for the signal we are looking for is available, the problem is typically solved via the Wiener filter, defined as the whitening filter that maximises the signal-to-noise ratio, i.e. the relative power of the (known) signal over the power of the (known) underlying stochastic process. Hence, signal detection and characterisation require accurate knowledge of i) the shape of the signal we are looking for and ii) the statistical properties of the stochastic process. The construction of signal models is typically driven either by physical or by mathematical arguments; hence, although extremely difficult in general, it is doable. On the other hand, stochastic process models can be extremely difficult to construct, both for practical and theoretical reasons. A stochastic process is fully described by the knowledge of the probability distribution governing its realisations (the “paths” of the random variable under scrutiny) over the entire time axis, from $t=-\infty$ to $t=\infty$. Clearly, acquiring this knowledge is not possible in practice. Therefore modelling a stochastic process relies either on modelling the underlying physical processes, thus falling back onto the deterministic case, or on modelling the mathematical and statistical properties of the process, potentially inferring them from the process realisations. The study of the properties of stochastic processes is thus a crucial task in many fields of physics, astronomy, quantitative biology, as well as engineering and finance. Among the classes of stochastic processes, a key role is played by wide-sense stationary processes.
These are stochastic processes that display an invariance of their statistical properties, such as their two-point autocovariance function, with respect to translation of the independent variable, usually the time $t$. If $x(t)$ is a wide-sense stationary process, its statistical properties are completely determined by the knowledge of the (many-points) autocorrelation functions. In practice, one often has easy access to the two-point correlation function

$$C(\tau)=\mathbf{E}[x_{t}\cdot x_{t+\tau}] \tag{1}$$

or, equivalently, to the process power spectral density (PSD) $S(f)$. Thanks to the Wiener-Khinchin theorem, in wide-sense stationary processes, the two are in fact related by a Fourier transform:

$$S(f)=\int_{-\infty}^{\infty}\textrm{d}\tau\, C(\tau)e^{-i2\pi f\tau}\,. \tag{2}$$

In the context of gravitational-wave physics (e.g. Finn_1992), the PSD is introduced as

$$\mathbf{E}[\tilde{x}(f)\cdot\tilde{x}(f^{\prime})]=S(f)\delta(f-f^{\prime}) \tag{3}$$

without highlighting its connection with the time structure of the process itself, thus masking some important properties that will be explored further in what follows. The definition in Eq. (3) gives, however, i) a straightforward interpretation of the PSD: it measures how much signal “power” is located at each frequency; and ii) an operative way of estimating it for an unknown process.
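The relation in Eq. (2) has an exact finite-sample counterpart that is easy to check numerically. The following is a minimal sketch, assuming a unit sampling interval and periodic (circular) boundary conditions for the sample autocorrelation; the helper names `dft` and `circular_autocorrelation` are hypothetical and do not belong to any package mentioned in the text.

```python
import cmath
import random

def dft(values):
    """Naive O(n^2) discrete Fourier transform (adequate for short demos)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * t / n) for t, v in enumerate(values))
        for k in range(n)
    ]

def circular_autocorrelation(x):
    """Sample autocorrelation with periodic boundary conditions (an assumption)."""
    n = len(x)
    return [sum(x[t] * x[(t + lag) % n] for t in range(n)) / n for lag in range(n)]

random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(64)]

# Periodogram: squared modulus of the Fourier transform of the data.
periodogram = [abs(c) ** 2 / len(x) for c in dft(x)]
# Fourier transform of the (circular) sample autocorrelation.
spectrum_from_acf = [c.real for c in dft(circular_autocorrelation(x))]

# The two spectra agree up to floating-point rounding.
max_diff = max(abs(p - s) for p, s in zip(periodogram, spectrum_from_acf))
```

With the circular convention the agreement is an algebraic identity; for a non-periodic autocorrelation estimate it holds only approximately.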

A ubiquitous method for such a computation is due to Welch1967 and is based on Eqs. (2-3). The PSD is obtained by slicing the observed realisation $x(t_{1}),\ldots,x(t_{n})$ of the process $x(t)$ into many window-corrected batches and averaging the squared moduli of their Fourier transforms. This approach is equivalent Lomb Scargle to taking the Fourier transform of the windowed sample autocorrelation $\rho_{W}$, written as

$$\rho_{W}=\left\{W_{0}\rho_{0},W_{\pm 1}\rho_{\pm 1},\dots,W_{\pm M}\rho_{\pm M},0,0,\dots\right\}, \tag{4}$$

where $\rho$ is the empirical autocorrelation and $M$ is the maximum time lag at which the autocorrelation is computed. The sequence $W$ is a window function that can be chosen in several different ways, each choice presenting advantages and disadvantages for the final estimate of the PSD.

The choice of a window function is arbitrary and is typically made by trial and error, until a satisfactory compromise between variance and resolution of the PSD estimate is reached. A high frequency resolution implies high variance and vice versa. Besides the window function, Welch’s method requires a number of arbitrary choices to be made, such as the number of time slices and the overlap between consecutive slices. All these knobs must be tuned by hand and their choice can dramatically affect the PSD estimation, hence raising the question of what the “best” PSD estimate is.
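To make the tunable knobs concrete, here is a minimal pure-Python sketch of an averaged-periodogram estimate in the spirit of Welch’s method. It is an illustration only: the function name `welch_psd`, its parameters, the Hann window and the two-sided normalisation are assumptions chosen for this example, not the implementation used in the paper or in any library.

```python
import cmath
import math
import random

def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]

def welch_psd(x, segment_length, overlap, dt=1.0):
    """Two-sided PSD estimate: average of windowed periodograms.

    segment_length, overlap and the window are the hand-tuned knobs.
    """
    window = hann(segment_length)
    norm = dt / sum(w * w for w in window)  # compensates the window power
    step = segment_length - overlap
    starts = range(0, len(x) - segment_length + 1, step)
    psd = [0.0] * segment_length
    for start in starts:
        seg = [w * v for w, v in zip(window, x[start:start + segment_length])]
        for k in range(segment_length):
            coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / segment_length)
                        for t, v in enumerate(seg))
            psd[k] += norm * abs(coeff) ** 2
    return [p / len(starts) for p in psd]

# White Gaussian noise with unit variance: the expected two-sided PSD is
# flat and equal to sigma^2 * dt = 1.
random.seed(1)
noise = [random.gauss(0.0, 1.0) for _ in range(2048)]
psd = welch_psd(noise, segment_length=64, overlap=32)
mean_level = sum(psd) / len(psd)
```

Shorter segments give more periodograms to average (lower variance) but fewer frequency bins (coarser resolution), which is the compromise described above.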

Another drawback of this approach is the requirement for the window to be $0$ outside the interval in which the autocorrelation is computed. We are arbitrarily assuming $\rho_{j}=0$ for $j>M$ and modifying the estimate (i.e. the data) if a non-rectangular window is chosen. Making assumptions on unobserved data and modifying the ones we have at our disposal introduces “spurious” information about the process that we, in general, do not really have.

An alternative approach, which provides a smooth PSD estimate, is to adopt a parametric model for the PSD and to fit its parameters to the data with a Reversible Jump Markov Chain Monte Carlo Cornish_2015 ; Littenberg_2015 . Despite being effective, this method is problem dependent, since it needs to make definite assumptions on the shape of the PSD. Moreover, it can be computationally expensive and it does not come with a handy publicly available implementation. For all the above reasons, we did not consider such methods in our work.

An appealing alternative, based on the Maximum Entropy principle JaynesArticle ; jaynes2003ptl ; Jaynes_MAXENT , has been derived by burg1975maximum . We will see that Burg’s method, being rooted in solid theoretical foundations, does not, unlike Welch’s, require any preprocessing of the data and requires very little tuning of the algorithm parameters, since it provides an iterative closed-form expression for the spectrum of a stationary stochastic time series. Furthermore, it embeds the PSD estimation problem into an elegant theoretical framework and makes minimal assumptions on the nature of the data. Lastly and most importantly, it provides a robust link between spectral density estimation and the field of autoregressive processes. This provides a natural and simple machinery to forecast a time series, thus predicting future observations based on previous ones.

In this paper, we discuss the details of the Maximum Entropy principle, its application to the problem of PSD estimation with Burg’s algorithm and the link between Burg’s algorithm and autoregressive processes. Our goal is to bring (again) to public attention Maximum Entropy Spectral Analysis, in the hope that it will be widely employed as a way out of the many undesired aspects of Welch’s algorithm (or other similar methods). To facilitate this goal, we based this study on memspectrum, a freely available, robust and easy-to-use Python implementation of the algorithm described below [Footnote 1: It is available at https://pypi.org/project/memspectrum/.]. We provide a thorough assessment of the performance of our code and we validate our results by performing a number of tests on simulated and real data. We also compare our results with those of spectral analysis carried out with the standard Welch’s method. In order to apply our model in a realistic setting, we analyse some time series of broad interest in the scientific community.

Our paper is organized as follows: we begin by briefly reviewing the theoretical foundations of the maximum entropy principle in Sec. 2. Sec. 3 presents the validation of Burg’s method as well as of our implementation on simulated data. In Sec. 4 we compare the results from memspectrum with the Welch method; Sec. 5 presents a few applications to real time series, including the analysis of GW150914, and, finally, we conclude with a discussion in Sec. 6.

2 Theoretical foundations

The Maximum Entropy principle (MAXENT) is among the most important results in probability theory. It provides a way to uniquely assign probabilities to a phenomenon in a way that best represents our state of knowledge, while being non-committal to unavailable information. Its domain of application turned out to be wider than expected. In fact, thanks to burg1975maximum , this method has also been applied to perform high-quality computation of the power spectral densities of time series.

After a short introduction to Jaynes’ MAXENT (Sec. 2.1), we will review in detail Burg’s technique of Maximum Entropy Spectral Analysis (MESA) and show that the estimate can always be expressed in an analytical closed form (Sec. 2.2). Next, we will discuss the interesting link between Burg’s method and autoregressive processes (Sec. 2.3) and in Sec. 2.4 we will use this link to forecast a time series.

2.1 Maximum Entropy Principle

Before introducing the MAXENT principle, we will define, via some simple examples, the two core concepts of the problem and the roles they play in deductive inference: the ‘evidence’ and the ‘information’. Let us start with the ‘information’ (or information entropy): it is a measure of the degree of uncertainty on the outcomes of some experiment and specifies the length of the message necessary to provide a full description of the system under study. As an example, no information is conveyed if we are studying a system whose outcome is certain (the outcome is known with probability $p=1$), since in this case a communication is not even needed. Shannon proposed the quantity

$$I=\log_{2}\frac{1}{p(x)} \tag{5}$$

to measure the quantity of information brought by an outcome $x$ with probability $p(x)$. It is an additive quantity as well as a monotonically decreasing function of $p\in[0,1]$: the more unlikely the outcome, the higher the information it brings.
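The two properties just mentioned, additivity and monotonic decrease, can be checked directly from Eq. (5). The snippet below is a small sketch; the function name `self_information` is hypothetical, and the example probabilities are chosen only for illustration.

```python
import math

def self_information(p):
    """Shannon information, in bits, of an outcome with probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return math.log2(1.0 / p)

certain = self_information(1.0)    # a certain outcome carries no information
fair_coin = self_information(0.5)  # one bit
rare = self_information(0.25)      # two bits: less likely, more informative

# Additivity: for independent outcomes the probabilities multiply
# and the information adds.
joint = self_information(0.5 * 0.25)
```

Here `joint` equals `fair_coin + rare`, which is the additivity property for independent outcomes.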

We can generalize the definition of information to the case where two different outcomes $E_{1},E_{2}$, with given probabilities $P_{1}$ and $P_{2}$, are possible. To gain some intuition on the problem, we ask ourselves which probability assignments make the outcome most uncertain (i.e. maximize the information). If $P_{1}$ and $P_{2}$ are largely different, for instance $P_{1}=0.999$ and $P_{2}=0.001$, we are allowed to believe that event $E_{1}$ will occur almost certainly, considering $E_{2}$ to be a very implausible outcome. The information content will be very low. On the other hand, the most unpredictable situation happens when

$$P_{1}=P_{2}=\frac{1}{2}:$$

this describes a situation of ‘maximum ignorance’ and the information content of such a system must be high. Any generalization of Eq. (5) must then have its maximum when $P_{1}=P_{2}$. For $N$ events, the highest possible information content is attained when:

$$P_{1}=\ldots=P_{N}=\frac{1}{N}.$$

Shannon showed that the only functional form that is continuous with respect to its parameters, is additive and has a maximum for equal-probability events is:

$$H[p_{1},\dots,p_{N}]=-\sum_{i=1}^{N}p_{i}\log{p_{i}}\,, \tag{6}$$

which can be interpreted as the ‘expected information’ brought by an experiment with $N$ possible outcomes each with its own probability $p_{i}$. In the continuous case:

$$H[p(x)]=-\int p(x)\ln p(x)dx\,. \tag{7}$$

We call the functional $H$ the information entropy [Footnote 2: In defining the information entropy as in Eq. (7), we are implicitly assuming a uniform measure over the parameter space. In the case of a non-uniform measure $m(x)$, the definition generalises to $H[p(x)]=-\int p(x)\ln\frac{p(x)}{m(x)}dx$.].
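The discrete entropy of Eq. (6) can be evaluated for the two-outcome examples discussed earlier. This is a minimal sketch using natural logarithms; the function name `shannon_entropy` is hypothetical.

```python
import math

def shannon_entropy(probabilities):
    """Entropy H = -sum p_i ln p_i (in nats); terms with p_i = 0 contribute zero."""
    if abs(sum(probabilities) - 1.0) > 1e-12:
        raise ValueError("probabilities must sum to one")
    return -sum(p * math.log(p) for p in probabilities if p > 0.0)

skewed = shannon_entropy([0.999, 0.001])   # nearly certain outcome: low entropy
uniform = shannon_entropy([0.5, 0.5])      # maximum ignorance: ln 2
uniform_four = shannon_entropy([0.25] * 4) # N = 4 equiprobable events: ln 4
```

The uniform assignment gives the largest value among the two-outcome cases, consistent with the requirement that the entropy peaks at equal probabilities.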

We now turn to the core of our problem: how can we assign probabilities to a set of events, taking into account our knowledge of the system and, at the same time, ensuring that the assignment is non-committal towards unavailable knowledge? The “knowledge” at our disposal about the system under investigation is what we define as ‘evidence’, and any probability assignment is conditional on such evidence, in agreement with Cox’s construction of probability. In the case above, our knowledge of the system is only the total number $N$ of different outcomes; this is a minimal requirement. Of course, more complex evidence constraints can be applied.

It is very common that the constraints provided by the evidence are not enough to set the probabilities of each event: in this case, it is reasonable to assume that the probability assignment should make the experiment as unpredictable as possible [Footnote 3: In Jaynes_MAXENT this statement is made more precise and justified more thoroughly, with arguments based on combinatorial analysis.]. In other words, the information entropy introduced by the probability assignment should be as large as possible, in accordance with the available evidence. MAXENT formalises this reasoning by stating that probabilities should be assigned by maximizing uncertainty (information entropy) using evidence as a constraint. This defines a variational problem, where the information entropy functional $H\left[p_{1},\dots,p_{N}\right]$, defined in Eq. (6), has to be maximized.

The maximisation of the entropy, supplemented by evidence in the form of constraints that the sought-for probability distribution must obey, gives rise to several of the probability distributions most commonly employed in statistics. In the cases of interest, evidence is used to constrain, via Lagrange multipliers, the moments of the probability distribution we are seeking to evaluate. For instance, whenever the only constraint available is the normalization of the probability distribution (i.e. no evidence is available), the entropy is maximised by the uniform distribution. If we have evidence to constrain the expected value of a non-negative variable, the information entropy is maximised by the exponential distribution.
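As a worked illustration of the Lagrange-multiplier procedure, consider a variable $x\geq 0$ for which the only evidence, besides normalisation, is the mean $\mu$. The support $[0,\infty)$ and the multiplier symbols $\lambda_{0},\lambda_{1}$ are assumptions introduced here for the example; the entropy is that of Eq. (7).

```latex
\begin{aligned}
\mathcal{L}[p] &= -\int_{0}^{\infty} p\ln p\,dx
  -\lambda_{0}\left(\int_{0}^{\infty} p\,dx-1\right)
  -\lambda_{1}\left(\int_{0}^{\infty} x\,p\,dx-\mu\right),\\
\frac{\delta\mathcal{L}}{\delta p(x)} &= -\ln p(x)-1-\lambda_{0}-\lambda_{1}x=0
  \;\Rightarrow\; p(x)=e^{-1-\lambda_{0}}\,e^{-\lambda_{1}x},\\
\int_{0}^{\infty} p\,dx=1,\quad \int_{0}^{\infty} x\,p\,dx=\mu
  &\;\Rightarrow\; \lambda_{1}=\frac{1}{\mu},\quad e^{-1-\lambda_{0}}=\frac{1}{\mu}
  \;\Rightarrow\; p(x)=\frac{1}{\mu}\,e^{-x/\mu}.
\end{aligned}
```

Dropping the mean constraint ($\lambda_{1}=0$) on a bounded interval leaves a constant $p(x)$, which recovers the uniform case.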

Of particular relevance for our purposes is the case in which, in addition to the mean, also the variance is known: MAXENT leads to the Gaussian distribution. This derivation is particularly interesting from the foundational point of view, since it provides a deeper insight into the ubiquitous Gaussian distribution. Indeed, it is not only the limit distribution provided by the central limit theorem for finite variance processes but it is also the distribution that maximizes the entropy for a fixed mean and variance: from the MAXENT principle, it is the correct probability distribution to assign if the mean and covariance are the only quantities that fully define our process. In some sense, we can interpret the central limit theorem as the natural ‘statistical’ evolution toward a configuration that maximizes entropy in repeated experiments.
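The claim that the Gaussian maximises the entropy at fixed variance can be spot-checked against two other familiar distributions using their closed-form differential entropies. This is only a sanity check on a few candidates, not a proof; the function names are hypothetical and the uniform and Laplace distributions are chosen purely for comparison.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian with standard deviation sigma."""
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def uniform_entropy(sigma):
    """Differential entropy of a uniform distribution with standard deviation sigma."""
    width = math.sqrt(12.0) * sigma  # variance of U(a, b) is (b - a)^2 / 12
    return math.log(width)

def laplace_entropy(sigma):
    """Differential entropy of a Laplace distribution with standard deviation sigma."""
    scale = sigma / math.sqrt(2.0)  # variance of Laplace(b) is 2 b^2
    return 1.0 + math.log(2.0 * scale)

sigma = 1.0
entropies = {
    "gaussian": gaussian_entropy(sigma),
    "uniform": uniform_entropy(sigma),
    "laplace": laplace_entropy(sigma),
}
largest = max(entropies, key=entropies.get)
```

All three distributions share the same variance, and the Gaussian comes out with the largest entropy of the three.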

For this work, we are especially interested in the multi-dimensional case. Suppose we have a vector of measurements $(x(t_{1}),\ldots,x(t_{n}))=(x_{1},\ldots,x_{n})$ that we conveniently express as a single realization of an unknown stochastic process $x(t)$, and that we have information about the expectation value of the process $\mu(t)$ and about the matrix of autocovariances $C_{ij}\equiv C(t_{i},t_{j})$. Then the MAXENT distribution is the $n$-dimensional multivariate Gaussian distribution gregory_2005 :

$$p\left((x_{1},\ldots,x_{n})|I\right)=\frac{1}{(2\pi)^{n/2}\left(\det C\right)^{1/2}}\exp\left(-\frac{1}{2}\sum_{i,j}(x_{i}-\mu_{i})(x_{j}-\mu_{j})C^{-1}_{ij}\right)\,. \tag{8}$$

For a wide-sense stationary process the mean function is independent of time, hence it can be redefined to be equal to zero without loss of generality, and the autocovariance function depends only on the time lag $\tau\equiv t_{i}-t_{j}$. One can thus choose a sampling interval $\Delta t$ so that $C_{ij}=C((i-j)\Delta t)$. The autocovariance matrix thus becomes a Toeplitz matrix [Footnote 4: We remind the reader that a Toeplitz matrix is a matrix of the form $\begin{pmatrix}a_{0}&a_{1}&a_{2}&\ldots&\ldots&\ldots&a_{n}\\ a_{-1}&a_{0}&a_{1}&\ldots&\ldots&\ldots&a_{n-1}\\ a_{-2}&a_{-1}&a_{0}&\ldots&\ldots&\ldots&a_{n-2}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ a_{-n+1}&\ldots&\ldots&\ldots&a_{-1}&a_{0}&a_{1}\\ a_{-n}&\ldots&\ldots&\ldots&a_{-2}&a_{-1}&a_{0}\end{pmatrix}$.]. Toeplitz matrices are asymptotically equivalent to circulant matrices and are thus diagonalized by the discrete Fourier transform basis Gray . Some simple algebra shows that the time-domain multivariate Gaussian can be transformed into the equivalent frequency-domain probability distribution:

$$p\left((\tilde{x}_{1},\ldots,\tilde{x}_{n/2})|I\right)=\frac{1}{\left(2\pi\det S\right)^{n/2}}\exp\left(-\frac{1}{2}\sum_{ij}\tilde{x}_{i}S^{-1}_{ij}\tilde{x}_{j}\right)\,, \tag{9}$$

where the matrix $S_{ij}=S_{i}\delta_{ij}$ is an $n\times n$ diagonal matrix whose elements are the PSD $S(f)$ calculated at frequency $f_{i}$. Many readers will recognise the familiar form of the Whittle likelihood that stands at the basis of the matched filter method prob_information_theory and of gravitational-wave data analysis (e.g. Finn_1992 ; FINDCHIRP). Thanks to MAXENT, the problem of defining the probability distribution describing a wide-sense stationary process is thus entirely reduced to the estimation of the PSD or, equivalently, the autocovariance function.
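The diagonalisation of the autocovariance matrix by the Fourier basis is exact in the circulant limit, and can be verified on a small example. The sketch below assumes a hypothetical symmetric autocovariance sequence wrapped periodically onto six lags; the numbers and helper names are illustrative only.

```python
import cmath

def circulant(first_row):
    """Circulant matrix whose row i is the first row cyclically shifted by i."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]

def fourier_vector(k, n):
    """k-th discrete Fourier basis vector of length n."""
    return [cmath.exp(2j * cmath.pi * k * t / n) for t in range(n)]

# Hypothetical symmetric autocovariance sequence (acov[m] == acov[n - m]).
acov = [1.0, 0.5, 0.2, 0.1, 0.2, 0.5]
n = len(acov)
C = circulant(acov)

residuals = []
eigenvalues = []
for k in range(n):
    v = fourier_vector(k, n)
    # Eigenvalue: Fourier sum of the first row at frequency index k. For a
    # symmetric sequence the sign convention of the transform is immaterial.
    eigenvalue = sum(c * cmath.exp(2j * cmath.pi * k * m / n)
                     for m, c in enumerate(acov))
    Cv = [sum(C[i][j] * v[j] for j in range(n)) for i in range(n)]
    residuals.append(max(abs(Cv[i] - eigenvalue * v[i]) for i in range(n)))
    eigenvalues.append(eigenvalue)
```

Each Fourier vector is an eigenvector, and the eigenvalues are the (real) spectrum of the autocovariance sequence, which is the role played by the diagonal entries $S_{i}$ above.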

2.2 Maximum Entropy Spectral Analysis

In principle, if the autocorrelation were known exactly (i.e. at every time lag $\tau\in(-\infty,+\infty)$), the computation of the PSD would reduce to a single Fourier transform (i.e. Eq. (2)). However, in any realistic setting, we are dealing with a finite number of samples $N$ from the process. In such cases, the single periodogram is not a consistent estimator for the power spectral density, since its variance does not decrease when the sample size increases. Moreover, the error $\sigma_{k}$ in the estimate of the autocorrelation at lag $k$ grows with $k$ as $\sigma_{k}\sim 1/\sqrt{N-k}$ [Footnote 5: This is easily understood: when computing the autocorrelation at lag $k$, only $N-k$ examples of the product $x_{t}x_{t+k}$ are available, and the standard error of their average scales as the inverse of the square root of the number of points considered.], so that only a few values of the autocorrelation function can actually be computed reliably. This brings us to the core of the problem: how can we give an estimate of the PSD from partial (and noisy) knowledge of the autocorrelation function? MAXENT can guide us in this task without any a priori assumptions on the unavailable data [Footnote 6: Indeed this is the largest difference with the most common Welch method. The latter assumes that the unknown values of the autocorrelation are $0$. Clearly, this assumption is unjustified and MAXENT is a good way to relax it.].
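The shrinking number of available products at large lags can be seen in a direct implementation of the sample autocorrelation. This is a minimal sketch; the function name `sample_autocorrelation` and the toy data are hypothetical.

```python
def sample_autocorrelation(x, lag):
    """Average of x[t] * x[t + lag] over the N - lag available pairs."""
    n_pairs = len(x) - lag
    if lag < 0 or n_pairs <= 0:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    return sum(x[t] * x[t + lag] for t in range(n_pairs)) / n_pairs

x = [1.0, 2.0, 3.0, 4.0]
r0 = sample_autocorrelation(x, 0)  # four products available
r1 = sample_autocorrelation(x, 1)  # three products available
r3 = sample_autocorrelation(x, 3)  # a single product: the least reliable estimate
```

At the largest lag the "average" is a single product, so its statistical uncertainty is as large as that of one sample.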

As in the previous examples, one needs to set up a variational problem where the entropy, Eq. (7), is maximized subject to some problem-specific constraints. In our case, they are i) the PSD estimate has to be non-negative; ii) its Fourier transform has to match the sample autocorrelation (wherever an estimate of this is available).

Before doing so, there is a technicality to solve: the definition of entropy depends on a probability distribution, not on the PSD. It can be shown (e.g. AblesMESA ; Bartlett) that the variational problem can be formulated in terms of the power spectral density $S(f)$ alone by considering our signal as the result of filtering a white noise process with a filter whose transfer function $T(f)$ is equal to $S(f)$ [Footnote 7: A filter with transfer function $T(f)$ takes as input a time series $x_{t}$ and outputs a time series $y_{t}$ such that $T(f)=\frac{\tilde{y}(f)}{\tilde{x}(f)}$, where $\tilde{x}(f)$ denotes the Fourier transform of $x_{t}$ (and similarly for $y_{t}$).]. The difference in entropy between the input and the output time series (i.e. the entropy gain) obtained by applying such a filter to white noise is:

$$\Delta H=\int_{-Ny}^{Ny}\log S(f)\,df\,. \tag{10}$$

where $\Delta t$ is the sampling interval and $Ny\equiv\frac{1}{2\Delta t}$ is the Nyquist frequency. Thus maximizing Eq. (10) is equivalent to maximizing Eq. (7).

Before maximizing the entropy gain, we need to include the available evidence in the form of mathematical constraints on the assignment of $S(f)$. This is equivalent to imposing that the variational solution $S(f)$ for the PSD matches the empirical autocorrelation. Let us define a realization of a stochastic process $(x_1,\ldots,x_N)$ with sample autocorrelations $\bar{r}_k,\,k=0,\ldots,N/2$; then the PSD must satisfy the following equation:

$$\int_{-Ny}^{Ny}S(f)\,e^{\imath 2\pi fk\Delta t}\,df=\bar{r}_k\,. \tag{11}$$

Thus, by maximizing Eq. (10) with the constraints in Eq. (11), we can give an estimate of the spectrum given a time series sample. This approach to PSD computation provides a result consistent with the empirical autocorrelation function wherever this is available and, at the same time, does not make any assumption about the unavailable estimates of the autocorrelation at large time lags.
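The structure of the solution can be anticipated with a short Lagrange-multiplier sketch. The multipliers $\lambda_k$ and the highest constrained lag $M$ are auxiliary symbols introduced here for illustration only; setting to zero the functional derivative of the constrained entropy gain with respect to $S(f)$ gives:

```latex
\frac{\delta}{\delta S(f)}\left[\int_{-Ny}^{Ny}\log S(f')\,df'
 -\sum_{k=-M}^{M}\lambda_k\left(\int_{-Ny}^{Ny}S(f')\,e^{\imath 2\pi f'k\Delta t}\,df'-\bar{r}_k\right)\right]=0
\quad\Longrightarrow\quad
S(f)=\frac{1}{\sum_{k=-M}^{M}\lambda_k\,e^{\imath 2\pi fk\Delta t}}\,.
```

Under these assumptions, the maximum entropy spectrum is the reciprocal of a trigonometric polynomial in $e^{\imath 2\pi f\Delta t}$, with one term per constrained lag.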

Remarkably, the variational problem admits a closed-form analytical expression for S(f)𝑆𝑓S(f)italic_S ( italic_f ). The expression was first found by  burg1975maximum :

$$S(f)=\frac{P_N\Delta t}{\left(\sum_{s=0}^{N}a_s z^{s}\right)\left(\sum_{s=0}^{N}a^{*}_s z^{-s}\right)}\,, \tag{12}$$

where $\Delta t$ is the sampling interval of the time series, $z=\exp(2\pi i f\Delta t)$ and $a_0=1$. The vector $(1,a_1,\dots,a_N)$ is also known as the prediction error filter. The coefficients $a_s$ ($s>0$), together with an overall multiplicative scale factor $P_N$, are to be determined by an iterative process (called Burg’s algorithm). At least two implementations of Burg’s algorithm are available in the literature, labeled as ‘Standard’ and ‘Fast’ in the memspectrum package. The ‘Standard’ method is slower but more stable, while ‘Fast’ trades stability for speed. On simulated stationary data, both versions typically yield similar results, while our tests with real gravitational wave data seem to indicate that the ‘Fast’ implementation introduces noise into the PSD estimate. (For this reason, it is advisable to use the ‘Standard’ implementation whenever possible. In most cases of numerical instability in the ‘Fast’ method, memspectrum will send a warning to the user.) A comparison of the computational times of the Standard and Fast MESA implementations (together with Welch’s) is provided in Appendix B.

The number $N$ of such coefficients is a choice that must be made by the user; indeed, it is the only hyperparameter that needs to be tuned. The details of the derivation and the actual form of the coefficients $a_s$ can be found in Appendix A.
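As an illustration of how Eq. (12) can be evaluated once the coefficients are known, the following NumPy sketch computes $S(f)$ on a frequency grid. The function name `mesa_psd` and its arguments are illustrative choices made here; they are not the memspectrum API.

```python
import numpy as np

def mesa_psd(frequencies, a, P, dt):
    """Evaluate Eq. (12): S(f) = P * dt / |sum_s a_s z^s|^2, z = exp(2*pi*i*f*dt).

    a  : prediction error filter (1, a_1, ..., a_N), with a[0] == 1
    P  : overall scale factor P_N
    dt : sampling interval of the time series
    """
    frequencies = np.asarray(frequencies, dtype=float)
    a = np.asarray(a, dtype=complex)
    s = np.arange(len(a))
    z = np.exp(2j * np.pi * frequencies * dt)
    # A(f) = sum_s a_s z^s; on the unit circle the second factor in the
    # denominator of Eq. (12) is the complex conjugate of A(f).
    A = (a[None, :] * z[:, None] ** s[None, :]).sum(axis=1)
    return P * dt / np.abs(A) ** 2

dt = 1.0 / 2048.0
f = np.linspace(0.0, 0.5 / dt, 513)                     # from 0 to the Nyquist frequency
psd_white = mesa_psd(f, a=[1.0], P=2.0, dt=dt)          # no coefficients: flat spectrum
psd_ar1 = mesa_psd(f, a=[1.0, -0.9], P=2.0, dt=dt)      # one coefficient: red spectrum
```

With a single coefficient the spectrum is already strongly coloured, which gives an idea of how a few coefficients can encode a smooth PSD.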

2.3 Autoregressive Process Analogy

The application of MESA is not limited to spectral estimates: it also provides a link between spectral analysis and the study of autoregressive (AR) processes doi:10.1029/RG013i001p00183 . An autoregressive stationary process of order $p$, AR($p$), is a time series whose values satisfy the following expression:

$$x_t-b_1x_{t-1}-b_2x_{t-2}-\dots-b_px_{t-p}=\nu_t \tag{13}$$

where $b_1,\ldots,b_p$ are real coefficients and $\nu_t$ is white noise with a given variance $\sigma^2$. Thus, an AR($p$) process models the dependence of the value of the process at time $t$ on the last $p$ observations, and is therefore potentially able to model complex autocorrelation structures within observations.

Thanks to Wold’s theorem Wold_theorem , every stationary time series can be represented as an autoregressive process: this ensures that maximum entropy estimation is faithful and general. It turns out that the maximum entropy principle provides a representation of the time series as an AR($p$) process, and Burg’s algorithm computes the corresponding autoregressive coefficients that are suitable to model the available data.
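To give a flavour of how such coefficients can be obtained from data, the sketch below implements the textbook form of Burg’s recursion, in which each step picks the reflection coefficient that minimises the summed forward and backward prediction error power. It is an assumption-laden illustration: the function name `burg` is hypothetical, the code need not coincide with either the ‘Standard’ or the ‘Fast’ implementation in memspectrum or with the expressions of Appendix A, and the returned `P` is the prediction error variance in units of $x^2$.

```python
import numpy as np

def burg(x, order):
    """Textbook Burg recursion: returns the filter (1, a_1, ..., a_order) and P."""
    x = np.asarray(x, dtype=float)
    a = np.array([1.0])
    P = np.mean(x ** 2)                      # zeroth-order prediction error power
    f = x[1:].copy()                         # forward prediction errors
    b = x[:-1].copy()                        # backward prediction errors
    for _ in range(order):
        # reflection coefficient minimising the forward + backward error power
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        a_ext = np.append(a, 0.0)
        a = a_ext + k * a_ext[::-1]          # Levinson-type update of the filter
        P *= 1.0 - k ** 2
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
    return a, P

# Synthetic AR(2) data: x_t = 0.75 x_{t-1} - 0.5 x_{t-2} + nu_t, with unit-variance noise
rng = np.random.default_rng(1)
n = 20000
x = np.zeros(n + 200)
nu = rng.normal(size=n + 200)
for t in range(2, n + 200):
    x[t] = 0.75 * x[t - 1] - 0.5 * x[t - 2] + nu[t]
a_hat, P_hat = burg(x[200:], order=2)        # drop the initial transient
print(a_hat, P_hat)
```

With the convention $b_i=-a_i$ used in this section, the recovered filter for this example should be close to $(1,-0.75,0.5)$.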

To show the analogy, we compute the PSD $S_{AR(p)}$ of an AR($p$) process and show that it is formally equivalent to the PSD obtained in Eq. (12). This will also provide a direct expression for the autoregressive coefficients $b_i$ and for the noise variance $\sigma^2$. We start by taking the $z$ transform (the discrete-time equivalent of the Laplace transform, which takes a discrete time series and returns a complex frequency series) of Eq. (13):

$$\sum_t x_t z^t-\sum_i b_i z^i\sum_t x_{t-i}z^{t-i}=\sum_t \nu_t z^t\,. \tag{14}$$

Calling $\tilde{x}(z)$ and $\tilde{\nu}(z)$ the transformed quantities in the $z$ domain, the process takes the form:

$$\tilde{x}(z)=\frac{\tilde{\nu}(z)}{\left(1-\sum_{n=1}^{p}b_n z^n\right)}\,. \tag{15}$$

Since we assumed a wide-sense stationary process, $\tilde{x}(z)$ is analytic both on and inside the unit circle. Taking its squared modulus and evaluating it on the unit circle $z=e^{\imath 2\pi f\Delta t}$, from the definition of spectral density one obtains:

$$S_{AR(p)}(f)=|\tilde{x}(z)|^2=\frac{|\tilde{\nu}(f)|^2}{\left|1-\sum_{n=1}^{p}b_n e^{\imath 2\pi fn\Delta t}\right|^2}\,. \tag{16}$$

The numerator is the spectral density of the white noise $\nu_t$, i.e. its (constant) variance $\sigma^2$.

Eqs. (16) and (12) are equivalent if we identify $b_i=-a_i$ and $P_N\Delta t=\sigma^2$. This shows that the MAXENT estimation of the PSD models the observed time series as an AR process and provides a fit for the autoregressive coefficients. Furthermore, as a consequence of Wold’s theorem, there is the theoretical guarantee that every stationary time series can be modelled faithfully by MAXENT.

2.4 Forecasting

The link between MESA and AR processes is of particular interest. Once Burg’s recursion has been solved for the $a_k$, we automatically obtain the coefficients of the equivalent AR process; hence we can exploit Eq. (13) to perform forecasting, i.e. to provide plausible future observations conditioned on the observed data. Indeed, for an AR($p$) process the conditional probability $p(x_t|x_{t-1},\ldots,x_{t-p})$ of the observation at time $t$ given the past $p$ observations has the form:

$$p(x_t|x_{t-1},\ldots,x_{t-p})=\frac{1}{\sigma\sqrt{2\pi}}\exp\left[-\frac{1}{2}\left(\frac{x_t-\sum_{i=1}^{p}b_i x_{t-i}}{\sigma}\right)^2\right]\,. \tag{17}$$

The interpretation of Eq. (17) is straightforward: $x_t$ follows a Gaussian distribution with a fixed variance and a mean value $m_t=\sum_{i=1}^{p}b_i x_{t-i}$ computed from past observations. Eq. (17) thus provides a well defined probabilistic framework for predicting future observations: this is a very useful feature of MESA, which does not have an equivalent in any other spectral analysis method.
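The forecasting procedure can be sketched as repeated sampling from this conditional Gaussian, feeding each drawn value back in as a new "past" observation. The following NumPy example is a minimal illustration under that assumption; the function name `forecast` and its arguments are hypothetical and do not describe the memspectrum interface.

```python
import numpy as np

def forecast(x, b, sigma, n_future, n_draws, rng):
    """Draw future paths with x_t ~ N(sum_i b_i x_{t-i}, sigma^2), for p = len(b) >= 1.

    x : observed time series (only its last p values are used)
    b : AR coefficients (b_1, ..., b_p)
    Returns an array of shape (n_draws, n_future).
    """
    b = np.asarray(b, dtype=float)
    p = len(b)
    paths = np.empty((n_draws, p + n_future))
    paths[:, :p] = np.asarray(x, dtype=float)[-p:]   # condition on the last p observations
    for t in range(p, p + n_future):
        past = paths[:, t - p : t][:, ::-1]          # (x_{t-1}, ..., x_{t-p}) for every draw
        mean = past @ b                              # conditional mean m_t
        paths[:, t] = rng.normal(loc=mean, scale=sigma)
    return paths[:, p:]

rng = np.random.default_rng(2)
observed = np.array([0.3, -1.2, 0.8, 2.0])
future = forecast(observed, b=[0.5], sigma=1.0, n_future=3, n_draws=5000, rng=rng)
print(future.mean(axis=0), future.std(axis=0))       # ensemble mean and spread per step
```

The spread of the ensemble grows with the forecasting horizon, since the uncertainty of each predicted sample propagates to the following ones.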

2.5 Whitening

The theory of AR processes can also be applied to the problem of whitening a time series. Given a time series $x_t$, the whitening operation produces another time series $x^W_t$ such that:

$$x^W_t=\mathcal{F}^{-1}\left[\frac{\tilde{x}(f)}{\sqrt{S(f)}}\right] \tag{18}$$

where $\mathcal{F}^{-1}$ denotes the inverse Fourier transform of a frequency series. If $x_t$ is a realization of Gaussian noise (see Eq. (2.1)) with PSD $S(f)$, the whitened time series $x^W_t$ is just white noise (i.e. uncorrelated samples from a standard normal distribution).

From Eq. (13), remembering that $b_i=-a_i$, it is straightforward to derive an expression for the whitened time series $x^W_t$:

$$x^W_t=\frac{1}{\sqrt{P_N}}\sum_{i=0}^{p}a_i x_{t-i} \tag{19}$$

This amounts to a convolution of the time series $x_t$ with the kernel $(1,a_1,\ldots,a_p)$, plus a variance rescaling. Performing a convolution is an appealing alternative to evaluating Eq. (18) directly.
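A minimal NumPy sketch of this time-domain whitening is given below. It assumes that `P` is the variance of the white noise driving the AR process, and the function name `whiten` is an illustrative choice rather than the memspectrum interface.

```python
import numpy as np

def whiten(x, a, P):
    """Time-domain whitening: x^W_t = (1 / sqrt(P)) * sum_{i=0}^{p} a_i x_{t-i}."""
    x = np.asarray(x, dtype=float)
    a = np.asarray(a, dtype=float)
    # mode="valid" keeps only the samples for which all p past values exist,
    # so the output is p samples shorter than the input.
    return np.convolve(x, a, mode="valid") / np.sqrt(P)

# Synthetic AR(1) data: x_t = 0.9 x_{t-1} + nu_t, i.e. a = (1, -0.9)
rng = np.random.default_rng(3)
sigma = 2.0
nu = rng.normal(scale=sigma, size=5000)
x = np.empty_like(nu)
x[0] = nu[0]
for t in range(1, len(nu)):
    x[t] = 0.9 * x[t - 1] + nu[t]

x_white = whiten(x, a=[1.0, -0.9], P=sigma ** 2)
print(x_white.mean(), x_white.std())
```

In this toy case the filter is known exactly, so the whitened series reproduces the driving noise rescaled to unit variance; with an estimated filter the result is only approximately white.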

3 Validation of the model

MESA provides a recursive formula for computing the coefficients $a_k$ in Eq. (12). The number $M$ of such coefficients is equivalent to the maximum order of the autocorrelation $\bar{r}_m$ considered. In an ideal scenario, this would be equal to the number of points the autocorrelation is computed at (equivalent to the length of the data considered). However, the computation of high-order coefficients of the autocorrelation is unstable: for high enough $m$, the estimate of $\bar{r}_m$ shows a very high variance, broadly scaling as $\sim\left(\sqrt{M-m}\right)^{-1}$.

It is then clear that the choice of the number of samples of the discrete autocorrelation to consider is important: on the one hand, it is advisable to include as much knowledge of the autocorrelation as possible, which suggests including all the known $\bar{r}_m$; on the other hand, including values of the autocorrelation that are not reliably estimated can be counterproductive. The order $M$ of the autocorrelation to be considered (or, equivalently, the order $M$ of the underlying autoregressive process) is the only tuning parameter of MESA, and a careful balance between these two necessities must be struck when applying the algorithm.

The remainder of this section is devoted to an extensive study of how to make such a choice. In Section 3.1, we define two different loss functions to measure how well the algorithm is able to reproduce a known PSD. The basic idea is to evaluate the loss function as the autoregressive order increases and to pick, among the orders considered, the one that yields the best result. The performance of the different losses will be assessed by answering two questions: (i) how well the AR order is recovered and (ii) how well the measured PSD is able to whiten the input time series. These are discussed in Sec. 3.3 and Sec. 3.4.

3.1 Choice of the autoregressive order

Guided by numerical experiments, an indication of the upper bound $M_{max}$ on the autoregressive order is doi:10.1190/1.1440902 :

$$M_{max}=2N/\ln(2N)\,, \tag{20}$$

where $N$ is the number of observed points in the time series. However, this is just a plausible upper limit on the order $m$ of the AR process, and the optimal algorithm could employ fewer points. We then need a more sophisticated method for computing the right value of $m$. Several criteria have been proposed; we summarise two of them below:

  • Final Prediction Error (FPE) The first criterion is due to  Akaike1970StatisticalPI . It proposes that $m$ should be chosen as the filter length that minimizes the error when the filter is used as a predictor, the final prediction error (FPE):

    $$FPE(m)=\mathbb{E}\left[(x_t-\hat{x}_t)^2\right] \tag{21}$$

    with $\hat{x}_t=\sum_{i=1}^{m}a_i x_{t-i}$. Asymptotically, minimizing the FPE is equivalent to minimizing the quantity:

    $$\mathcal{L}_{\rm FPE}(m)=P_m\frac{N+m+1}{N-m-1} \tag{22}$$

    with $P_m$ being the estimated noise variance at order $m$, see Eq. (33). In the $N\to\infty$ limit, remembering $m_{max}\sim 2N/\log(2N)$, Akaike’s loss function is equivalent to the minimization of the variance $P_m$ of the white noise of the underlying AR($p$) model.

  • Variance Maximum (VM) This second criterion kay1988modern is based on assumptions similar to those of FPE. It minimises the actual value of the least squares (instead of relying on the asymptotic behaviour). The normalising factor takes into account the $k$ degrees of freedom necessary to estimate the forward prediction error filter $a_k$.
    The quantity to be minimised is

    $$VM(m)=\frac{1}{N-2m}\sum_{t=m}^{N}\left(x_t-\sum_{i=1}^{m}a_i x_{t-i}\right)^2 \tag{23}$$

    The package implementation of VM loss function takes advantage of a recursive re-writing of the above formula, as in Eqs (27) and (28) of Cuoco_2001 .

    Several other criteria are available in the literature doi:10.1029/WR018i004p01097 ; bhansali1986 and some are implemented in the memspectrum package. We do not report them in this paper since they did not show any additional merit with respect to the aforementioned loss functions.

Once a loss function is selected, the choice of the best recursion order is straightforward: we solve the Levinson recursion doi:10.1002/sapm1946251261 until $M_{max}$ iterations, as given in Eq. (20), are reached. Then, the order $m$ is selected to be the one that minimizes the specified loss function.
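The selection procedure can be sketched as follows for the FPE loss of Eq. (22). This is an illustration under stated assumptions: the noise variances $P_m$ are obtained here from a textbook Burg reflection-coefficient recursion (which may differ in detail from the recursion used in memspectrum), the series is assumed long enough that $N-m-1>0$ for every order considered, and the name `fpe_order_selection` is hypothetical.

```python
import numpy as np

def fpe_order_selection(x, max_order=None):
    """Return the order minimising L_FPE(m) = P_m (N+m+1)/(N-m-1), and all losses."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    if max_order is None:
        max_order = int(2 * N / np.log(2 * N))          # upper bound of Eq. (20)
    f, b = x[1:].copy(), x[:-1].copy()                  # forward / backward errors
    P = np.mean(x ** 2)                                 # P_0
    losses = [P * (N + 1) / (N - 1)]                    # loss at m = 0
    for m in range(1, max_order + 1):
        k = -2.0 * np.dot(f, b) / (np.dot(f, f) + np.dot(b, b))
        P *= 1.0 - k ** 2                               # P_m from P_{m-1}
        f, b = (f + k * b)[1:], (b + k * f)[:-1]
        losses.append(P * (N + m + 1) / (N - m - 1))
    losses = np.array(losses)
    return int(np.argmin(losses)), losses

# Synthetic AR(2) data: x_t = 0.75 x_{t-1} - 0.5 x_{t-2} + nu_t
rng = np.random.default_rng(5)
N = 2000
x = np.zeros(N + 200)
nu = rng.normal(size=N + 200)
for t in range(2, N + 200):
    x[t] = 0.75 * x[t - 1] - 0.5 * x[t - 2] + nu[t]
m_opt, losses = fpe_order_selection(x[200:])
print(m_opt, len(losses) - 1)                           # selected order and largest order tried
```

The loss drops sharply until the true order is reached and then grows slowly, because the penalty factor outweighs the small residual decrease of $P_m$.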

In a real implementation of the algorithm, computing the whole recursion up to $M_{max}$ can result in a significant waste of computational power: the optimal value is often $m_{opt}\ll M_{max}$ and, in such cases, computing all the values of $m$ until $M_{max}$ is not useful. In practice, we can apply an early stop procedure: every few iterations we look for the best order $m_{opt}$; if this value does not change for a while, we assume that a good (local) minimum of the loss function has been found and the computation is stopped.
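A minimal sketch of such an early stop rule is given below. The parameters `check_every` and `patience`, as well as the function name, are hypothetical choices made for illustration; the actual thresholds used by memspectrum are not specified here. The loss values are consumed lazily, so orders beyond the stopping point are never computed.

```python
def early_stop_order(loss_stream, check_every=10, patience=3):
    """Return (best order, number of orders evaluated).

    loss_stream : iterable yielding the loss at order m = 0, 1, 2, ...
    Every `check_every` orders the current best order is compared with the one
    found at the previous check; after `patience` unchanged checks we stop.
    """
    best_order, best_loss = 0, float("inf")
    last_checked_best, unchanged_checks = None, 0
    n_evaluated = 0
    for m, loss in enumerate(loss_stream):
        n_evaluated = m + 1
        if loss < best_loss:
            best_order, best_loss = m, loss
        if (m + 1) % check_every == 0:
            if best_order == last_checked_best:
                unchanged_checks += 1
                if unchanged_checks >= patience:
                    break
            else:
                last_checked_best, unchanged_checks = best_order, 0
    return best_order, n_evaluated

# Toy loss with a single minimum at m = 7, generated lazily up to m = 999
toy_losses = ((m - 7) ** 2 + 1.0 for m in range(1000))
result = early_stop_order(toy_losses)
print(result)  # -> (7, 40)
```

In this toy case the recursion stops after 40 orders instead of 1000; the price is that a deeper minimum at a much larger order would be missed, which is why the minimum found is only guaranteed to be local.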

The following sections are devoted to the study of the statistical properties of the loss functions introduced above: we need to understand which choice provides the best quality in the reproduction of some known power spectral densities. In the following, we discuss three different comparisons (one qualitative and two quantitative) of the two proposed loss functions.

3.2 How accurate are the reconstructed PSDs?

Figure 1: Comparison of the ensemble average PSD estimate

In our initial qualitative comparison, depicted in Figure 1, we juxtapose the reconstruction of a known a-priori Power Spectral Density with those obtained using the two distinct loss functions. The red line in the plot represents our chosen reference PSD, derived from the estimated PSD for the GW150914 event.
To conduct this analysis, we generated 1000 noise time series whose power spectral densities match the reference PSD by construction noiseGen . The sampling rate and observation time were fixed at $dt=2048$ Hz and $T=5$ s, respectively. For each noise realization, we employed both the FPE and the VM loss functions to estimate the PSD. Ultimately, we compared the reference PSD against the ensemble average of these two estimation methods.
The FPE-derived estimate, represented by the dashed line, effectively identifies and reconstructs peaks across both high and low frequency ranges with commendable accuracy. However, as illustrated in the inset plots, FPE struggles when confronted with structured peaks—those containing subordinate modes. In such cases, FPE accurately captures the primary mode but overlooks the subsidiary peaks.
On the other hand, the VM estimate, depicted as a dot-dashed line, excels in reconstructing both dominant and subordinate modes with remarkable precision. VM appears to prioritize comprehensive mode reconstruction, while FPE emphasizes an accurate reconstruction of major modes while potentially neglecting more intricate sub-peaks.

3.3 How well is the AR order recovered?

Figure 2: Reconstructed value for the autoregressive order plotted against the true value of the autoregressive order. The reconstructed autoregressive orders are computed from a time series randomly drawn with an AR($p$) model, with the two different loss functions under investigation.

Moving to our second comparison, we now focus on another crucial aspect: how accurately each loss function estimates the Autoregressive (AR) order, which represents the number of employed $a_k$ coefficients.
Here, the memspectrum package proves quite useful. It allows us to assign a specific order to the reconstructed autoregressive filter and use the resulting coefficients to forecast time series. With these tools in hand, we generated various time series, each with a different autoregressive order ranging from m = 0 to m = 4000.
To ensure reliability, we created 30 distinct time series for each autoregressive order. This approach lets us compute both the mean and variance, giving us insights into the accuracy of each loss function’s order estimation. This analysis provides valuable information about how well each method performs in estimating the Autoregressive order across a broad spectrum of scenarios.
The results are reported in Figure 2. The injected autoregressive order’s true value is depicted by the red line. The estimations yielded by the two loss functions are illustrated alongside, accompanied by error bars indicating one standard deviation.
The plot reveals two distinct regions: one with “short” autoregressive orders (m = 0 to around m = 1600) and another with “long” autoregressive orders (starting from m = 1600).
In the first region, both loss functions provide comparable results that generally match the actual autoregressive order. FPE performs particularly well, offering estimates close to the injected order and with minimal error bars. VM performs slightly worse than FPE in this range, overestimating complexity and showing larger error bars.
Moving into the second region ($m>1600$), a shift in performance becomes apparent. FPE’s estimates tend to stabilize at a certain autoregressive value. However, as the injected model becomes more complex beyond this point, FPE’s accuracy in recovering the true order diminishes, and its variance increases. In contrast, VM performs better in this range, closely following the actual behavior and consistently recovering the true order within one standard deviation. To conclude, VM appears to prioritize complexity in its approach. In contrast, FPE seems to lean toward synthesis, emphasizing accurate reconstruction of not too complex models.

3.4 How well can MESA whiten the data?

Figure 3: Histogram of the p-values obtained with a Kolmogorov-Smirnov test on the whitened time series, against a univariate, zero-mean normal distribution.

In Section 2.5, we showed how the autoregressive coefficients and the noise variance estimate $P$ can jointly be used to create a whitening filter, as in Eqs. (18) and (19). To complete our investigation, we compare how well these whitening filters work when obtained from the two different loss functions we have been studying.
For this test, we employed the same set of time-series data as described in Section 3.2. Each time series underwent the whitening process using the autoregressive filter derived from its corresponding loss function. We then evaluated the resulting whitened time series against a zero-mean, univariate normal distribution using the Kolmogorov-Smirnov test.
The results are reported as a histogram of the obtained p-values in Figure 3, together with the chosen critical region $p<0.1$, corresponding to a $90\%$ confidence level. In this region, there is no statistically significant difference between the two loss functions: the total counts in this bin are $c_{VM}=47\pm 7$ and $c_{FPE}=42\pm 6$, respectively. This final test therefore shows that the two loss functions whiten the data equally well, and that very long filters are not needed to obtain a comparable whitening.
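The whitening test can be sketched on a toy example. The snippet below is only an illustration under stated assumptions: the AR(2) coefficients and the noise variance are taken as known rather than estimated with MESA, the sign convention for the coefficients is chosen for convenience, and the filters are applied with `scipy.signal.lfilter`. The whitened series is then compared with a standard normal distribution through `scipy.stats.kstest`.

```python
import numpy as np
from scipy import signal, stats

rng = np.random.default_rng(1)
n = 4096
sigma = 2.0                    # innovation standard deviation, sqrt(P)
phi = np.array([0.75, -0.5])   # x_t = 0.75 x_{t-1} - 0.5 x_{t-2} + sigma * eps_t
eps = rng.standard_normal(n)

# Colour the noise: recursive filter with denominator [1, -phi_1, -phi_2]
x = signal.lfilter([sigma], np.r_[1.0, -phi], eps)

# Whiten: apply the inverse (finite impulse response) filter and
# rescale by sqrt(P) to obtain unit variance
whitened = signal.lfilter(np.r_[1.0, -phi], [1.0], x) / sigma

# Kolmogorov-Smirnov test against a zero-mean, unit-variance normal
ks = stats.kstest(whitened, "norm")
print(ks.statistic, ks.pvalue)
```

Because the filter used for whitening is the exact inverse of the one used to colour the noise, the whitened series coincides with the injected innovations. With coefficients estimated from the data, the agreement is only approximate, which is what the p-value histogram in Figure 3 probes.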

From our previous discussions, it’s evident that both FPE and VM have their own strengths, and the choice between them greatly depends on the specific analysis requirements. In our analysis, VM tends to provide more accurate PSD estimates and often results in longer autoregressive filters. However, in cases where the underlying model is simple, there is a risk of VM overestimating complexity and generating patterns that don’t truly reflect the data.
On the other hand, FPE is a good option for reconstructing processes without introducing unnecessary complexity. However, it might underestimate the complexity of the data, particularly in scenarios involving secondary peaks or in the low-frequency region.

Lastly, it’s worth noting that FPE holds the advantage of lower numerical complexity due to its straightforward calculations involving simple arithmetic. In contrast, VM requires more complex computations, dealing with arrays that might be very long depending on the analysed data.

4 Comparison with Welch method

We perform a qualitative comparison between the performance of MESA and of the standard Welch algorithm. The comparison is necessarily qualitative: since the results are problem dependent, it is very hard to condense them into a single metric. Although similar studies can be carried out for any other PSD, in this section we focus on a single PSD and try to generalize from the observations we make. We use the analytical PSD computed for the LIGO Hanford interferometer, released together with the GWTC-1 catalog GWTC1 ; PSD_release , and computed with the BayesLine package Cornish_2015 ; Littenberg_2015 ; Cornish_2020 ; Chatziioannou_2019 .

We simulate data from the PSD used for the analysis of the event GW150914 (footnote 10: this ensures that we have a baseline PSD to compare the estimates with) and we employ both Welch's method and MESA to estimate the spectrum. We vary the length of the data used for the estimation, which also lets us assess how the estimate depends on the amount of data available. We set the total observation time to $T=1,5,10,100,1000\,\mathrm{s}$. For the MESA algorithm, we choose the VM loss function. For the Welch algorithm, we employ a Tukey window with shape parameter $\alpha=0.4$ (see the scipy documentation), an overlap fraction of $1/2$ between segments and a segment length of $L=512,1024,2048,8192,32768$ points, depending on the observation time. In all cases, the sampling rate is set to $4096\,\mathrm{Hz}$. For the Welch algorithm, we use the standard implementation provided by the Python library scipy numpy ; scipy . The results from MESA and from Welch's method are summarized in Figures 4 and 5, respectively.
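For concreteness, the Welch configuration described above might be set up as in the following sketch. Two assumptions are made: white Gaussian noise stands in for the data simulated from the GW150914 PSD, and the segment length $L=2048$ is paired with $T=10\,\mathrm{s}$, a pairing suggested by the ordering of the two lists but not stated explicitly in the text.

```python
import numpy as np
from scipy import signal

fs = 4096.0   # sampling rate [Hz], as in the text
T = 10.0      # observation time [s]
L = 2048      # segment length in points (assumed pairing with T = 10 s)

rng = np.random.default_rng(2)
sigma = 1.0
x = sigma * rng.standard_normal(int(T * fs))   # white-noise stand-in

# Welch estimate: Tukey window with alpha = 0.4 and 50% overlap
freqs, psd = signal.welch(
    x,
    fs=fs,
    window=("tukey", 0.4),
    nperseg=L,
    noverlap=L // 2,
)

# One-sided PSD of white noise with variance sigma^2
expected = 2.0 * sigma**2 / fs
ratio = np.mean(psd[1:-1]) / expected
print(ratio)
```

For white noise the ratio should be close to one. The first and last frequency bins are excluded because the one-sided estimate treats them differently from the others.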

First of all, we note that using a longer time series results in a better estimation of the PSD, especially at low frequencies. This is expected: longer data streams probe lower frequencies, since the frequency resolution scales as $1/T$, and provide better estimates of the FFT, in the Welch case, and of the sample autocorrelation, for MESA.

We also note that MESA converges (Figures 4 and 5) to the underlying spectrum much faster than Welch's method, providing a better estimate even for short time series. Although observed at every frequency, this behaviour is more evident in the low-frequency region. An accurate profile reconstruction can be obtained with MESA using only 5 seconds of strain data, while Welch's method requires at least 10 seconds of data to obtain a comparable profile. Furthermore, MESA is able to model all the details of the peak at around $\sim 40\,\mathrm{Hz}$ (even with $T=100\,\mathrm{s}$), while Welch's algorithm fails to do so even with an observation time of $T=1000\,\mathrm{s}$.

Another important element is the noise of the spectral estimation: we find that the PSD estimate provided by Welch's method is noisier (i.e. has a larger number of spurious peaks) than the PSD measured with MESA and the FPE loss function. This is especially true at high frequencies and for long observation times $T$.

Finally, as already discussed, Welch's method depends strongly on the choice of window function. A Tukey window with the aforementioned parameters is what we found to be the best compromise between noise and accuracy for the reconstruction, but different choices can be made, possibly providing more accurate results than the ones reported here. However, we want to stress that this fact does not invalidate our discussion but reinforces it: one of the most appealing advantages of MESA is the minimal amount of fine tuning required.

Figure 4: Comparison between the analytic (dashed line) and estimated (red line) spectrum. The estimation is performed with the Maximum Entropy method on synthetic data, with an increasing observation time $T=1,5,10,100,1000\,\mathrm{s}$.
Figure 5: Comparison between the analytic (dashed line) and estimated (green line) spectrum. The estimation is performed with Welch's method on synthetic data, with an increasing observation time $T=1,5,10,100,1000\,\mathrm{s}$.

5 Marginalisation over the noise distribution: application to GW parameter estimation

Define the data hypothesis $D$ as the statement that the data are $D=S+N$, with $S$ and $N$ some deterministic signal hypothesis and some noise hypothesis. Typically, in this formulation one chooses both a functional form for the signal of interest, $S\equiv$ "$h(t;\theta)$", and some parametric form $f(t)$ for the noise distribution, $N\equiv$ "$n(t)\sim f(t)$". Some well established math then leads to the usual Bayesian framework for parameter estimation, see lalinference for an application to gravitational wave physics. This procedure is very robust as long as the choice of noise distribution is indeed representative of the underlying process. Let us relax the $N$ hypothesis by defining a residuals hypothesis $R$ as $R=D-S$. This might seem a trivial statement, but it has a non-trivial application: given $d(t)=h(t;\theta)+n(t)$, where $h(t;\theta)$ is our signal model, defined by a set of parameters $\theta$, the residuals are $r(t)\equiv d(t)-h(t;\theta)$. Formally, no reference to the noise process is present anymore.
Under MAXENT, we can model $r(t)\sim\mathrm{AR}(k)$, with $k$ the unknown order of the process to be inferred from the residuals, either via one of the aforementioned loss functions or by marginalising over it while exploring the signal space. Moreover, we can always write $p(r(t)|N\,I)$ as in Eq. (2.1) once we know $k$, with the PSD given in Eq. (16), whatever the noise process actually is. In other words, we care only about maximising the information entropy in the distribution of the residuals.

Hence, as an application of MESA and of its implementation in memspectrum, we analyse GW150914 gw150914 using a Bayesian framework that allows for the marginalisation over the order $k$ of the AR($k$) process representing the residual data stream. Although the inference is essentially unchanged compared to the standard case, see lalinference , there are some substantial modifications to the likelihood construction. Since MESA is applicable to time-domain data, all calculations prior to the Fourier transform must be performed in the time domain, thus increasing the computational cost by a non-negligible amount. We refer the time-of-arrival parameter $t_c$ of the GW to the geocenter. At each iteration of the inference algorithm, we sample a vector $\theta\equiv\theta_{GW}\cup k$ (footnote 11: we indicate the set of all GW parameters, i.e. component masses, spins, luminosity distance, etc., with $\theta_{GW}$). For each interferometer $j$, therefore, we need to compute a time delay $\Delta t_j$ to evaluate the antenna response functions $F_{j,+}(t+\Delta t_j)$, $F_{j,\times}(t+\Delta t_j)$ as well as the correct time shift for the GW template

$$h_{j}(t)=\sum_{p=+,\times}F_{j,p}(t+\Delta t_{j})\,h_{p}(t+\Delta t_{j};\theta_{GW}) \tag{24}$$

that we use to compute the time-domain residuals $r_j(t)=d_j(t)-h_j(t)$. We apply memspectrum to $r_j(t)$ with the fixed value of $k$ and calculate the detector likelihood for $\tilde{r}_j$ using Eq. (2.1) and the PSD as in Eq. (12). The coherent likelihood is then given by the product of the individual likelihoods. As our analysis template, we adopt the fast machine-learning-based MLGW model MLGW , an aligned-spin model trained on TEOBResumS teobresums , that has been shown to perform well on LVK events detected during O1 and O2 MLGW . Our sampler is a nested sampling algorithm john_veitch_2020_4109277 and the specific inference model is implemented as part of granite, a dedicated inference model for ground-based interferometric detectors. We compare our results with the combined posterior samples released through GWOSC GWOSC and available at https://zenodo.org/records/6513631.
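The idea of evaluating a likelihood on the residuals alone can be illustrated with a deliberately simplified toy model. The sketch below is not the granite likelihood: it uses a time-domain conditional Gaussian likelihood rather than the frequency-domain expression referred to in the text, it keeps the AR model fixed instead of re-estimating it from the residuals at each step, and it replaces the gravitational wave template with a hypothetical sinusoid of unknown amplitude. All function names and numerical values are assumptions made for illustration.

```python
import numpy as np

def ar_log_likelihood(residuals, phi, P):
    """Conditional Gaussian log-likelihood of the residuals under
    r_t = sum_i phi[i] * r_{t-1-i} + e_t, with e_t ~ N(0, P)."""
    r = np.asarray(residuals, dtype=float)
    k = len(phi)
    pred = np.zeros(len(r) - k)
    for i, c in enumerate(phi):
        pred += c * r[k - 1 - i : len(r) - 1 - i]
    e = r[k:] - pred   # prediction errors, i.e. whitened residuals times sqrt(P)
    return -0.5 * np.sum(e**2 / P + np.log(2.0 * np.pi * P))

def template(t, amplitude, f0=50.0):
    """Hypothetical signal model: a sinusoid of known frequency."""
    return amplitude * np.sin(2.0 * np.pi * f0 * t)

rng = np.random.default_rng(3)
fs, n = 4096.0, 16384
t = np.arange(n) / fs

# Coloured noise from a known AR(1) process
phi, P = np.array([0.6]), 1.0
eps = rng.standard_normal(n)
noise = np.zeros(n)
for i in range(1, n):
    noise[i] = phi[0] * noise[i - 1] + eps[i]

data = template(t, 0.5) + noise   # injected amplitude: 0.5

# Scan the signal parameter: residuals r = d - h(theta) for each trial value
amplitudes = np.linspace(0.0, 1.0, 21)
logL = [ar_log_likelihood(data - template(t, A), phi, P) for A in amplitudes]
best = amplitudes[int(np.argmax(logL))]
print(best)
```

The likelihood peaks close to the injected amplitude, because only for the correct signal parameters are the residuals compatible with the AR description of the noise.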

Figure 6: Posterior samples for $\mathcal{M}$ and mass ratio $q$ from the LVK (blue) and using memspectrum (red). The samples are largely consistent between the two models, with the MESA model providing a more conservative estimate.
Figure 7: Posterior samples for the sky position angles from the LVK (blue) and using memspectrum (red). The samples are largely consistent between the two models, with the MESA model providing a more conservative estimate.
Figure 8: Whitened reconstructed waveforms and data from our analysis for the Hanford detector (top panel) and the Livingston detector (bottom panel). The shaded turquoise area indicates the 90% credible region over the waveform space, while the purple contours indicate the 90% credible regions over the whitened data.
Figure 9: Posterior distributions for the AR process orders in the Hanford (red) and Livingston (blue) detectors.
Figure 10: Top panel: Posterior PSDs for the Hanford (orange line) and Livingston (blue line) detectors as inferred by our analysis. Bottom panel: Relative uncertainty around the median for both the Hanford (blue) and the Livingston (orange) PSDs.

In Figs. 6, 7 and 8 we show the posteriors for the set of intrinsic parameters, extrinsic parameters and reconstructed waveform from our analysis. Our results can be summarised as follows: our posterior samples are in general consistent with what has been released by the LVK, although our credible regions tend to be larger. This is expected, since our likelihood includes additional uncertainty due to the explicit sampling over the process order, hence over the PSD. For the particular 4 seconds of data, sampled at 4096 Hz, the recovered orders are $k_{H1}=1107_{-5}^{+9}$ and $k_{L1}=1146_{-8}^{+8}$ (Fig. 9). The corresponding PSDs and uncertainties are shown in Fig. 10. The full joint posterior distribution recovered when marginalising over the AR order is shown in Appendix C.

6 Summary and discussion

We presented a case study of the application of the Maximum Entropy principle to spectral estimation. Although the methodology presented here rests on solid theoretical foundations and its merits are widely recognised, Maximum Entropy methods have yet to be adopted routinely in the study of problems related to time series. The strength of maximum entropy methods, and in particular of Burg's method, is exemplified by the closed-form estimate of the power spectral density and by the theoretical bridge between spectral analysis and AR processes. Moreover, the method presents, in our view, two main advantages when compared with more traditional ones: first, there is no need to choose an arbitrary window function to correct the data and, second, it provides a straightforward way to compute predictions given past observations. Accompanying this work, we provide a publicly available Python implementation, called memspectrum, that we used to perform the numerical studies presented in this work.

Since the order of the AR process is not determined by the theory, we opted for an in-depth investigation of several proposals in the literature and found that different loss functions are required for different situations, with the FPE loss function being the best suited to deal with gravitational wave data. Along these lines, we directly compared the PSDs computed with MESA with those from the canonical Welch algorithm. As outlined in Sec. 4, MESA provides PSD estimates with smaller variance and better accuracy than Welch's algorithm. MESA is particularly useful for short time series, where it outperforms Welch's method in both precision and confidence. As an example, Figures 4 and 5 illustrate that MESA's performance over a 10-second interval is closer to Welch's performance over a 100-second interval than to Welch's performance over a 10-second interval.

This observation suggests a promising avenue to pursue in future developments of gravitational wave data analysis: for short time series, comparable with the duration of the binary black hole signals observed by LIGO, Virgo and KAGRA, the computational cost of MESA is moderate and the inferred PSD is an accurate representation of the true underlying PSD. By applying MESA to 4 seconds of data around GW150914, we demonstrated that it is possible to simultaneously estimate the signal and noise parameters, hence effectively marginalise over the noise PSD, without the need to

  • assume a specific functional form for the PSD;

  • estimate the PSD in an off-source segment of data.

Both items are of particular interest for several reasons that we discuss in what follows. Several proposals in the literature attempt to marginalise over the PSD, mostly using a parametric model for the PSD Littenberg_2013 ; Edwards_2015 ; lalinference . MAXENT fixes the functional form for us by exploiting the correspondence with AR processes, providing a one-parameter family of models that are particularly easy to sample, thus grounding the marginalisation over the noise properties in solid theoretical foundations and in an easy-to-use numerical implementation. The second item is also particularly relevant, especially in the context of future GW detectors. Future detectors are in fact expected to operate in the signal-dominated regime, with several sources, potentially from different classes, constantly present within the detectors' data streams. In these cases the common procedure of estimating the PSD from off-source segments is bound to fail and/or provide biased inferences. MAXENT and MESA, instead, model only the segment of data under consideration and make no assumptions about what is not part of the analysis. We believe, and we will show in a future study, that using MESA can be a natural solution for computing single-source posteriors whenever multiple sources are overlapping. This is possible since, in our formulation, everything that we did not label as signal will be part of the residuals, to which we apply MESA.

Furthermore, MESA provides a simple, robust and, at least over short times, quite accurate predictor for the time series. This fact is remarkable and can be used in time series analysis for several purposes. As an example, an anomaly detection pipeline could be built using the forecasts of MESA: the predictions can form a baseline to compare the actual observations with. Whenever the observed data fall outside the expectations, an anomaly detection can be claimed. Of course, such predictions can be made with a more accurate (perhaps nonlinear) model; however, MESA has the advantage of being simple and fast to construct, while providing decent predictions. At the same time, several instruments present gaps in their data stream; for instance, LISA is expected to show such gaps (e.g. lisa_gaps and references therein). MESA's forecasting capabilities could be used to fill those gaps with data predicted from past observations. In conclusion, we reiterate that MESA is a theoretically sound, computationally feasible and reliable way of studying the properties of stochastic processes, and we hope that the investigations presented in this work will further stimulate developments and applications of this method.
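The anomaly detection idea mentioned above might look as follows in its simplest form. This is a hypothetical sketch, not an existing pipeline: the AR(2) coefficients are assumed known (in practice they would come from MESA), the forecast is the one-step-ahead linear prediction, and the injected glitch and the $5\sigma$ threshold are arbitrary choices.

```python
import numpy as np

def one_step_forecast(x, phi):
    """Predict x_t from the previous len(phi) samples of the AR model
    x_t = sum_i phi[i] * x_{t-1-i} + e_t, for every t >= len(phi)."""
    x = np.asarray(x, dtype=float)
    k = len(phi)
    pred = np.zeros(len(x) - k)
    for i, c in enumerate(phi):
        pred += c * x[k - 1 - i : len(x) - 1 - i]
    return pred

rng = np.random.default_rng(4)
n, sigma = 2000, 1.0
phi = np.array([0.75, -0.5])

# Simulate the stationary background process
eps = sigma * rng.standard_normal(n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = phi[0] * x[t - 1] + phi[1] * x[t - 2] + eps[t]

x[500] += 10.0 * sigma   # injected glitch at a known sample

# Compare the observations with the forecasts and flag large deviations
errors = x[len(phi):] - one_step_forecast(x, phi)
flagged = np.flatnonzero(np.abs(errors) > 5.0 * sigma) + len(phi)
print(flagged)
```

The glitch is flagged at the sample where it was injected. Neighbouring samples may be flagged as well, since the contaminated observation enters the forecasts that follow it.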

Acknowledgments

We are grateful to S. Biscoveanu, D. Laghi, M. Maugeri, C. Rossi and S. Shore for useful comments and discussions.
This research has made use of data, software and/or web tools obtained from the Gravitational Wave Open Science Center (https://www.gw-openscience.org/), a service of LIGO Laboratory, the LIGO Scientific Collaboration and the Virgo Collaboration. LIGO Laboratory and Advanced LIGO are funded by the United States National Science Foundation (NSF) as well as the Science and Technology Facilities Council (STFC) of the United Kingdom, the Max-Planck-Society (MPS), and the State of Niedersachsen/Germany for support of the construction of Advanced LIGO and construction and operation of the GEO600 detector. Additional support for Advanced LIGO was provided by the Australian Research Council. Virgo is funded, through the European Gravitational Observatory (EGO), by the French Centre National de Recherche Scientifique (CNRS), the Italian Istituto Nazionale di Fisica Nucleare (INFN) and the Dutch Nikhef, with contributions by institutions from Belgium, Germany, Greece, Hungary, Ireland, Japan, Monaco, Poland, Portugal, Spain.

References

  • (1) L.S. Finn, Physical Review D 46(12), 5236–5249 (1992). DOI 10.1103/physrevd.46.5236. URL http://dx.doi.org/10.1103/PhysRevD.46.5236
  • (2) P. Welch, IEEE Transactions on audio and electroacoustics 15(2), 70 (1967)
  • (3) N.R. Lomb, Astrophysics and Space Science 39(2), 447 (1976). DOI 10.1007/BF00648343
  • (4) J.D. Scargle, The Astrophysical Journal 263, 835 (1982). DOI 10.1086/160554
  • (5) N.J. Cornish, T.B. Littenberg, Classical and Quantum Gravity 32(13), 135012 (2015). DOI 10.1088/0264-9381/32/13/135012. URL http://dx.doi.org/10.1088/0264-9381/32/13/135012
  • (6) T.B. Littenberg, N.J. Cornish, Physical Review D 91(8) (2015). DOI 10.1103/physrevd.91.084034. URL http://dx.doi.org/10.1103/PhysRevD.91.084034
  • (7) E.T. Jaynes, Physical Review 106, 620 (1957)
  • (8) E. Jaynes, G. Bretthorst, Probability Theory: The Logic of Science (Cambridge University Press, 2003)
  • (9) E.T. Jaynes, Proceedings of the IEEE 70(9), 939 (1982). DOI 10.1109/PROC.1982.12425
  • (10) J. Burg, Maximum Entropy Spectral Analysis. Stanford Exploration Project (Stanford University, 1975). URL https://books.google.it/books?id=Xug_AAAAIAAJ
  • (11) C.E. Shannon, Bell System Technical Journal 27(3), 379 (1948)
  • (12) R.T. Cox, American Journal of Physics 14(1), 1 (1946). DOI 10.1119/1.1990764
  • (13) P. Gregory, Multivariate Gaussian from maximum entropy (Cambridge University Press, 2005), p. 450–454. DOI 10.1017/CBO9780511791277.020
  • (14) R.M. Gray, Toeplitz and Circulant Matrices: A Review (Now Foundations and Trends, 2006). DOI 10.1561/0100000006
  • (15) D.W.F. P. M. Woodward, W. Higinbotham, Probability and information theory, with applications to radar, 2nd edn. (Pergamon Press, 1964). URL http://cds.cern.ch/record/2031792
  • (16) B. Allen, W.G. Anderson, P.R. Brady, et al., Phys. Rev. D 85, 122006 (2012). DOI 10.1103/PhysRevD.85.122006. URL https://link.aps.org/doi/10.1103/PhysRevD.85.122006
  • (17) J.G. Ables, Astronomy and Astrophysics Supplement 15, 383 (1974)
  • (18) M. Bartlett, Louvain Economic Review 34(2), 227–227 (1968). DOI 10.1017/S077045180004077X
  • (19) T.J. Ulrych, T.N. Bishop, Reviews of Geophysics 13(1), 183 (1975). DOI 10.1029/RG013i001p00183. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/RG013i001p00183
  • (20) H. Wold, Journal of the Institute of Actuaries 70(1), 113–115 (1939). DOI 10.1017/S0020268100011574
  • (21) J.G. Berryman, GEOPHYSICS 43(7), 1384 (1978)
  • (22) H. Akaike, Annals of the Institute of Statistical Mathematics pp. 137–151 (1998). DOI 10.1007/978-1-4612-1694-0_11. URL https://doi.org/10.1007/978-1-4612-1694-0_11
  • (23) S. Kay, Modern Spectral Estimation. Prentice-Hall signal processing series (Prentice-Hall, 1988)
  • (24) E. Cuoco, G. Calamai, L. Fabbroni, et al., Classical and Quantum Gravity 18(9), 1727–1751 (2001). DOI 10.1088/0264-9381/18/9/309. URL http://dx.doi.org/10.1088/0264-9381/18/9/309
  • (25) A.R. Rao, R.L. Kashyap, L. Mao, Water Resources Research 18(4), 1097 (1982). DOI 10.1029/WR018i004p01097. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/WR018i004p01097
  • (26) R.J. Bhansali, Annals of Statistics. 14(1), 315 (1986). DOI 10.1214/aos/1176349858. URL https://doi.org/10.1214/aos/1176349858
  • (27) N. Levinson, Journal of Mathematics and Physics 25(1-4), 261 (1946)
  • (28) A.J. Owens, Journal of Geophysical Research: Space Physics 83(A4), 1673 (1978)
  • (29) B. Abbott, R. Abbott, T. Abbott, et al., Physical Review X 9(3) (2019). DOI 10.1103/physrevx.9.031040. URL http://dx.doi.org/10.1103/PhysRevX.9.031040
  • (30) B. Abbott, R. Abbott, T. Abbott, et al. LIGO Document P1900011-Power Spectral Densities (PSD) release for GWTC-1. LIGO Document Service: https://dcc.ligo.org/LIGO-P1900011/public (2019)
  • (31) N.J. Cornish, T.B. Littenberg, B. Bécsy, et al., Phys. Rev. D 103(4), 044006 (2021). DOI 10.1103/PhysRevD.103.044006
  • (32) K. Chatziioannou, C.J. Haster, T.B. Littenberg, et al., Physical Review D 100(10) (2019). DOI 10.1103/physrevd.100.104004. URL http://dx.doi.org/10.1103/PhysRevD.100.104004
  • (33) C.R. Harris, K.J. Millman, S.J. van der Walt, et al., Nature 585, 357–362 (2020). DOI 10.1038/s41586-020-2649-2
  • (34) P. Virtanen, R. Gommers, T.E. Oliphant, et al., Nature Methods 17, 261 (2020). DOI 10.1038/s41592-019-0686-2
  • (35) J. Veitch, V. Raymond, B. Farr, et al., Phys. Rev. D 91, 042003 (2015). DOI 10.1103/PhysRevD.91.042003. URL https://link.aps.org/doi/10.1103/PhysRevD.91.042003
  • (36) B. Abbott, R. Abbott, T. Abbott, et al., Physical Review Letters 116 (2016). DOI 10.1103/PhysRevLett.116.061102
  • (37) S. Schmidt, M. Breschi, R. Gamba, et al., Phys. Rev. D 103(4), 043020 (2021). DOI 10.1103/PhysRevD.103.043020
  • (38) A. Nagar, S. Bernuzzi, W. Del Pozzo, et al., Phys. Rev. D 98(10), 104052 (2018). DOI 10.1103/PhysRevD.98.104052
  • (39) J. Veitch, W.D. Pozzo, M. Williams, et al. johnveitch/cpnest: Fix for python < 3.8 versioning (2020). DOI 10.5281/zenodo.4109277. URL https://doi.org/10.5281/zenodo.4109277
  • (40) R. Abbott, , O. Bulashenko, et al., Astrophys. J. Suppl. 267(2), 29 (2023). DOI 10.3847/1538-4365/acdc9f
  • (41) T.B. Littenberg, M. Coughlin, B. Farr, W.M. Farr, Physical Review D 88(8) (2013). DOI 10.1103/physrevd.88.084044. URL http://dx.doi.org/10.1103/PhysRevD.88.084044
  • (42) M.C. Edwards, R. Meyer, N. Christensen, Physical Review D 92(6) (2015). DOI 10.1103/physrevd.92.064011. URL http://dx.doi.org/10.1103/PhysRevD.92.064011
  • (43) Q. Baghi, J.I. Thorpe, J. Slutsky, et al., Phys. Rev. D 100, 022003 (2019). DOI 10.1103/PhysRevD.100.022003. URL https://link.aps.org/doi/10.1103/PhysRevD.100.022003
  • (44) T.E. Barnard. The maximum entropy spectrum and the Burg technique. Technical report no. 1: Advanced signal processing. NASA STI/Recon Technical Report N (1975)
  • (45) K. Vos. A Fast Implementation of Burg’s Algorithm. https://opus-codec.org/docs/vos_fastburg.pdf (2013)

Appendix A Details of PSD computation

A.1 MESA solution

We derive the expression for the MAXENT spectral estimator following the approach proposed by burg1975maximum . Unlike the standard approach, we do not enforce the constraints in Eq. (11) with the standard Lagrange Multipliers approach. We write instead the PSD S(f)𝑆𝑓S(f)italic_S ( italic_f ) as the Fourier Transform of the sample autocorrelation function:

S(f)=12Nyn=r¯neı2πnΔt,𝑆𝑓12𝑁𝑦superscriptsubscript𝑛subscript¯𝑟𝑛superscript𝑒italic-ı2𝜋𝑛Δ𝑡S(f)=\frac{1}{2Ny}\sum_{n=-\infty}^{\infty}\bar{r}_{n}e^{-\imath 2\pi n\Delta t},italic_S ( italic_f ) = divide start_ARG 1 end_ARG start_ARG 2 italic_N italic_y end_ARG ∑ start_POSTSUBSCRIPT italic_n = - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT over¯ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_ı 2 italic_π italic_n roman_Δ italic_t end_POSTSUPERSCRIPT , (25)

and, plugging it in the entropy gain expression eq. (10), we obtain:

ΔH=NyNylog(12Nyn=r¯neı2πfnΔt)𝑑f.Δ𝐻superscriptsubscript𝑁𝑦𝑁𝑦12𝑁𝑦superscriptsubscript𝑛subscript¯𝑟𝑛superscript𝑒italic-ı2𝜋𝑓𝑛Δ𝑡differential-d𝑓\Delta H=\int_{-Ny}^{Ny}\log\left(\frac{1}{2Ny}\sum_{n=-\infty}^{\infty}\bar{r% }_{n}e^{-\imath 2\pi fn\Delta t}\right)df.roman_Δ italic_H = ∫ start_POSTSUBSCRIPT - italic_N italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N italic_y end_POSTSUPERSCRIPT roman_log ( divide start_ARG 1 end_ARG start_ARG 2 italic_N italic_y end_ARG ∑ start_POSTSUBSCRIPT italic_n = - ∞ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∞ end_POSTSUPERSCRIPT over¯ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT italic_e start_POSTSUPERSCRIPT - italic_ı 2 italic_π italic_f italic_n roman_Δ italic_t end_POSTSUPERSCRIPT ) italic_d italic_f . (26)

Note that this expression already takes into account the constraints in eq. (11).

We now introduce a set of coefficients λssubscript𝜆𝑠\lambda_{s}italic_λ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, defined as the derivative of ΔHΔ𝐻\Delta Hroman_Δ italic_H with respect to the autocorrelation function rssubscript𝑟𝑠r_{s}italic_r start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. Explicitly they are:

λsδHδr¯s=12NyNyNyS(f)1eı2πfsΔt𝑑fsubscript𝜆𝑠𝛿𝐻𝛿subscript¯𝑟𝑠12𝑁𝑦superscriptsubscript𝑁𝑦𝑁𝑦𝑆superscript𝑓1superscript𝑒italic-ı2𝜋𝑓𝑠Δ𝑡differential-d𝑓\lambda_{s}\coloneqq\frac{\delta H}{\delta\bar{r}_{s}}=\frac{1}{2Ny}\int_{-Ny}% ^{Ny}S(f)^{-1}e^{-\imath 2\pi fs\Delta t}dfitalic_λ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ≔ divide start_ARG italic_δ italic_H end_ARG start_ARG italic_δ over¯ start_ARG italic_r end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT end_ARG = divide start_ARG 1 end_ARG start_ARG 2 italic_N italic_y end_ARG ∫ start_POSTSUBSCRIPT - italic_N italic_y end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N italic_y end_POSTSUPERSCRIPT italic_S ( italic_f ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_e start_POSTSUPERSCRIPT - italic_ı 2 italic_π italic_f italic_s roman_Δ italic_t end_POSTSUPERSCRIPT italic_d italic_f (27)

and we will show that S(f)1𝑆superscript𝑓1S(f)^{-1}italic_S ( italic_f ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT can be written as a Fourier Expansion in terms of such coefficients. Then, the determination of the values for the λssubscript𝜆𝑠\lambda_{s}italic_λ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT uniquely solves the problem of power-spectral density estimation.

Some properties for the coefficients can be worked out easily. First, since S(f)𝑆𝑓S(f)italic_S ( italic_f ) is real, the λssubscript𝜆𝑠\lambda_{s}italic_λ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT show the property

λs=λs.subscript𝜆𝑠superscriptsubscript𝜆𝑠\lambda_{s}=\lambda_{-s}^{*}.italic_λ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = italic_λ start_POSTSUBSCRIPT - italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT .

The second property is obtained by considering that the autocorrelation function $r_n$ can only be computed for a finite time interval $n \in [-N, N]$ and that the PSD estimate must not depend on the unavailable values of $r_n$: this is part of the constraint in eq. (11). This requirement can be implemented as:

$$\frac{\delta H}{\delta \bar{r}_s} = 0 \text{ for } |s| > N,$$

which means

$$\lambda_s = 0 \text{ for } |s| > N.$$

From Eq. (27) and the properties above, it follows from the properties of the Fourier transform that $S(f)^{-1}$ can be expressed as a Fourier series

$$S(f)^{-1} = \sum_{s=-N}^{N} \lambda_s e^{-\imath 2\pi f s \Delta t}.$$ (28)

Defining $z = e^{-\imath 2\pi f \Delta t}$, the previous Fourier expansion becomes a Laurent polynomial in $z$:

$$S(f)^{-1} = \lambda_0 + \sum_{s=1}^{N} \lambda_s z^{s} + \sum_{s=1}^{N} \lambda^{*}_s z^{-s}.$$ (29)

It is easy to show that if $z_0$ is a root of the polynomial, $(z_0^{*})^{-1}$ is also a root: writing $Q(z)$ for the right-hand side of (29), the property $\lambda_s = \lambda_{-s}^{*}$ gives $Q(1/z^{*})^{*} = Q(z)$. Hence for every root lying outside the unit circle there is another root inside it, and vice versa. These properties allow us to rewrite the Fourier expansion (29) as 1975STIN…7714318B :

$$S(f) = \frac{P_N \Delta t}{\left(\sum_{s=0}^{N} a_s z^{s}\right)\left(\sum_{s=0}^{N} a^{*}_s z^{-s}\right)}$$ (30)

with $a_0 = 1$ and $\Delta t$ the uniform sampling interval of the time series. The vector $(1, a_1, \dots, a_N)$ is the prediction error filter. The power spectral density $S(f)$ is uniquely determined once both the prediction error filter and the coefficient $P_N$ are computed.
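As an illustration of how Eq. (30) is used once the prediction error filter and $P_N$ are known, the following minimal Python sketch evaluates $S(f)$ on a list of frequencies. The function name `ar_psd` and the numerical values are assumptions made for this example; this is not the memspectrum interface.

```python
import cmath

def ar_psd(a, P, dt, freqs):
    """Evaluate Eq. (30) on a list of frequencies.

    a     : prediction error filter [1, a_1, ..., a_N]
    P     : the coefficient P_N
    dt    : uniform sampling interval
    freqs : frequencies at which to evaluate S(f)
    """
    psd = []
    for f in freqs:
        z = cmath.exp(-2j * cmath.pi * f * dt)   # z = exp(-i 2 pi f dt)
        poly = sum(a_s * z ** s for s, a_s in enumerate(a))
        # On the unit circle z^{-1} = z^*, so the two factors in the
        # denominator of Eq. (30) are complex conjugates of each other.
        psd.append(P * dt / abs(poly) ** 2)
    return psd

# Illustrative values only: a filter with a_1 = -0.5, P_1 = 1 and dt = 1,
# evaluated at zero frequency and at the Nyquist frequency 1 / (2 dt).
values = ar_psd([1.0, -0.5], 1.0, 1.0, [0.0, 0.5])
```

Because $|z| = 1$, the denominator reduces to a squared modulus, so the estimate is real and positive by construction.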

To compute the $a_s$, it is convenient to plug the Laurent polynomial expansion (30) for $S(f)$ into Eq. (11) and then integrate over $z$ (taking values on $\mathbb{S}^1$). In this way the equation becomes:

$$\frac{P_N}{2\pi\imath} \oint_{\mathbb{S}^1} \frac{z^{-s-1}}{\sum_{n=0}^{N} a_n z^{n} \sum_{n=0}^{N} a^{*}_n z^{-n}}\, dz = \bar{r}_s.$$ (31)

Substituting $s \to s - r$, multiplying by $a^{*}_s$ and summing over $s$, the previous equation becomes

$$\sum_{s=0}^{N} a_s \bar{r}_{s-r} = \frac{P_N}{2\pi\imath} \oint \frac{z^{r-1}}{\sum_{s=0}^{N} a_s z^{s}}\, dz$$ (32)

For a wide-sense stationary process, all the poles lie outside the unit circle, so the previous integral can be computed easily, yielding the following well-known equations:

$$\sum_{s=0}^{N} a_s \bar{r}_{r-s} = P_N \quad \text{ if } r = 0$$ (33)
$$\sum_{s=0}^{N} a_s \bar{r}_{r-s} = 0 \qquad \text{ if } r \neq 0.$$ (34)

A.2 Levinson recursion

The solution of Eqs. (33-34) fully determines the functional form of the power spectral density estimator (30). The method for solving these equations is called the Levinson-Durbin recursion doi:10.1002/sapm1946251261 and is described in the following. For each order $N$ of the iteration we define the quantities:

$$\Delta_N = \sum_{n=0}^{N} a_n \bar{r}_{N-n+1}$$ (35)
$$c_N = -\frac{\Delta_N}{P_N},$$ (36)

The Levinson recursion computes the $N$th order quantities given the $(N-1)$th order quantities:

$$P_N = P_{N-1}\left(1 - |c_{N-1}|^{2}\right)$$ (37)

and

$$\begin{pmatrix} 1 \\ a_1 \\ \vdots \\ a_{N-1} \\ a_N \end{pmatrix} = \begin{pmatrix} 1 \\ b_1 \\ \vdots \\ b_{N-1} \\ 0 \end{pmatrix} + c_{N-1} \begin{pmatrix} 0 \\ b_{N-1}^{*} \\ \vdots \\ b^{*}_1 \\ 1 \end{pmatrix}.$$ (38)

where $b$ holds the values of the $a_s$ coefficients at order $N-1$. The 0-th order element can be easily initialized by recalling that $a_0 = 1$ (always) and that $P_0$ can be determined from (33). Its value turns out to be:

$$P_0 = R(0),$$ (39)

$\Delta_0$ and $c_0$ are uniquely determined from their definitions:

$$\Delta_0 = R(1); \quad c_0 = -\frac{R(1)}{R(0)}.$$ (40)

These expressions allow us to compute $\vec{a}$ and $P_N$ to any order by simply iterating (37) and (38). Substituting them into equation (30) solves the problem of estimating the power spectral density via the maximum entropy principle. Burg's method for spectral analysis, solved via the Levinson recursion, is implemented in the released memspectrum package. A faster recursion method is described in Vos and is also available in memspectrum.
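The recursion of Eqs. (35)-(40) can be sketched in a few lines of Python. This is a minimal illustration, assuming the autocovariances $\bar{r}_0, \dots, \bar{r}_N$ are already known and that $\bar{r}_{-k} = \bar{r}_k^{*}$; the function name `levinson` and the test values are hypothetical, and the code is not the memspectrum implementation.

```python
def levinson(r, order):
    """Iterate Eqs. (35)-(38) starting from the order-0 values of Eq. (39).

    r     : autocovariances [r_0, r_1, ..., r_order]
    order : order N of the prediction error filter to reach
    Returns (a, P) with a = [1, a_1, ..., a_N] and P = P_N.
    """
    if len(r) < order + 1:
        raise ValueError("need autocovariances up to lag `order`")
    a = [1.0]   # order-0 filter: a_0 = 1
    P = r[0]    # Eq. (39): P_0 = r_0
    for n in range(order):
        # Eq. (35): Delta_n = sum_k a_k r_{n-k+1}
        delta = sum(a[k] * r[n - k + 1] for k in range(n + 1))
        c = -delta / P                               # Eq. (36)
        b = a + [0.0]                                # (1, b_1, ..., b_n, 0)
        rev = [x.conjugate() for x in reversed(b)]   # (0, b_n^*, ..., b_1^*, 1)
        a = [bi + c * ri for bi, ri in zip(b, rev)]  # Eq. (38)
        P = P * (1.0 - abs(c) ** 2)                  # Eq. (37)
    return a, P

# Illustrative check: theoretical autocovariances of the AR(1) process
# x_t = phi * x_{t-1} + e_t with unit-variance white noise e_t.
phi = 0.5
r = [phi ** k / (1.0 - phi ** 2) for k in range(3)]
a, P = levinson(r, 2)
```

For this input the recursion should recover $a_1 = -\phi$, a vanishing $a_2$ and $P_2$ equal to the noise variance, since the underlying process has order one.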

Appendix B Computational Time for both MESA methods and Welch

In this appendix we briefly report the computational times required by the MESA method (considering both the standard and Fast implementations) and by the Welch method. They are included only to give an idea of the time differences between the methods. The timings were obtained with the IPython %timeit magic command, run on a personal machine.

Computational times
Batch length | MESA std | MESA Fast | Welch
1 s | 22 ± 1.22 ms | 19.6 ± 0.62 ms | 335 ± 9.24 µs
5 s | 158 ± 21.7 ms | 42.4 ± 0.35 ms | 839 ± 4.61 µs
10 s | 187 ± 11.6 ms | 51.5 ± 3.67 ms | 1.74 ± 0.06 µs
100 s | 1.96 ± 0.34 ms | 205 ± 5.09 | 18.8 ± 0.14 ms
1000 s | 17.1 ± 0.61 ms | 1.33 ± 0.02 ms | 235 ± 3.69 ms
Table 1: Comparison of the computational times for the estimate of the power spectral densities with our implementation of MESA (both standard and Fast implementations) and Welch's method.
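Timings of this kind can also be collected outside IPython with the standard-library `timeit` module. The sketch below is only indicative: `time_call` and `dummy_estimator` are hypothetical names, the workload is a placeholder rather than a PSD estimator, and the mean and spread it reports are similar in spirit to, but not guaranteed identical with, what %timeit prints.

```python
import statistics
import timeit

def time_call(func, *args, repeat=7, number=10):
    """Return (mean, standard deviation) of the time per call, in seconds."""
    # Each entry of `totals` is the time taken by `number` consecutive calls.
    totals = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    per_call = [t / number for t in totals]
    return statistics.mean(per_call), statistics.stdev(per_call)

def dummy_estimator(data):
    # Placeholder workload (hypothetical): stands in for a PSD estimator.
    return sum(x * x for x in data) / len(data)

mean_s, std_s = time_call(dummy_estimator, [0.1 * i for i in range(1000)])
print(f"{mean_s * 1e6:.1f} ± {std_s * 1e6:.1f} µs per call")
```

Replacing `dummy_estimator` with the estimator under study, and passing data segments of different lengths, would give one row of a comparison like Table 1.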

Appendix C Full posterior distribution for GW150914

Figure 11: Full posterior distribution for GW150914 when marginalising over the AR process orders.