Fast Gibbs sampling for the local and global trend Bayesian exponential smoothing model
Abstract
In Smyl et al. [Local and global trend Bayesian exponential smoothing models. International Journal of Forecasting, 2024.], a generalised exponential smoothing model was proposed that is able to capture strong trends and volatility in time series. This method achieved state-of-the-art performance in many forecasting tasks, but its fitting procedure, which is based on the NUTS sampler, is very computationally expensive. In this work, we propose several modifications to the original model, as well as a bespoke Gibbs sampler for posterior exploration; these changes improve sampling time by an order of magnitude, thus rendering the model much more practically relevant. The new model, and sampler, are evaluated on the M3 dataset and are shown to be competitive, or superior, in terms of accuracy to the original method, while being substantially faster to run.
Keywords: Exponential smoothing; Gibbs sampling; Scale mixtures
1 Introduction
Exponential smoothing (ETS) remains a standard forecasting procedure used in practice due to its simplicity, robustness and accuracy. In its most basic version, forecasts are produced using a weighted sum of past observations, with the weights decaying exponentially in time. This basic version has been further extended to model trend and seasonality in either an additive or multiplicative form [6]; this is often referred to as the classical Holt-Winters method [27]. Many additional extensions of the classical framework exist; perhaps most notably, Gardner and McKenzie [4] proposed a damped version of the trend to make forecasts more conservative, particularly when the forecast horizon is long.
Modern implementations of the ETS model, such as those in the R forecast package [8] and the more recent fable package [17], can provide practitioners with fully automatic model selection, in which no expert knowledge is required to make forecasts. In order to facilitate the generation of probabilistic forecasts, assumptions must be made regarding the distribution of the errors, or innovations. The classical choice is to assume that the errors are normally distributed with zero mean and a constant variance over time [9].
In the existing literature, most implementations of the ETS model are approached from a frequentist perspective. However, implementations within a Bayesian framework, such as those of Andrawis and Atiya [1] and Bermúdez et al. [2, 3], have demonstrated promising results. The drawback of Bayesian approaches has traditionally been that implementation requires a certain amount of specialised expertise, specifically with regard to posterior sampling via MCMC. The development of generic Bayesian tools such as Stan [23] and JAGS [18] has eased the pain of the modelling and programming process, but Bayesian inference still sees relatively little application within the context of exponential smoothing models. This is presumably because existing Bayesian implementations have not shown large accuracy gains over frequentist implementations, while usually being extremely slow to fit, particularly as the models become more sophisticated.
The recently proposed local and global trend (LGT) exponential smoothing model [22] extends the classical ETS model to capture trends that grow faster than linear but slower than exponential, and relaxes the error assumption to accommodate non-normally distributed and heteroscedastic errors. This model has been able to achieve outstanding accuracy on well established benchmarks, attaining state-of-the-art performance on univariate forecasting tasks. However, while effective, a major issue with the LGT model is the high computational complexity of the Bayesian sampling procedure used to explore the posterior. In [22] the proposed LGT models were primarily implemented via Stan [24], with only some preliminary results for a bespoke Gibbs sampling implementation for a simplified, non-seasonal version of the model being provided. While this simplified Gibbs sampler promised speed improvements of an order of magnitude over the Stan implementations, while retaining comparable accuracy, Smyl et al. [22] did not provide details on derivation or implementation.
In this paper, we consolidate the seasonal and non-seasonal variants of LGT within a single model formulation, and then extend the preliminary Gibbs sampling procedure to handle this unified model. We also provide all derivations and details required for implementation. Comprehensive experiments performed on the M3 competition benchmarking dataset [14] demonstrate that the proposed Gibbs sampler is not only highly accurate but, crucially, orders of magnitude faster than the original Stan implementations. This dramatic speedup renders the Bayesian local and global trend exponential smoothing model usable in practice, yielding a procedure with acceptable time complexity that achieves state-of-the-art accuracy in many forecasting tasks. Moreover, through the novel use of the powerful horseshoe shrinkage prior for estimation of the seasonality adjustments, the resulting procedure is highly robust to the potential misspecification of seasonality. In addition to the original Stan implementation, this newly proposed Gibbs sampler is available in the R Rlgt package on GitHub (https://github.com/cbergmeir/Rlgt). This package provides a complete implementation of the proposed procedure, with detailed documentation and comprehensive examples.
This paper is structured as follows. In Section 2, we review the LGT model originally proposed in Smyl et al. [22]. In Section 3, we review the basics of Bayesian inference and Markov chain Monte Carlo (MCMC) sampling approaches for posterior approximation. Section 4 introduces the modified LGT model, and Section 5 details the proposed Gibbs sampling procedure for exploring the posterior distribution. A comprehensive experimental study on the M3 dataset is provided in Section 6. We further discuss the robustness of seasonal priors under different scenarios with an ablation study in Section 7. Section 8 concludes our work.
2 The local and global trend model
Let $y_t$ denote the realisation of the time series at time $t$. Then, Smyl et al. [22] define the local and global trend exponential smoothing model as:
(1) |
where
(2) | ||||
(3) | ||||
(4) | ||||
(5) |
Here, the Student $t$-distribution is parameterised by its degrees of freedom, location and scale. Table 1 details the parameters of this model and their interpretations. The one-step-ahead forecast is formed as a linear combination of the (smoothed) level value and the local trend at the previous time step. The LGT model extends the classical non-seasonal, (damped) linear trend ETS model in three major ways. First, in contrast to the classical choice of normally distributed errors, the values of the series under the LGT model are instead assumed to follow a Student $t$-distribution whose location is the one-step-ahead forecast and whose scale may change over time. The additional degrees-of-freedom parameter controls the heaviness of the tails of the distribution; as the degrees of freedom tend to infinity, the $t$-distribution converges to a normal distribution, and as they decrease the tails of the $t$-distribution become increasingly heavy. Such a generalised, heavy-tailed distribution allows the LGT model to better capture the volatility in a time series, and provides resistance to the influence of outliers in the series. The second important difference from the classical ETS model is the introduction of the “global” trend term used when forming the one-step-ahead forecast (2). The linear weight and power parameters of this term are constant over the entire series, and in this sense are global to the series. The global trend term is a generalisation of the linear and exponential trends [22], and has been demonstrated to perform well in capturing trends that grow faster than linear but slower than exponential (when the power parameter lies strictly between zero and one). This term can also model the damped trend that is popular in forecasting [4] if the power parameter is taken to be negative. The third difference from the classical ETS models is the introduction of heteroscedasticity through the use of a dynamic scale term, given by (5), which is formed from a scaled, powered version of the prediction plus an offset term.
In practice, the scale of errors is very likely to vary with time, and (5) accommodates the possibility of a larger scale of error for larger values of the series, with the rate of growth controlled by the power parameter of the scale term.
Description | |
---|---|
degrees-of-freedom parameter of the Student $t$-distribution | |
coefficient of the global trend | |
power coefficient of the global trend, in | |
damping coefficient of the local trend, in | |
level smoothing parameter, in | |
local trend smoothing parameter, in | |
seasonality smoothing parameter, in | |
coefficient of the size of error, positive | |
power coefficient of the size of error, in | |
minimum value of the size of error, positive | |
initial local trend | |
initial seasonality, positive, i = 1,…, m |
A seasonal version of the LGT, called the seasonal global trend (SGT) model, was also introduced in Smyl et al. [22]. Under the SGT, the time series is again modelled using a Student $t$-distribution, as per (1), but the one-step-ahead forecasts and scales are modified to handle multiplicative seasonality effects:
(6) | ||||
(7) | ||||
(8) | ||||
(9) |
with
(10) |
The SGT model, whose parameters are also described in Table 1, is an extension of the classical Holt-Winters model that possesses the improvements discussed above for the non-seasonal version; it is presented in Smyl et al. [22] as a separate model from the non-seasonal LGT. For simplicity, the local trend component is not included in (6) when forming the one-step-ahead forecasts, as empirical evidence suggested this term provided no benefit when forecasting seasonal series. The seasonality terms are multiplicative factors, and their overall effect should not change the scale of the data; the sum constraint (10) is introduced to ensure this. We refer the reader to Smyl et al. [22] for a thorough discussion of the LGT and SGT models and their parameter spaces.
While these models are flexible and powerful extensions of the classical ETS model that have demonstrated state-of-the-art performance in forecasting benchmarks, they are also substantially more complex and contain a number of additional free parameters that must be fitted. As the series to which these techniques can be applied may often be short, a Bayesian framing of the problem was used in Smyl et al. [22] for model fitting and forecasting. Markov chain Monte Carlo (MCMC) sampling via the generic sampling tool “Stan” was used for posterior approximation. This process is computationally expensive, meaning that the overall fitting time, even for short series, is often prohibitively long. This is the primary weakness of the LGT/SGT models in comparison to other forecasting techniques. It is partly due to the use of a generic tool that cannot directly exploit properties of the model, and partly due to the model formulation, which introduces additional dependencies between model parameters; such dependencies are known to have detrimental effects on posterior exploration via MCMC. To address this weakness, this paper proposes a unified, modified LGT/SGT model and an accompanying Gibbs sampler that dramatically speeds up the MCMC sampling process. The next section discusses the fundamentals of Bayesian inference and MCMC sampling procedures, in preparation for the formal introduction of the proposed sampler.
3 Markov Chain Monte Carlo: A Brief Review
In the Bayesian framework, the parameters $\theta$ of the model are assumed to be random variables that follow a prior distribution $\pi(\theta)$. The posterior distribution of $\theta$, after seeing the sample data $\mathbf{y}$, is given by
$$p(\theta \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \theta)\, \pi(\theta)}{p(\mathbf{y})},$$
where $p(\mathbf{y} \mid \theta)$ is the likelihood function and
$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \theta)\, \pi(\theta)\, d\theta$$
is the marginal probability of the data. From the posterior distribution, one can obtain full information about the model parameters. However, the high-dimensional integral required to compute the normalising term $p(\mathbf{y})$ is usually intractable, so in practice a simulation approach, such as Markov chain Monte Carlo (MCMC), is frequently used to approximate the posterior distribution. In the MCMC approach the posterior is approximated via a set of samples, say $\theta^{(1)}, \ldots, \theta^{(M)}$, that are randomly drawn (simulated) from the posterior distribution. A key strength of the MCMC approach is that it is simulation consistent, in the sense that the sample distribution will converge almost surely to the exact posterior distribution as $M \to \infty$. Estimates of parameters or other posterior quantities such as intervals can readily be obtained from the collection of posterior samples.
Recently, powerful generic Bayesian tools such as Stan have become available for Bayesian modelling and posterior exploration. These allow the non-specialist to define a Bayesian hierarchy and obtain posterior samples via the Hamiltonian Monte Carlo approach. However, this generality comes at a price, as the No-U-Turn sampler (NUTS) [5] used by Stan can be computationally expensive for even moderate numbers of model parameters. This can lead to low sampling efficiency relative to run-time, particularly if some of the model parameters exhibit a high statistical dependency. In contrast, by exploiting the specific statistical properties of a problem one can often apply computationally cheap algorithms, such as Gibbs sampling, for relatively sophisticated models. As such, depending on the structure of the model and prior distributions in question, generic Bayesian tools may be unnecessarily computationally costly, and it may be possible to obtain large computational speed-ups by developing bespoke sampling algorithms. We will now briefly review several key random number generation algorithms which are often used as building blocks for sampling algorithms.
3.1 Gibbs sampling
Gibbs sampling is a key MCMC algorithm. The idea underlying Gibbs sampling is that we may sample from a joint density by iteratively sampling each random variable (in our case, model parameters) from its conditional density [20]. This means that instead of sampling from a high-dimensional joint distribution directly, the sampling process reduces to sampling from a sequence of conditional posteriors that are potentially easier to sample from. For example, if we want to sample from a posterior distribution $p(\theta_1, \theta_2, \theta_3 \mid \mathbf{y})$ given data $\mathbf{y}$, then we may instead iteratively sample from $p(\theta_1 \mid \theta_2, \theta_3, \mathbf{y})$, $p(\theta_2 \mid \theta_1, \theta_3, \mathbf{y})$ and $p(\theta_3 \mid \theta_1, \theta_2, \mathbf{y})$ (the order in which we choose to sample is irrelevant). Such a process allows us a free choice of sampling algorithm for each of the conditional distributions, and is most efficient when the conditional distributions for the parameters can be identified as well-studied distributions for which efficient random sampling algorithms exist. A weakness of Gibbs sampling is the high degree of dependency that is often present in the random samples, particularly if the parameters exhibit a high degree of statistical dependency. Variants of the basic Gibbs sampler have been introduced to mitigate this problem; specifically grouping or collapsing [12]. When sampling the joint distribution of $\theta_1$ and $\theta_2$ is possible, we can group $\theta_1$ and $\theta_2$, and the sampling process becomes
1. Sample $\theta_1$, $\theta_2$ from $p(\theta_1, \theta_2 \mid \theta_3, \mathbf{y})$;
2. Sample $\theta_3$ from $p(\theta_3 \mid \theta_1, \theta_2, \mathbf{y})$.
This will generally act to reduce correlation in the MCMC chain. Alternatively, if the marginal distribution $p(\theta_1, \theta_2 \mid \mathbf{y})$, obtained by integrating out the auxiliary variable $\theta_3$, is easy to sample from, one can implement a collapsed Gibbs sampler:
1. Sample $\theta_1$ from $p(\theta_1 \mid \theta_2, \mathbf{y})$;
2. Sample $\theta_2$ from $p(\theta_2 \mid \theta_1, \mathbf{y})$;
3. Sample the auxiliary variable $\theta_3$ from $p(\theta_3 \mid \theta_1, \theta_2, \mathbf{y})$.
In this case, by integrating out parameters we can again reduce correlation in the resulting Markov chain. In this paper, a bespoke sampler is developed based on Gibbs sampling. A majority of the conditional posteriors can be written as recognisable distributions that are straightforward to sample from using standard random sampling algorithms. For those conditional distributions for which this is not the case, we utilise either the Metropolis-Hastings algorithm or the grid sampling algorithm to generate random samples.
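To make the basic Gibbs iteration concrete, the following minimal sketch samples from a toy target that is not part of the LSGT model: a standard bivariate normal with correlation `rho`, whose two full conditionals are univariate normals. The function names and the choice of target are assumptions made purely for illustration.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    cond_sd = math.sqrt(1.0 - rho * rho)  # sd of each full conditional
    x, y = 0.0, 0.0
    draws = []
    for i in range(burn_in + n_samples):
        # x | y ~ N(rho * y, 1 - rho^2)
        x = rng.gauss(rho * y, cond_sd)
        # y | x ~ N(rho * x, 1 - rho^2)
        y = rng.gauss(rho * x, cond_sd)
        if i >= burn_in:
            draws.append((x, y))
    return draws


def sample_correlation(draws):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(draws)
    mx = sum(d[0] for d in draws) / n
    my = sum(d[1] for d in draws) / n
    sxy = sum((d[0] - mx) * (d[1] - my) for d in draws)
    sxx = sum((d[0] - mx) ** 2 for d in draws)
    syy = sum((d[1] - my) ** 2 for d in draws)
    return sxy / math.sqrt(sxx * syy)


draws = gibbs_bivariate_normal(rho=0.8, n_samples=20_000)
est_rho = sample_correlation(draws)
print(f"estimated correlation: {est_rho:.3f}")
```

The stronger the correlation between the two coordinates, the more slowly this chain moves, which is the dependency problem that grouping and collapsing are intended to reduce.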
3.1.1 Scale Mixtures
As noted, Gibbs sampling is most effective when the conditional distributions can be identified as well-studied distributions. Usually, this is only the case when the prior and likelihood are conjugate, which in turn requires that the distributions involved be members of the exponential family. When non-exponential family distributions are used, it is still possible to retain conditional conjugacy through the use of continuous mixtures. The most common mixture representation is known as the scale-mixture-of-normals. A density $p(x)$ is representable as a scale-mixture-of-normals if it can be written as
$$p(x) = \int_0^\infty \phi(x \mid \mu, \lambda)\, \pi(\lambda)\, d\lambda,$$
where $\phi(x \mid \mu, \lambda)$ denotes the probability density of a normal distribution with mean $\mu$ and standard deviation $\lambda$, and $\pi(\lambda)$ is a mixing density. For example, the Student $t$-distribution can be expressed as a normal-inverse-gamma mixture [11]. That is, if a random variable $y$ follows a Student $t$-distribution with degrees of freedom $\nu$, location $\mu$ and scale $\sigma$, i.e.,
$$y \sim t_\nu(\mu, \sigma),$$
then the density can be equivalently expressed as the following mixture,
$$y \mid \omega^2 \sim N(\mu, \omega^2 \sigma^2), \tag{11}$$
$$\omega^2 \sim \mathrm{IG}(\nu/2, \nu/2), \tag{12}$$
where $N(\mu, \omega^2 \sigma^2)$ denotes a normal distribution with mean $\mu$ and variance $\omega^2 \sigma^2$, and $\mathrm{IG}(a, b)$ denotes the inverse-gamma distribution with shape $a$ and scale $b$. Here, the variable $\omega^2$ is often referred to as a latent or auxiliary variable. The Student $t$-distribution density can be recovered by marginalising over $\omega^2$. While the introduction of a latent variable into a Bayesian hierarchy increases the number of random variables that need to be sampled, it brings the considerable advantage that the Student $t$-distribution, which is not a member of the exponential family, is now representable as a mixture of exponential family distributions. This opens the possibility of conditional conjugacy by choice of appropriate prior distributions, which itself substantially facilitates efficient Gibbs sampling. We use this technique extensively in this paper when constructing the Gibbs sampler in Section 5.
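The normal-inverse-gamma representation of the Student $t$-distribution can be checked by simulation. The sketch below is only an illustration under assumed parameter values (`nu = 10`, `mu = 1`, `sigma = 2`): it draws the latent variance from an inverse-gamma distribution with shape and scale both equal to half the degrees of freedom, then draws from the conditional normal, and compares the sample variance with the known variance of a Student $t$-distribution. The function name is hypothetical.

```python
import random


def sample_student_t_mixture(nu, mu, sigma, n, seed=0):
    """Draw from a Student t via its normal / inverse-gamma scale mixture."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # omega2 ~ IG(nu/2, nu/2) is the reciprocal of a Gamma(shape=nu/2, rate=nu/2)
        # draw; random.gammavariate takes (shape, scale), so the scale is 2/nu.
        omega2 = 1.0 / rng.gammavariate(nu / 2.0, 2.0 / nu)
        # y | omega2 ~ N(mu, omega2 * sigma^2); gauss expects a standard deviation.
        draws.append(rng.gauss(mu, sigma * omega2 ** 0.5))
    return draws


nu, mu, sigma = 10.0, 1.0, 2.0  # illustrative values only
draws = sample_student_t_mixture(nu, mu, sigma, n=100_000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
theoretical_var = sigma ** 2 * nu / (nu - 2.0)  # variance of a Student t, nu > 2
print(f"sample variance {var:.3f}, theoretical variance {theoretical_var:.3f}")
```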
3.2 The Metropolis-Hastings algorithm
The Metropolis-Hastings algorithm is a versatile and powerful sampling method that is particularly useful when direct sampling from the target distribution is difficult. Given a target density $p(\theta)$ and a proposal distribution $q(\theta' \mid \theta)$, samples can be generated according to the following procedure [19]:
1. Generate a proposal $\theta'$ from $q(\theta' \mid \theta^{(k)})$;
2. Generate $u \sim U(0, 1)$, and accept $\theta'$ (i.e., set $\theta^{(k+1)} = \theta'$) if
$$u < \frac{p(\theta')\, q(\theta^{(k)} \mid \theta')}{p(\theta^{(k)})\, q(\theta' \mid \theta^{(k)})};$$
otherwise, set $\theta^{(k+1)} = \theta^{(k)}$.
Here, the quantity $\theta^{(k)}$ denotes the $k$-th sample in the Markov chain. A very common choice of proposal distribution is
$$q(\theta' \mid \theta) = N\big(\mathbf{m}_\delta(\theta), \mathbf{S}_\delta(\theta)\big),$$
where $N(\cdot, \cdot)$ denotes a multivariate normal distribution, with $\mathbf{m}_\delta(\cdot)$ and $\mathbf{S}_\delta(\cdot)$ functions that determine the mean and covariance of the multivariate normal distribution, respectively. The parameter $\delta > 0$ is often called a “step-size”, and usually controls the overall scale of the proposal; this generally needs to be chosen so that the Metropolis-Hastings procedure yields an acceptance rate around 50% to 60%. In this paper we use a specific variation of Metropolis-Hastings, derived from the algorithm in Titsias and Papaspiliopoulos [25], in which the proposal mean is determined using the gradient of the negative log-likelihood and the proposal covariance is determined based on both the step-size and the curvature of the prior distribution. The step-size is automatically tuned using the algorithm presented in Schmidt and Makalic [21].
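The accept/reject step can be sketched with the simplest member of this family, a random-walk proposal centred on the current state. This is not the gradient-assisted variant of [25] used in this paper; it is a plain random-walk Metropolis sampler on an assumed one-dimensional normal target, and all names and values below are hypothetical. Because the proposal is symmetric, the proposal densities cancel in the acceptance ratio.

```python
import math
import random


def random_walk_metropolis(log_target, x0, step_size, n_samples, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    rng = random.Random(seed)
    x, log_p = x0, log_target(x0)
    chain, n_accept = [], 0
    for _ in range(n_samples):
        proposal = rng.gauss(x, step_size)   # symmetric proposal around x
        log_p_prop = log_target(proposal)
        u = 1.0 - rng.random()               # uniform on (0, 1], avoids log(0)
        # Accept if u < p(proposal) / p(x); the proposal terms cancel by symmetry.
        if math.log(u) < log_p_prop - log_p:
            x, log_p = proposal, log_p_prop
            n_accept += 1
        chain.append(x)                      # on rejection the old state repeats
    return chain, n_accept / n_samples


def log_normal_target(x, mean=1.0, sd=2.0):
    """Unnormalised log-density of an illustrative N(1, 2^2) target."""
    return -0.5 * ((x - mean) / sd) ** 2


chain, accept_rate = random_walk_metropolis(
    log_normal_target, x0=0.0, step_size=4.0, n_samples=50_000
)
kept = chain[1000:]  # discard a short burn-in
post_mean = sum(kept) / len(kept)
print(f"acceptance rate {accept_rate:.2f}, posterior mean {post_mean:.2f}")
```

Increasing `step_size` lowers the acceptance rate and decreasing it raises the rate, which is the trade-off an automatic step-size tuner is designed to manage.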
3.3 Grid sampling
A finite grid approximation (“grid sampling”) is a simple and fast way to approximate a posterior distribution [16]. Generating a sample using a grid sampler consists of the following steps. First, generate a finite set of candidates, say $G = \{\theta_1, \ldots, \theta_K\}$, from the parameter space $\Theta$, and compute the corresponding probability density of the conditional posterior distribution, $p(\theta_j \mid \cdot)$, at each of these candidates. Then, normalise these density values and treat them as a multinomial distribution over the set $G$, i.e.,
$$P(\theta = \theta_j) = \frac{p(\theta_j \mid \cdot)}{\sum_{k=1}^{K} p(\theta_k \mid \cdot)}, \quad j = 1, \ldots, K.$$
Finally, we draw a sample from this multinomial distribution. It is important to note that samples drawn from a grid sampler will represent only a quantised approximation of the original continuous distribution; however, in many cases, this may be adequate if $K$ is chosen to be sufficiently large, or if the model is relatively insensitive to the precise value of the parameter being sampled. An advantage of grid sampling is that each draw is independent, which helps to reduce overall correlation in the Markov chain. With an appropriately chosen grid, grid sampling can be both an efficient and accurate alternative to methods such as Metropolis-Hastings or rejection sampling.
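The grid sampling steps above can be sketched in a few lines. The conditional log-density used here (a normal-shaped bump centred at 0.3) and the uniform grid on the unit interval are assumptions chosen only to exercise the code; they are not the conditionals of the LSGT model.

```python
import math
import random


def grid_sample(log_density, grid, rng):
    """Draw one value from a grid approximation to an unnormalised density."""
    log_p = [log_density(g) for g in grid]
    max_log_p = max(log_p)  # subtract the maximum for numerical stability
    weights = [math.exp(lp - max_log_p) for lp in log_p]
    # random.choices normalises the weights internally.
    return rng.choices(grid, weights=weights, k=1)[0]


def log_cond_post(theta):
    """Illustrative unnormalised conditional log-posterior."""
    return -0.5 * ((theta - 0.3) / 0.1) ** 2


grid = [i / 100.0 for i in range(101)]  # 101 uniformly spaced candidates on [0, 1]
rng = random.Random(0)
draws = [grid_sample(log_cond_post, grid, rng) for _ in range(5000)]
grid_mean = sum(draws) / len(draws)
print(f"mean of grid draws: {grid_mean:.3f}")
```

Working with log-densities and subtracting the maximum before exponentiating avoids underflow when the conditional posterior is sharply peaked.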
4 The Local-Seasonal-Global Trend (LSGT) Model
In this work, we present a unified version of the LGT and SGT, which we call the LSGT model. In addition to unifying the two models into a single formulation, we also make several adjustments to the model specification; these are designed to reduce statistical dependency between model parameters, as well as to simplify posterior sampling. The LSGT model describes each observation as
(13) |
where
(14) | ||||
(15) | ||||
(16) | ||||
(17) | ||||
(18) |
subject to
(19) |
The parameters of this model are described in Table 2. The LSGT model includes both the seasonal and non-seasonal variants with no specific distinction between the two; instead, we can recover either variant by setting some of the model parameters to specific constants. When all the seasonality factors are set to 1, i.e., there is no seasonal modification, the LSGT model reduces to a version of the original LGT model. Setting the local trend coefficient to zero yields the seasonal version of the LSGT.
The LSGT model also makes several changes in model formulation relative to the LGT and SGT models of Smyl et al. [22] discussed in Section 2. An important modification is in the way in which heteroscedasticity is incorporated into the model. In the LSGT model, the conditional scale, given by (18), depends on the global level rather than on the one-step-ahead forecast as in the original LGT/SGT models. This has two effects: (i) it decouples the scale from the location (forecast) in the Student $t$-distribution, reducing correlation between the parameters that determine them; and (ii) the conditional distribution for the trend weights reduces to a linear regression. A second point of difference is that in the original LGT/SGT formulation, the heteroscedasticity is handled by summing the standard deviations of the homoscedastic and time-varying components. This formulation is somewhat unnatural, as the variances of sums of random variables are additive, rather than their standard deviations. This formulation also introduces substantial correlation between the standard deviation of the homoscedastic component and the scale of the heteroscedastic component, as both must be adjusted simultaneously to maintain the same overall scale of errors. In contrast, from (18) it is clear that the LSGT model directly models the conditional variance of each observation as a scaled mixture of the homoscedastic and heteroscedastic terms. One parameter controls the overall scale of the error terms, while the mixing parameter controls how much each of the homoscedastic and heteroscedastic terms contributes to the variance, with the model reducing to a purely heteroscedastic form at one end of the mixing parameter's range. This formulation has two benefits: (i) it substantially reduces the correlation between the parameters that determine the conditional scale, and (ii) it allows us to easily utilise a scale-mixture representation of the Student $t$-distribution to simplify sampling. The final modification relates to the way in which the seasonality adjustments are handled. 
As these quantities appear as multiplicative factors when forecasting the level (15), they are smoothed on the logarithmic scale in the LSGT model, as per (17). As the seasonal factors should not introduce an overall change in scale, the sum-constraint (19) ensures that they have a zero sum in the logarithmic scale, or equivalently, that their product is equal to one.
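The equivalence between a zero sum on the logarithmic scale and a unit product on the original scale can be illustrated with a small helper that centres the log-factors. This is only a sketch of the constraint itself, with a hypothetical function name and made-up factor values; it is not a description of how the sampler enforces (19).

```python
import math


def normalise_seasonal_factors(factors):
    """Rescale positive seasonal factors so that their logarithms sum to zero."""
    log_factors = [math.log(s) for s in factors]
    mean_log = sum(log_factors) / len(log_factors)
    # Centring the logs is the same as dividing by the geometric mean.
    return [math.exp(lf - mean_log) for lf in log_factors]


raw = [0.8, 1.1, 1.3, 0.9]  # illustrative quarterly factors
seasonal = normalise_seasonal_factors(raw)
log_sum = sum(math.log(s) for s in seasonal)
product = math.prod(seasonal)
print(f"sum of logs: {log_sum:.2e}, product: {product:.6f}")
```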
Description | |
---|---|
degrees-of-freedom parameter of the Student $t$-distribution | |
coefficient of the global trend | |
power coefficient of the global trend, in | |
damping coefficient of the local trend, in | |
level smoothing parameter, in | |
local trend smoothing parameter, in | |
seasonality smoothing parameter, in | |
scale of error, positive, constant for each time period | |
mixture of homoscedastic error and heteroscedastic error parameter, in | |
power coefficient of the heteroscedastic error, in | |
initial local trend | |
initial seasonality, positive, i = 1,…, m |
4.1 Prior distributions
As we are using a Bayesian approach to learn the LSGT model we require the specification of suitable prior distributions over all model parameters. To avoid our choice of prior distributions introducing a strong estimation bias we choose to use weakly informative priors where appropriate. The overall error scale is assigned a standard uninformative scale-invariant prior . The coefficients and , and the initial value of the local trend , are all assigned weakly informative Cauchy prior distributions:
This choice of prior distribution a priori preferences smaller values of the coefficients, while still allowing large values to be a priori plausible. By default we take and , allowing the prior distributions for and to automatically adapt to the scale of the time series. The smoothing parameters are defined on , and are assigned beta prior distributions
The default choice of hyperparameters is and ; this distribution masses more prior probability near (say) than . This is appropriate as small changes to when is close to one result in much larger changes in model response than similar magnitude changes when is close to zero. The heteroscedastic mixing parameter is assigned a uniform distribution on .
The two power parameters are sampled using a grid sampler (see Section 3.3). We use a uniformly spaced grid of candidate values for both parameters (over the range of permissible values, see Table 2). The degrees-of-freedom parameter is also grid sampled; however, in this case a simple uniform spacing is inappropriate. This is because the change in the behaviour of the $t$-distribution as the degrees of freedom vary is not uniform on the real line, i.e., the effect of increasing the degrees of freedom by some fixed amount depends on their current value (a unit increase matters far more when the degrees of freedom are small than when they are large). Taking this into account, we choose the candidates in the degrees-of-freedom grid so that the symmetric Kullback–Leibler (KL) divergence [10] between all neighbouring pairs in the candidate set is equal, i.e., all neighbouring $t$-distributions are equally “distant” in terms of symmetric KL divergence.
Instead of using Cauchy priors as in Smyl et al. [22], we assign the initial seasonal factors horseshoe priors. The prior hierarchy for the horseshoe prior is
(20) | ||||
(21) | ||||
(22) |
An important characteristic of the horseshoe prior is its infinitely tall spike (pole) at zero. This massing of prior probability at the origin means that if the true effects are zero, or close to zero, they will be aggressively shrunk away. In the case of the log-seasonal terms, a value of zero implies a seasonal factor of one, i.e., no seasonality. This property gives the LSGT model greater robustness to the misspecification of seasonality effects than the Cauchy priors used in the original SGT model.
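The combination of a spike at zero and heavy tails can be seen by simulating from a horseshoe prior and comparing it with a standard normal. The sketch below assumes the standard horseshoe construction (a normal with a half-Cauchy local scale) and, for simplicity, fixes the global scale at one rather than giving it its own prior; the function names and thresholds are hypothetical.

```python
import math
import random


def half_cauchy(rng, scale=1.0):
    """Half-Cauchy draw: |scale * tan(pi * (U - 1/2))| with U ~ Uniform(0, 1)."""
    return abs(scale * math.tan(math.pi * (rng.random() - 0.5)))


def sample_horseshoe(n, global_scale, rng):
    """Draw n values from a horseshoe prior with a fixed global scale (assumption)."""
    draws = []
    for _ in range(n):
        local_scale = half_cauchy(rng)  # one local scale per coefficient
        draws.append(rng.gauss(0.0, local_scale * global_scale))
    return draws


rng = random.Random(0)
n = 50_000
hs = sample_horseshoe(n, 1.0, rng)
normal = [rng.gauss(0.0, 1.0) for _ in range(n)]

near_zero_hs = sum(abs(x) < 0.1 for x in hs) / n
near_zero_normal = sum(abs(x) < 0.1 for x in normal) / n
tail_hs = sum(abs(x) > 3.0 for x in hs) / n
tail_normal = sum(abs(x) > 3.0 for x in normal) / n
print(f"P(|x| < 0.1): horseshoe {near_zero_hs:.3f}, normal {near_zero_normal:.3f}")
print(f"P(|x| > 3.0): horseshoe {tail_hs:.3f}, normal {tail_normal:.3f}")
```

The horseshoe places more mass than the normal both very close to zero and far out in the tails, which is what allows it to shrink negligible effects strongly while leaving large effects largely untouched.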
5 Posterior sampling for the LSGT model
We now describe a Gibbs sampler for the LSGT model (13)–(19) using the prior distributions discussed in Section 4.1.
5.1 Scale-mixture representations
The continuous scale-mixture technique described in Section 3.1.1 is employed to simplify posterior sampling. The use of scale-mixture representations allows for conditional conjugacy even in the case of non-exponential family distributions (such as the Cauchy), at the expense of the introduction of additional latent variables that must also be sampled. We use the scale-mixture-of-normals representation of the -distribution given by (11)–(12) to rewrite the response model (13) as
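Moreover, the Cauchy distribution is a special case of the Student $t$-distribution with one degree of freedom. A zero-centred Cauchy prior distribution with scale $s$ for a generic coefficient $\theta$ can therefore be written, using (11)–(12), as
$$\theta \mid \omega_\theta^2 \sim N(0, \omega_\theta^2 s^2), \quad \omega_\theta^2 \sim \mathrm{IG}(1/2, 1/2),$$
by introducing the latent variable $\omega_\theta^2$; the same representation is used for each of the Cauchy-distributed parameters. The half-Cauchy distribution, used in the horseshoe prior, can also be expressed as an inverse-gamma scale-mixture of inverse-gamma distributions [26]. Specifically, if
$$x^2 \mid a \sim \mathrm{IG}(1/2, 1/a), \quad a \sim \mathrm{IG}(1/2, 1/A^2),$$
then $x \sim C^{+}(0, A)$, a half-Cauchy distribution with scale $A$. Using this, the horseshoe prior for the seasonality factors in (20)–(22) can be written as [13],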
(23) | ||||
(24) | ||||
(25) | ||||
(26) |
where and are latent variables. For convenience the complete Bayesian LSGT hierarchy, including the scale-mixture representations is given in Appendix A.
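The inverse-gamma mixture representation of the half-Cauchy distribution can be verified numerically. The sketch below assumes the standard form of the result (a squared variable that is inverse-gamma with shape one half given an auxiliary variable, which is itself inverse-gamma with shape one half) and checks that the sample median matches the half-Cauchy scale, since the median of a half-Cauchy distribution equals its scale. The function name and the scale value are hypothetical.

```python
import random


def sample_half_cauchy_mixture(scale, n, seed=0):
    """Half-Cauchy draws via a two-level inverse-gamma mixture."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        # a ~ IG(1/2, 1/scale^2): reciprocal of a Gamma(shape=1/2, scale=scale^2) draw.
        a = 1.0 / rng.gammavariate(0.5, scale ** 2)
        # x^2 | a ~ IG(1/2, 1/a): reciprocal of a Gamma(shape=1/2, scale=a) draw.
        x2 = 1.0 / rng.gammavariate(0.5, a)
        draws.append(x2 ** 0.5)
    return draws


scale = 2.0  # illustrative value only
draws = sorted(sample_half_cauchy_mixture(scale, 100_000))
median = draws[len(draws) // 2]
print(f"sample median {median:.3f}, half-Cauchy scale {scale:.3f}")
```

Both levels of the hierarchy are inverse-gamma, so the conditional distributions of the latent variables in a Gibbs sampler remain inverse-gamma as well.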
5.2 The Gibbs sampler
Consider a time series of observations and the corresponding one-step-ahead forecasts produced by the LSGT model. We now present a Gibbs sampling procedure for sampling from the posterior of the LSGT model. The Gibbs sampler uses the scale-mixture-of-normals representation of the Student $t$-distribution in Steps 1 to 5, and integrates out the associated latent variables from Step 6 onwards. The Gibbs sampler repeatedly iterates the following steps:
1. Sample the global variance from its inverse-gamma conditional distribution. Note that if the model is homoscedastic, the conditional scale in (18) does not vary over time.
2. Sample the latent variables of the scale-mixture representation, one per observation, from their inverse-gamma conditional distributions.
3. Sample the degrees of freedom using a grid sampler (see Section 5.2.3).
4. Sample the global trend coefficient from its normal conditional distribution, in which the conditional scale is given by (18); then sample the associated latent variable from its inverse-gamma conditional distribution.
5. If we are using a non-seasonal model:
   (a) Sample the local trend coefficient from its normal conditional distribution, and sample the associated latent variable from its inverse-gamma conditional distribution.
   (b) Sample the initial local trend from its normal conditional distribution, and sample the associated latent variable from its inverse-gamma conditional distribution.
6. Sample the smoothing parameters using the gradient-assisted Metropolis-Hastings algorithm (see Section 5.2.1).
7. If we are using a seasonal model, sample the initial seasonal factors (see Section 5.2.2).
8. Sample the global trend power parameter using a grid sampler (see Section 5.2.3).
9. If we are using a heteroscedastic model, sample the heteroscedastic power parameter and the heteroscedastic mixing parameter using a grid sampler (see Section 5.2.3).
Derivations of the conditional distributions for the global and local trend coefficients, and for the initial local trend, are detailed in Appendices B and C, respectively.
5.2.1 Sampling the smoothing parameters
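We group the level and local trend smoothing parameters, together with the seasonality smoothing parameter if we are using a seasonal model, and sample them in a single step using a gradient-assisted Metropolis-Hastings algorithm [21]. As each smoothing parameter lies in the unit interval, a logistic (logit) transformation is first applied to map the parameter space onto the real line, i.e., we sample the transformed parameters rather than the smoothing parameters themselves. The latent variables are integrated out of the likelihood for better sampling convergence. The negative log-likelihood is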
(27) |
Unlike the basic Metropolis-Hastings algorithm, the gradients of the negative log-likelihood (27) with respect to the smoothing parameters are utilized to improve the efficiency of the sampler. Note that (27) depends on the smoothing parameters through the one-step-ahead predictions and scales. The gradients for the level and local trend smoothing parameters in the non-seasonal model can be calculated using the chain rule, with details provided in Appendix D. For the seasonality smoothing parameter, the gradients are time consuming to compute so we do not utilize them (i.e., we set the corresponding gradient to zero). As the underlying algorithm is a Metropolis-Hastings algorithm, the gradients can be computed approximately (or not computed at all, as in this case) without affecting the correctness of the sampling; more accurate computations simply lead to improved efficiency.
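The change of variables used for the smoothing parameters can be sketched as follows. The helper names are hypothetical, and the example density (uniform on the unit interval) is chosen only to make the Jacobian term visible; when sampling on the transformed scale, the log-density must include the log-Jacobian of the inverse transformation.

```python
import math


def logit(p):
    """Map a value in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Logistic function: map a real value back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_jacobian(z):
    """log |d p / d z| for p = inv_logit(z), which equals log(p) + log(1 - p)."""
    p = inv_logit(z)
    return math.log(p) + math.log(1.0 - p)


def log_density_on_real_line(z, log_density_unit):
    """Log-density of z = logit(p) given the log-density of p on (0, 1)."""
    return log_density_unit(inv_logit(z)) + log_jacobian(z)


# Illustration: a uniform density on (0, 1) becomes the standard logistic density.
round_trip = inv_logit(logit(0.3))
density_at_zero = math.exp(log_density_on_real_line(0.0, lambda p: 0.0))
print(f"round trip: {round_trip:.6f}, transformed density at 0: {density_at_zero:.4f}")
```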
5.2.2 Sampling the initial seasonal factors
5.2.3 Sampling with a grid sampler
A grid sampler (see Section 3.3) is implemented for sampling the degrees of freedom, the global trend power parameter, the heteroscedastic power parameter and the heteroscedastic mixing parameter. The negative log posterior for the degrees of freedom, conditional on the latent variables, is given by
where $\Gamma(\cdot)$ denotes the gamma function. We use the set of candidate values determined using the procedure in Section 4.1. When sampling the two power parameters and the heteroscedastic mixing parameter, we integrate the latent variables out of the likelihood. The negative-log conditional posteriors for these parameters are given by
The one-step-ahead forecasts and conditional scales that appear in these expressions are formed using (14) and (18), respectively. The candidate values for these three parameters are set uniformly based on the corresponding parameter limits.
6 Experiments
The proposed model extends the classical ETS model, which is a univariate forecasting procedure that does not utilize global learning across series. The M3 competition [14] provides a standard benchmark dataset for univariate methods. It consists of a mix of seasonal and non-seasonal series: 645 yearly series, 756 quarterly series, 1428 monthly series, and 174 other series. We use the M3 dataset from the Mcomp R package [7]. Table 3 summarizes the series lengths () and corresponding forecast horizon () in each category.
Category | Series length | Forecast horizon |
---|---|---|
Yearly | 14-41 | 6 |
Monthly | 48-126 | 18 |
Quarterly | 16-64 | 8 |
Other | 63-96 | 8 |
6.1 Evaluation metrics
Following the M3 competition, and the experimental analysis in Smyl et al. [22], we use the symmetric mean absolute percentage error (sMAPE) and the mean absolute scaled error (MASE) metrics to measure forecasting performance. These metrics are given by
(28) | ||||
(29) |
respectively. The denominator in (29) is the average error of the in-sample (seasonal) naïve forecasts, where denotes the periodicity; is set to one for non-seasonal series (such as yearly series), to 4 for quarterly series, and to 12 for monthly series, respectively. Probabilistic forecasts are evaluated using the mean scaled interval score (MSIS), as per Makridakis et al. [15]. The MSIS is given by
(30) |
where denotes the indicator function, which returns a one if the condition is true and a zero otherwise. The quantities and that appear in the numerator of (30) are used to denote the upper and lower bounds of the prediction interval, respectively. The quantity is the desired level of coverage; for example, if we are considering a prediction interval. The MSIS is an omnibus measure that penalises both the width of the forecasting interval and the attained coverage of the prediction interval.
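For reference, the three metrics can be computed as in the sketch below. It follows the usual M-competition definitions rather than reproducing (28)-(30) symbol by symbol, so the argument names are assumptions; in particular, `alpha` is taken here to be one minus the nominal coverage (0.1 for a 90% interval), which is the M4 convention and may differ from the notation used above.

```python
def naive_scale(train, period=1):
    """Mean absolute error of the in-sample (seasonal) naive forecast."""
    diffs = [abs(train[t] - train[t - period]) for t in range(period, len(train))]
    return sum(diffs) / len(diffs)


def smape(actual, forecast):
    """Symmetric mean absolute percentage error, on a 0-200 scale."""
    return 200.0 / len(actual) * sum(
        abs(y - f) / (abs(y) + abs(f)) for y, f in zip(actual, forecast))


def mase(actual, forecast, train, period=1):
    """Mean absolute error scaled by the in-sample naive error."""
    mae = sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)
    return mae / naive_scale(train, period)


def msis(actual, lower, upper, train, alpha, period=1):
    """Mean scaled interval score: interval width plus penalties for misses."""
    total = 0.0
    for y, lo, up in zip(actual, lower, upper):
        total += up - lo
        if y < lo:
            total += 2.0 / alpha * (lo - y)  # observation below the interval
        if y > up:
            total += 2.0 / alpha * (y - up)  # observation above the interval
    return total / len(actual) / naive_scale(train, period)
```

With `period` set to 1, 4 or 12, the scaling matches the non-seasonal, quarterly and monthly cases described above.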
6.2 Results and analysis
We consider both homoscedastic and heteroscedastic variants of LSGT using the Gibbs sampler. The left-hand columns of Table 4 present accuracy in terms of sMAPE and MASE, and the average running time per series, for both the LSGT and the original L/SGT model (with heteroscedastic errors) sampled using Stan. The running time reported is the average running time of the models on the first 100 series in each category, executed with a single core on the same machine, for maximal comparability. The LGT Stan models have previously achieved state-of-the-art performance on the M3 dataset, as reported in Smyl et al. [22]. Compared with the Stan sampler, the Gibbs implementations obtain slightly improved accuracy in most cases, with the largest sMAPE improvement on the yearly (non-seasonal) series; the exception is the yearly MASE, for which the Stan implementation is marginally better (2.48 versus 2.50 and 2.55). In terms of model fitting time, the proposed Gibbs sampler is roughly 11 to 40 times faster than the Stan implementation, depending on the category and error model, which renders the LSGT model a feasible tool for deployment in practice. In regards to the different error models, the heteroscedastic models perform at least as well as the homoscedastic models in all categories, except for the sMAPE on the yearly series.
Model | sMAPE | MASE | Avg Runtime (s) | Below 99p | Below 95p | Below 5p | Below 1p | MSIS 90p | MSIS 98p |
---|---|---|---|---|---|---|---|---|---|
Yearly series | |||||||||
LSGT Gibbs (homoscedastic error) | 14.91 | 2.55 | 3.79 | 97.44 | 90.80 | 7.80 | 2.43 | 19.36 | 40.57 |
LSGT Gibbs (heteroscedastic error) | 14.99 | 2.50 | 4.63 | 98.94 | 94.11 | 4.96 | 1.19 | 16.47 | 27.92 |
LGT Stan | 15.18 | 2.48 | 60.03 | 97.16 | 91.42 | 6.23 | 2.04 | 17.38 | 32.64 |
Monthly series | |||||||||
LSGT Gibbs (homoscedastic error) | 13.94 | 0.83 | 12.82 | 97.93 | 93.66 | 5.18 | 1.40 | 5.38 | 8.67 |
LSGT Gibbs (heteroscedastic error) | 13.76 | 0.82 | 14.67 | 98.36 | 94.64 | 4.64 | 1.24 | 5.22 | 8.52 |
SGT Stan | 13.77 | 0.83 | 163.84 | 97.51 | 92.55 | 5.21 | 1.69 | 5.10 | 8.20 |
Quarterly series | |||||||||
LSGT Gibbs (homoscedastic error) | 8.78 | 1.06 | 9.22 | 97.30 | 92.26 | 10.20 | 3.27 | 7.46 | 14.23 |
LSGT Gibbs (heteroscedastic error) | 8.78 | 1.06 | 10.70 | 97.59 | 92.97 | 8.61 | 2.12 | 7.19 | 13.06 |
SGT Stan | 8.87 | 1.07 | 374.12 | 96.13 | 90.16 | 11.79 | 4.76 | 7.64 | 15.95 |
Other series | |||||||||
LSGT Gibbs (homoscedastic error) | 4.21 | 1.70 | 5.19 | 99.64 | 97.34 | 4.02 | 0.50 | 10.8 | 17.1 |
LSGT Gibbs (heteroscedastic error) | 4.16 | 1.69 | 7.71 | 99.64 | 97.27 | 4.45 | 0.86 | 10.51 | 16.53 |
LGT Stan | 4.25 | 1.72 | 150.88 | 99.43 | 97.49 | 4.60 | 1.44 | 10.69 | 16.68 |
The right-hand side of Table 4 reports the interval coverage and the MSIS scores for the 90% and 98% prediction intervals of the two samplers. The Gibbs samplers achieve better coverage than the Stan implementation in most categories other than “Other series”. More generally, the results suggest that the M3 series tend to be better modelled using heteroscedasticity assumptions, particularly the quarterly series. It is also worth pointing out that Smyl et al. [22] commented that the L/SGT models can produce slightly narrow intervals. The intervals generated by the LSGT Gibbs sampler tend to be wider as we specify a larger space of candidate values for the degrees-of-freedom parameter in our grid. In contrast, the values used in the original Stan implementation appear to be insufficiently diverse; reducing the minimum degrees of freedom in the Stan implementation could potentially fix this problem, though sampling small values of this parameter could also make the underlying sampling algorithm quite unstable, as Stan is known to have some issues handling heavy-tailed distributions. Additionally, the implicit prior on the degrees of freedom used in the LSGT model (in which the candidates are equidistant in terms of symmetric KL divergence) would potentially be quite difficult to implement in Stan, as it does not allow for easy sampling from discrete parameter spaces. In regards to the MSIS scores, the LSGT Gibbs sampler achieves superior results when compared to the Stan version in all categories but monthly, and remains competitive even in this setting.
We additionally performed Wilcoxon signed-rank tests comparing the two proposed Gibbs variants and the original Stan model. We rank the methods based on per-series performance with respect to sMAPE, MASE, MSIS90, and MSIS98. Table 5 provides the average per-series rankings and the corresponding p-values of the tests. From Table 4, it can be seen that the Gibbs samplers generally achieved better accuracy than the original Stan L/SGT with respect to the point forecast evaluation metrics. In line with these results, Table 5 shows that the Gibbs samplers rank higher than the Stan version, although the differences are not statistically significant (all p-values are at least 0.62). In terms of interval forecasting, the Stan model ranks higher on average than both Gibbs variants, and these differences are statistically significant (p-values of at most 0.003). From Table 4, we see that the Gibbs variants achieve higher accuracy for all but the monthly series. However, the monthly series constitute approximately half of the overall M3 dataset, and it is therefore expected that the ranking results will be largely dominated by the performance on the monthly series; additionally, rankings do not take into account the degree of difference in performance, so the larger improvements of the LSGT on yearly series, for example, are not as impactful. Overall, the proposed Gibbs samplers are highly accurate and strongly competitive with, if not superior to, the original Stan implementation in terms of forecasting metrics, while being substantially faster.
Statistic | Gibbs (homo) - Stan | Gibbs (hetero) - Stan | Gibbs (homo) - Gibbs (hetero) |
---|---|---|---|
Testing metric: sMAPE | |||
Method left avg rank | 1.44 | 1.44 | 1.51 |
Method right avg rank | 1.56 | 1.56 | 1.49 |
p-value | 0.75 | 0.69 | 0.94 |
Testing metric: MASE | |||
Method left avg rank | 1.44 | 1.44 | 1.51 |
Method right avg rank | 1.56 | 1.56 | 1.49 |
p-value | 0.65 | 0.62 | 0.96 |
Testing metric: MSIS90 | |||
Method left avg rank | 1.66 | 1.66 | 1.47 |
Method right avg rank | 1.34 | 1.34 | 1.53 |
p-value | 0.002 | 0.003 | 0.94 |
Testing metric: MSIS98 | |||
Method left avg rank | 1.80 | 1.83 | 1.39 |
Method right avg rank | 1.20 | 1.17 | 1.61 |
p-value | 5.93e-14 | 5.04e-16 | 0.44 |
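The difference between rank-based and magnitude-based comparisons can be made concrete with a small sketch. The helper names are hypothetical, and the test uses the normal approximation without tie or continuity corrections, which need not match the exact procedure behind Table 5.

```python
import math


def average_ranks(errors_a, errors_b):
    """Average per-series rank of two methods (1 = better, ties share 1.5)."""
    rank_a = 0.0
    for a, b in zip(errors_a, errors_b):
        rank_a += 1.0 if a < b else 2.0 if a > b else 1.5
    rank_a /= len(errors_a)
    return rank_a, 3.0 - rank_a


def wilcoxon_signed_rank(errors_a, errors_b):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.

    Zero differences are dropped and tied absolute differences receive
    average ranks. Returns the positive-rank sum and the p-value.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0  # average rank of the tie group
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return w_plus, math.erfc(abs(z) / math.sqrt(2.0))
```

As a toy example, take errors `[1.0, 2.0, 3.0, 10.0]` for one method and `[1.5, 2.5, 3.5, 4.0]` for another. The first method wins on three of the four series and therefore has the better average rank, even though its mean error is larger; this is the sense in which rankings ignore the size of the differences.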
7 Ablation study
Instead of assigning Cauchy priors to the initial seasonal factors as in the original LGT model of Smyl et al. [22], we utilise horseshoe priors (see the hierarchy (23)-(26)). As previously discussed (Section 4.1), these are a special class of priors that encourage sparsity by concentrating prior probability mass around the origin. If the log-seasonality terms are all shrunk to zero, then the multiplicative seasonality terms will be equal to one and no seasonality adjustment will occur. The motivation behind using these types of priors is to provide some robustness in the case that the user specifies seasonality, but there is no evidence in the data to support it. It is therefore of interest to test the performance of the horseshoe priors vis-à-vis Cauchy priors, which do not encourage sparsity.
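The sparsity-inducing behaviour of the horseshoe prior can be checked with a short Monte Carlo sketch. It assumes a global scale fixed at one and unit-scale Cauchy draws, which simplifies the full hierarchy used in the model, and the function names are hypothetical. Both priors are heavy-tailed, but the horseshoe places noticeably more mass close to zero.

```python
import math
import random


def half_cauchy(rng, scale=1.0):
    """Draw from a half-Cauchy distribution by inverse-CDF sampling."""
    return scale * abs(math.tan(math.pi * (rng.random() - 0.5)))


def horseshoe_draw(rng, tau=1.0):
    """Scale mixture: lambda ~ C+(0, 1), then beta | lambda ~ N(0, (tau * lambda)^2)."""
    lam = half_cauchy(rng)
    return rng.gauss(0.0, tau * lam)


def cauchy_draw(rng, scale=1.0):
    """Draw from a zero-centred Cauchy distribution."""
    return scale * math.tan(math.pi * (rng.random() - 0.5))


def mass_within(draws, eps):
    """Fraction of draws with absolute value below eps."""
    return sum(abs(d) < eps for d in draws) / len(draws)


def mass_beyond(draws, bound):
    """Fraction of draws with absolute value above bound."""
    return sum(abs(d) > bound for d in draws) / len(draws)


def compare_priors(n=50000, seed=1):
    """Return near-zero and tail masses for horseshoe and Cauchy draws."""
    rng = random.Random(seed)
    hs = [horseshoe_draw(rng) for _ in range(n)]
    ca = [cauchy_draw(rng) for _ in range(n)]
    return {
        "horseshoe": (mass_within(hs, 0.1), mass_beyond(hs, 10.0)),
        "cauchy": (mass_within(ca, 0.1), mass_beyond(ca, 10.0)),
    }
```

In `compare_priors()`, the first entry of each pair is the prior mass within 0.1 of the origin and the second is the mass beyond 10 in absolute value; the thresholds are arbitrary choices for illustration.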
Table 6 summarizes the results of the ablation test. We applied the LSGT and Stan SGT/LGT models with seasonality to the monthly and yearly series. For the monthly series, we tried both horseshoe and Cauchy priors for the LSGT with a seasonality of 12. The upper part of Table 6 shows the results for these two priors; for this data, which likely has strong seasonal effects, there is no real difference between the performance of the Cauchy and horseshoe priors, as they both have heavy tails. We are also interested in how robust the two priors are, and in their ability to detect that no seasonality is present even when a seasonal model is specified. The lower half of Table 6 compares the results of the non-seasonal models and seasonal models with an arbitrarily chosen periodicity applied to the yearly series. Models that use horseshoe priors remain competitive, while models with Cauchy priors perform worse under both accuracy metrics. This suggests that the horseshoe priors are more robust and likely to achieve better results even when a seasonal model is accidentally chosen for series that may not have much evidence of seasonality. The original SGT Stan implementation used a larger scale parameter of the Cauchy prior, i.e., a heavier tail, which has poorer ability to shrink towards zero. The final entry of Table 6 shows that this choice of prior results in very poor performance in comparison to the use of the horseshoe prior.
Model | sMAPE | MASE |
---|---|---|
Monthly series | ||
LSGT Gibbs (homoscedastic, horseshoe prior) | 13.94 | 0.83 |
LSGT Gibbs (heteroscedastic, horseshoe prior) | 13.76 | 0.82 |
LSGT Gibbs (homoscedastic, Cauchy prior) | 13.92 | 0.83 |
LSGT Gibbs (heteroscedastic, Cauchy prior) | 13.78 | 0.83 |
Yearly series | ||
LSGT Gibbs (homoscedastic error) | 14.91 | 2.55 |
LSGT Gibbs (heteroscedastic error) | 14.99 | 2.50 |
LSGT Gibbs (homoscedastic, horseshoe prior) | 15.35 | 2.62 |
LSGT Gibbs (heteroscedastic, horseshoe prior) | 15.37 | 2.55 |
LSGT Gibbs (homoscedastic, Cauchy prior) | 15.81 | 2.72 |
LSGT Gibbs (heteroscedastic, Cauchy prior) | 15.56 | 2.61 |
SGT Stan (Cauchy prior ) | 16.56 | 2.74 |
8 Conclusion
In this paper we have presented a fast and accurate Gibbs sampler for posterior exploration of the LSGT model. The LSGT is an extension of the classical exponential smoothing model which has the ability to capture a heteroscedastic error structure, and super-linear but sub-exponential trends, with non-normal errors. We have combined the seasonal and non-seasonal variants presented in the work of Smyl et al. [22] into a single formulation, and modified the model to improve statistical coherence and the efficiency of the sampling process. In comparison to the original Stan implementation, the proposed Gibbs sampler demonstrated highly accurate performance and, importantly, is much faster, significantly reducing the computational effort required to explore the posterior distribution. The novel use of horseshoe priors in place of Cauchy priors for the seasonal factors has been demonstrated to improve the robustness of the model under both seasonal and non-seasonal conditions.
Despite the new Gibbs sampler being considerably faster than the Stan implementation, it remains orders of magnitude slower than the classic ETS models. However, the LGT model is designed for the data-scarce case, rather than the setting of big data, where global models are potentially more suitable. The promising features of the LSGT model, coupled with an efficient sampling algorithm, mean that the LSGT is a feasible and attractive algorithm for real-world univariate forecasting applications, both seasonal and non-seasonal.
Appendix A Bayesian hierarchy for the LSGT model
The complete Bayesian hierarchy, including scale-mixture expansions, for the LSGT model is given below:
with
where
subject to
Appendix B Derivation of normal conjugate prior
The posterior for a normal (joint) likelihood with a normal prior also follows a normal distribution,
(31) |
then where
(32) |
The derivation is given as follows. The conditional posterior is obtained by multiplying the normal density,
As only the exponential terms depend on , we drop the other terms and get
Then we expand the square term in the summation,
Again, if we drop the constants,
which is proportional to a normal distribution with mean and variance . If we further tidy up the above posterior,
thus we get
and
which matches (32).
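The conjugate update can be sketched numerically as follows. The notation is an assumption, since it is chosen for the example rather than taken from (31) and (32): each observation is modelled as normal with mean `offset[t] + x[t] * theta` and variance `sigma2[t]`, and `theta` has a normal prior.

```python
import math
import random


def normal_posterior(y, x, offset, sigma2, prior_mean, prior_var):
    """Posterior mean and variance of theta under a normal prior.

    Assumed model: y[t] ~ N(offset[t] + x[t] * theta, sigma2[t]).
    Precisions add, and the mean is a precision-weighted combination.
    """
    precision = 1.0 / prior_var + sum(xi * xi / s for xi, s in zip(x, sigma2))
    weighted = prior_mean / prior_var + sum(
        xi * (yi - ci) / s for yi, xi, ci, s in zip(y, x, offset, sigma2))
    return weighted / precision, 1.0 / precision


def sample_theta(y, x, offset, sigma2, prior_mean, prior_var, rng):
    """One Gibbs draw of theta from its normal conditional posterior."""
    mean, var = normal_posterior(y, x, offset, sigma2, prior_mean, prior_var)
    return rng.gauss(mean, math.sqrt(var))
```

With three unit-variance observations 1, 2 and 3, unit regressors, zero offsets and a standard normal prior, the posterior precision is 4, giving a posterior mean of 1.5 and a variance of 0.25.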
Appendix C Derivation of the conditional for
The initial value of the local trend is fitted for the non-seasonal model, i.e., . From (16), (14) can be expressed w.r.t. ,
where denotes the remaining constant at time . The posterior distribution follows the pattern in (31) which can be derived by (32), with and . However, it is much more complicated to directly calculate the remaining constant in this case. Note that , the posterior mean can be expressed in an alternative form from (32) by substituting with , so that
given the current value of .
Appendix D Derivation of the gradients for sampling and
The gradients are calculated using the chain rule. We first derive the gradients for as follows. With defined in (27),
Since , we get
From (18), we have
Then, we calculate and recursively. According to (14) and for the non-seasonal model, we get
With (15), we then derive
and
with the initial states being
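The recursive structure of these gradients can be illustrated on a much simpler model. The sketch below uses simple exponential smoothing with a squared-error loss as a stand-in for the LSGT likelihood, so the model, the loss and the function names are all assumptions; the point is only to show how the derivative of the state is propagated alongside the state itself and then checked against a finite difference.

```python
def ses_loss_and_grad(y, alpha, level0):
    """Squared-error loss of simple exponential smoothing and d(loss)/d(alpha)."""
    level, dlevel = level0, 0.0  # state and its derivative w.r.t. alpha
    loss, grad = 0.0, 0.0
    for obs in y:
        err = obs - level            # one-step-ahead error; forecast is the level
        loss += 0.5 * err * err
        grad += -err * dlevel        # chain rule through the forecast
        # Differentiate level_t = alpha * obs + (1 - alpha) * level_{t-1}.
        dlevel = obs - level + (1.0 - alpha) * dlevel
        level = alpha * obs + (1.0 - alpha) * level
    return loss, grad


def finite_difference(y, alpha, level0, h=1e-6):
    """Central finite-difference approximation of d(loss)/d(alpha)."""
    up, _ = ses_loss_and_grad(y, alpha + h, level0)
    down, _ = ses_loss_and_grad(y, alpha - h, level0)
    return (up - down) / (2.0 * h)


def grad_on_logit_scale(grad_alpha, alpha):
    """Gradient w.r.t. u when alpha = 1 / (1 + exp(-u))."""
    return grad_alpha * alpha * (1.0 - alpha)
```

Comparing `ses_loss_and_grad` with `finite_difference` on a short series is a cheap way to validate a hand-derived recursion before using it inside a sampler. The last helper applies the extra factor needed when the smoothing parameter is sampled on the logit scale.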
Appendix E Derivation of the gradients for sampling initial seasonal factors
From (27), we first get
Since ,
With (14) we can obtain
where terms containing are dropped for simplicity since they are equal to zero in the seasonal version. Then, from (15) and (17), we obtain the following recursively
with initial states being
and
From (18), we have
which can be obtained using the chain rule with the components derived previously.
References
- Andrawis and Atiya [2009] Robert R Andrawis and Amir F Atiya. A new Bayesian formulation for Holt’s exponential smoothing. Journal of Forecasting, 28(3):218–234, 2009.
- Bermúdez et al. [2009] José D Bermúdez, Ana Corberán-Vallet, and Enriqueta Vercher. Multivariate exponential smoothing: A Bayesian forecast approach based on simulation. Mathematics and Computers in Simulation, 79(5):1761–1769, 2009.
- Bermúdez et al. [2010] José D Bermúdez, José Vicente Segura, and Enriqueta Vercher. Bayesian forecasting with the Holt–Winters model. Journal of the Operational Research Society, 61(1):164–171, 2010.
- Gardner and Mckenzie [1985] Everette S. Gardner and Ed. Mckenzie. Forecasting trends in time series. Management Science, 31(10):1237–1246, 1985. doi: 10.1287/mnsc.31.10.1237.
- Hoffman and Gelman [2014] Matthew D Hoffman and Andrew Gelman. The No-U-Turn sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593–1623, 2014.
- Holt [2004] Charles C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 20(1):5–10, 2004. ISSN 0169-2070. doi: https://doi.org/10.1016/j.ijforecast.2003.09.015. URL https://www.sciencedirect.com/science/article/pii/S0169207003001134.
- Hyndman [2018] Rob Hyndman. Mcomp: Data from the M-Competitions, 2018. URL https://CRAN.R-project.org/package=Mcomp. R package version 2.8.
- Hyndman et al. [2024] Rob Hyndman, George Athanasopoulos, Christoph Bergmeir, Gabriel Caceres, Leanne Chhay, Mitchell O’Hara-Wild, Fotios Petropoulos, Slava Razbash, Earo Wang, and Farah Yasmeen. forecast: Forecasting functions for time series and linear models, 2024. URL http://pkg.robjhyndman.com/forecast. R package version 8.7.
- Hyndman and Athanasopoulos [2021] Rob J Hyndman and George Athanasopoulos. Forecasting: principles and practice, 3rd edition. OTexts, 2021.
- Kullback and Leibler [1951] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
- Lange et al. [1989] Kenneth L Lange, Roderick JA Little, and Jeremy MG Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881–896, 1989.
- Liu [1994] Jun S Liu. The collapsed Gibbs sampler in Bayesian computations with applications to a gene regulation problem. Journal of the American Statistical Association, 89(427):958–966, 1994.
- Makalic and Schmidt [2015] Enes Makalic and Daniel F Schmidt. A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1):179–182, 2015.
- Makridakis and Hibon [2000] Spyros Makridakis and Michele Hibon. The M3-competition: results, conclusions and implications. International journal of forecasting, 16(4):451–476, 2000.
- Makridakis et al. [2020] Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020.
- McElreath [2018] Richard McElreath. Statistical rethinking: A Bayesian course with examples in R and Stan. Chapman and Hall/CRC, 2018.
- O’Hara-Wild et al. [2021] Mitchell O’Hara-Wild, Rob Hyndman, and Earo Wang. fable: Forecasting Models for Tidy Time Series, 2021. URL https://CRAN.R-project.org/package=fable. R package version 0.3.1.
- Plummer [2003] Martyn Plummer. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria, 2003.
- Robert [2015] Christian P. Robert. The Metropolis–Hastings Algorithm, pages 1–15. John Wiley & Sons, Ltd, 2015. ISBN 9781118445112. doi: https://doi.org/10.1002/9781118445112.stat07834. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9781118445112.stat07834.
- Robert and Casella [1999] Christian P Robert and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999.
- Schmidt and Makalic [2020] Daniel F Schmidt and Enes Makalic. Bayesian generalized horseshoe estimation of generalized linear models. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2019, Würzburg, Germany, September 16–20, 2019, Proceedings, Part II, pages 598–613. Springer, 2020.
- Smyl et al. [2024] Slawek Smyl, Christoph Bergmeir, Alexander Dokumentov, Xueying Long, Erwin Wibowo, and Daniel Schmidt. Local and global trend Bayesian exponential smoothing models. International Journal of Forecasting, 2024.
- Stan Development Team [2022] Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, 2022. URL http://mc-stan.org/. Version 2.31.0.
- Stan Development Team [2023] Stan Development Team. RStan: the R interface to Stan, 2023. URL https://mc-stan.org/. R package version 2.21.8.
- Titsias and Papaspiliopoulos [2018] Michalis K Titsias and Omiros Papaspiliopoulos. Auxiliary gradient-based sampling algorithms. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(4):749–767, 2018.
- Wand et al. [2011] Matthew P Wand, John T Ormerod, Simone A Padoan, and Rudolf Frühwirth. Mean field variational Bayes for elaborate distributions. 2011.
- Winters [1960] Peter R Winters. Forecasting sales by exponentially weighted moving averages. Management science, 6(3):324–342, 1960.