Rediscovering Bottom-Up: Effective Forecasting in Temporal Hierarchies

Lukas Neubauer
TU Wien
[email protected]
Peter Filzmoser
TU Wien
[email protected]

Abstract

Forecast reconciliation has become a prominent topic in recent forecasting literature, with a primary distinction made between cross-sectional and temporal hierarchies. This work focuses on temporal hierarchies, such as aggregating monthly time series data to annual data. We explore the impact of various forecast reconciliation methods on temporally aggregated ARIMA models, thereby bridging the fields of hierarchical forecast reconciliation and temporal aggregation both theoretically and experimentally. Our paper is the first to theoretically examine the effects of temporal hierarchical forecast reconciliation, demonstrating that the optimal method aligns with a bottom-up aggregation approach. To assess the practical implications and performance of the reconciled forecasts, we conduct a series of simulation studies, confirming that the findings extend to more complex models. This result helps explain the strong performance of the bottom-up approach observed in many prior studies. Finally, we apply our methods to real data examples, where we observe similar results.

Keywords Temporal Hierarchical Forecast Reconciliation, Temporal Aggregation, Bottom-Up

1 Introduction

Forecast reconciliation has been a very popular topic in recent forecasting literature. It addresses the question of how to properly forecast time series that have been aggregated in a certain way. This aggregation can be cross-sectional, where a collection of time series is aggregated across different variables such as location or organizational unit. Alternatively, the time series can be aggregated on a temporal basis, for example from monthly to quarterly and annual time series. Naturally, both types of aggregation can be combined, leading to cross-temporal hierarchies.

The field of hierarchical forecast reconciliation investigates how to forecast such hierarchies so that the resulting forecasts match the aggregation properties of the hierarchy. In addition, it is often examined how reconciliation methods, which yield so-called coherent forecasts, perform compared to the original, possibly non-coherent forecasts. A very recent and extensive review of forecast reconciliation is given in Athanasopoulos et al., (2024). Many extensions are discussed, such as adding complex constraints (non-negativity, integer-based time series, …) or probabilistic forecasting.

In this paper we investigate temporal hierarchies as introduced by Athanasopoulos et al., (2017). The authors argue that already existing forecast reconciliation methods can be applied to temporally aggregated time series in a straightforward manner. However, no assumptions beyond unbiased base forecasts are investigated, and no work is available on the theoretical implications of reconciliation methods under specific data-generating processes. We fill this research gap and examine the performance of forecast reconciliation in temporal hierarchies within the theoretical framework of temporally aggregated time series models such as ARIMA models.

The effects of temporal aggregation in autoregressive models were first studied by Amemiya and Wu, (1972). The authors prove that if some data is generated by an autoregressive model of order $p$, then a non-overlapping aggregate of these data will also follow a similar generating process. Namely, the autoregressive order of the aggregate remains at the same order $p$ while there might exist a moving average part of a certain order as well. In fact, the authors give a maximum order for this moving average part of the process. Silvestrini and Veredas, (2008) give a generalized overview of this theory and extend it to general SARIMA models.

In temporal hierarchies, simple reconciliation techniques such as bottom-up approaches are often applied. A bottom-up forecast is generated by aggregating the forecasts of the disaggregated series. Ramírez et al., (2014) suggest that forecasts of aggregated time series can be improved by using bottom-up forecasts, as long as the aggregated model includes a significant moving average component. Without this component, the improvements may be minimal or nonexistent. In this work, we extend this analysis by considering more complex models and more intricate temporal hierarchies.

We take an additional step to analyze the performance of the bottom-up approach compared to more sophisticated reconciliation methods, thereby linking the fields of temporal forecast reconciliation and temporally aggregated time series models. Although this was experimentally examined in Athanasopoulos et al., (2017), the results have yet to be theoretically justified. In general, the connection between these two fields has not been established from a theoretical perspective.

The paper is structured as follows. In Section 2 we briefly discuss the ideas of hierarchical forecast reconciliation (Section 2.1) and recent advances, in particular regarding temporal hierarchies (Section 2.2), as well as the basics of temporally aggregated time series models (Section 2.3). This is followed by the linkage of those two topics in Section 3 where we discuss the theoretical implications of forecast reconciliation on the temporally aggregated time series. In Section 4, we investigate the discussed implications in a simulation study, followed by real data applications in Section 5. Finally, we give concluding remarks in Section 6.

2 Related Work

2.1 Hierarchical Forecast Reconciliation

First introduced by Hyndman et al., (2011), optimal forecast reconciliation is formulated as follows. Consider a multivariate time series $\mathbf{y}_{1},\dots,\mathbf{y}_{T}\in\mathbb{R}^{n}$ fulfilling possible linear constraints, namely $\mathbf{y}_{t}=S\mathbf{b}_{t}$, where $S$ is an $n\times n_{b}$ summing matrix with $n_{b}<n$, and $\mathbf{b}_{t}$ denotes the bottom level series of the hierarchy. The summing matrix is defined by the type of hierarchy of interest. For example, a matrix with $n=7$ and $n_{b}=4$ given by

\displaystyle S=\begin{pmatrix}1&1&1&1\\ 1&1&0&0\\ 0&0&1&1\\ 1&0&0&0\\ 0&1&0&0\\ 0&0&1&0\\ 0&0&0&1\end{pmatrix}

(1)

could be understood as a $3$ -level hierarchy of $4$ districts and $2$ states of one country whereby the first two districts are part of the first state and so on. Such linear constraints are naturally fulfilled on the observed data because it is set up to do so. When forecasting such series, we want the forecasts to also adhere to the same constraints which leads to so-called coherent forecasts, namely $\hat{\mathbf{y}}_{t+h|t}=S\hat{\mathbf{b}}_{t+h|t}$ , where $\hat{\mathbf{y}}_{t+h|t}$ and $\hat{\mathbf{b}}_{t+h|t}$ denote the corresponding $h$ -step forecasts. However, by forecasting each time series of the hierarchy individually we will most likely not obtain such coherent forecasts. This is where forecast reconciliation proves crucial.
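As a small numerical illustration of coherence, the following sketch builds the summing matrix of Eq. (1) with NumPy and checks whether a stacked vector satisfies the aggregation constraints. The district values and the helper name `is_coherent` are made up for illustration, and the check assumes that the bottom-level series occupy the last $n_{b}$ entries of the stacked vector, as is the case for Eq. (1).

```python
import numpy as np

# Summing matrix of Eq. (1): 1 country, 2 states, 4 districts.
S = np.array([
    [1, 1, 1, 1],  # country = sum of all districts
    [1, 1, 0, 0],  # state 1 = districts 1 and 2
    [0, 0, 1, 1],  # state 2 = districts 3 and 4
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
])


def is_coherent(y, S, tol=1e-9):
    """Return True if y equals S applied to its own bottom-level entries.

    Assumes the bottom level occupies the last n_b entries of y.
    """
    n_b = S.shape[1]
    return bool(np.allclose(y, S @ y[-n_b:], atol=tol))


b = np.array([3.0, 5.0, 2.0, 4.0])  # made-up district values
y = S @ b                            # coherent by construction

# Perturb only the top level, as independent base forecasts might do.
y_hat = y + np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

print(is_coherent(y, S), is_coherent(y_hat, S))
```

The observed vector passes the check because it is generated as $S\mathbf{b}$, while the perturbed vector fails it, which is the situation reconciliation is meant to repair.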

Historically, simple reconciliation methods such as bottom-up or top-down approaches have been and remain in use. The bottom-up approach starts at the bottom level of the hierarchy, using forecasts from this level to construct forecasts for the entire hierarchy. This method avoids information loss due to aggregation but can be challenging because the bottom level time series may be harder to forecast accurately due to noise or other factors. On the other hand, top-down reconciliation uses only top-level forecasts and requires a proportion vector $\mathbf{p}$ of size $n_{b}$ to break down these forecasts into coherent lower-level forecasts, with the main challenge being the identification of an appropriate breakdown vector.

In the seminal work by Hyndman et al., (2011), the following regression problem was proposed to achieve least-squares reconciliation. Let $\hat{\mathbf{y}}_{h}=\hat{\mathbf{y}}_{t+h|t}$ represent a vector containing $h$ -step base forecasts in a stacked manner, and let $S$ be a summation matrix defined by the hierarchy of interest. Base forecasts refer to any appropriate and possibly incoherent forecasts for the corresponding time series, which we assume are available at this stage. Then write

\displaystyle\hat{\mathbf{y}}_{h}=S\bm{\beta}_{h}+\bm{\epsilon}_{h},

(2)

where $\bm{\beta}_{h}$ are the regression coefficients indicating the unknown mean of the bottom level, and $\bm{\epsilon}_{h}$ is the unobservable reconciliation error with zero mean and covariance matrix $V_{h}$ .

Solving this regression problem using generalized least-squares leads to the generalized linear solution of $\hat{\bm{\beta}}_{h}=G_{h}\hat{\mathbf{y}}_{h}$ and reconciled forecasts $\tilde{\mathbf{y}}_{h}=SG_{h}\hat{\mathbf{y}}_{h}$ . The $n_{b}\times n$ matrix $G_{h}$ maps the base forecasts into appropriate bottom level forecasts and is given by

\displaystyle G_{h}=(S^{\prime}V_{h}^{-1}S)^{-1}S^{\prime}V_{h}^{-1}.

(3)

The regression problem was inspired by the authors’ findings that simple reconciliation methods, such as bottom-up or top-down, can all be expressed as $\tilde{\mathbf{y}}=SG\hat{\mathbf{y}}$ with an appropriate mapping matrix $G$. For example, setting $G=(0_{n_{b}\times(n-n_{b})}~I_{n_{b}})$ or $G=(\mathbf{p}~0_{n_{b}\times(n-1)})$, where $0_{r\times q}$ denotes an $r\times q$ matrix of zeros, $I_{q}$ is the identity matrix of size $q$ and $\mathbf{p}$ is a proportion vector of size $n_{b}$, yields the bottom-up or top-down methods, respectively. The regression problem (2) was introduced to determine the optimal mapping matrix in a least-squares sense.

It is further argued in Hyndman et al., (2011) that if the base forecasts are unbiased, that is $\mathbb{E}[\hat{\mathbf{y}}_{h}]=\mathbb{E}[\mathbf{y}_{t+h}]$, and $G$ is such that $SGS=S$, then the reconciled forecasts are also unbiased. The condition of $SGS=S$ is equivalent to $SG$ being a projection matrix (Panagiotelis et al., 2021), ensuring that already coherent forecasts remain unchanged in this transformation.
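The two simple mapping matrices and the condition $SGS=S$ can be checked numerically. The sketch below uses the hierarchy of Eq. (1), with $G$ of dimension $n_{b}\times n$; the proportion vector and the base forecasts are hypothetical values chosen only for illustration.

```python
import numpy as np

S = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)
n, n_b = S.shape

# Bottom-up: ignore all aggregated forecasts, keep the bottom level.
G_bu = np.hstack([np.zeros((n_b, n - n_b)), np.eye(n_b)])

# Top-down: split the top-level forecast with proportions p (hypothetical).
p = np.array([0.4, 0.3, 0.2, 0.1])
G_td = np.hstack([p[:, None], np.zeros((n_b, n - 1))])

# Made-up, incoherent base forecasts (top, 2 states, 4 districts).
y_hat = np.array([20.0, 9.0, 8.0, 4.0, 6.0, 3.0, 5.0])

y_bu = S @ G_bu @ y_hat  # top level becomes 4 + 6 + 3 + 5 = 18
y_td = S @ G_td @ y_hat  # top level stays 20, districts are 20 * p

# SGS = S holds for bottom-up but not for top-down.
print(np.allclose(S @ G_bu @ S, S), np.allclose(S @ G_td @ S, S))
```

In this example only the bottom-up mapping satisfies $SGS=S$; the top-down mapping does not, so the unbiasedness argument above does not cover it.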

One essential problem is that $V_{h}$ is not known and not even identifiable as shown in Wickramasuriya et al., (2019). In Hyndman et al., (2011), the authors avoided this by setting $V_{h}=k_{h}I_{n}$ with some consistency constant $k_{h}$ (which need not be computed since it cancels out in further calculation steps) and hence weighting all series equally, disregarding any level of aggregation or performance of base forecasts. This simplification results in an OLS solution and $G=(S^{\prime}S)^{-1}S^{\prime}$. The transformation matrix $SG=S(S^{\prime}S)^{-1}S^{\prime}$ is then an orthogonal projection with respect to the Euclidean distance, ensuring minimal change of the forecasts while reducing squared forecast errors of all levels of the hierarchy (Panagiotelis et al., 2021). A scaled reconciliation method is introduced in Hyndman et al., (2016) where the authors set $V_{h}=k_{h}\text{diag}(W_{h})$ with $W_{h}=\text{Cov}(\mathbf{y}_{t+h|h}-\hat{\mathbf{y}}_{h})$ being the covariance matrix of the base forecast errors, leading to a weighted linear solution.
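A minimal sketch of the generalized least-squares mapping of Eq. (3) may help to see why the constant $k_{h}$ is irrelevant and why the OLS choice yields an orthogonal projection. The function name `gls_mapping` and the base forecasts are assumptions made for this illustration.

```python
import numpy as np


def gls_mapping(S, V):
    """Return G = (S' V^{-1} S)^{-1} S' V^{-1} for symmetric positive definite V."""
    V_inv_S = np.linalg.solve(V, S)  # V^{-1} S without forming the inverse
    return np.linalg.solve(S.T @ V_inv_S, V_inv_S.T)


S = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)
n, n_b = S.shape

G_ols = gls_mapping(S, np.eye(n))           # V_h = I
G_scaled = gls_mapping(S, 3.0 * np.eye(n))  # V_h = k_h I with k_h = 3

P = S @ G_ols  # transformation matrix SG

# Made-up, incoherent base forecasts.
y_hat = np.array([20.0, 9.0, 8.0, 4.0, 6.0, 3.0, 5.0])
y_tilde = P @ y_hat

print(np.allclose(G_ols, G_scaled))          # k_h cancels
print(np.allclose(P, P.T), np.allclose(P @ P, P))  # symmetric and idempotent
```

The reconciled vector `y_tilde` is coherent by construction, since it lies in the column space of $S$.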

In the work of Wickramasuriya et al., (2019) the so-called minimum trace estimator is proposed by minimizing the trace of the covariance of the reconciled errors subject to unbiasedness, thus

\displaystyle\min_{G}\text{tr}~\text{Cov}(\mathbf{y}_{T+h|h}-\tilde{\mathbf{y}}_{h})=\min_{G}\text{tr}~\text{Cov}(\mathbf{y}_{T+h|h}-SG\hat{\mathbf{y}}_{h})=\min_{G}\text{tr}~SGW_{h}G^{\prime}S^{\prime},

(4)

subject to $SGS=S$. The trace of an $n\times n$ matrix $A$ is $\text{tr}(A)=\sum_{i=1}^{n}A_{ii}$. This leads to $G_{h}=(S^{\prime}W_{h}^{-1}S)^{-1}S^{\prime}W_{h}^{-1}$. Thus, instead of estimating $V_{h}$, we now need to estimate the covariance of the base forecast errors, $W_{h}$, which is more feasible. This method is equivalent to the generalized linear solution, with the regression-based solution being a special case. The transformation matrix $SG$ now represents an oblique projection. By dropping the assumption of an orthogonal projection, we allow for greater forecast improvements on average. However, Panagiotelis et al., (2021) argue that for some realizations, the performance of the reconciled forecasts may be worsened.
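The minimum trace property can be illustrated numerically. In the sketch below, $W_{h}$ is replaced by an arbitrary symmetric positive definite matrix generated at random, which is an assumption for illustration only; the objective $\text{tr}\,SGW_{h}G^{\prime}S^{\prime}$ is then compared across the minimum trace, OLS and bottom-up mappings, all of which satisfy $SGS=S$.

```python
import numpy as np


def mint_mapping(S, W):
    """Return G = (S' W^{-1} S)^{-1} S' W^{-1} for symmetric positive definite W."""
    W_inv_S = np.linalg.solve(W, S)
    return np.linalg.solve(S.T @ W_inv_S, W_inv_S.T)


def error_trace(S, G, W):
    """Objective of Eq. (4): trace of the reconciled error covariance."""
    return float(np.trace(S @ G @ W @ G.T @ S.T))


S = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)
n, n_b = S.shape

# Arbitrary positive definite stand-in for W_h (not estimated from data).
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
W = A @ A.T + n * np.eye(n)

G_mint = mint_mapping(S, W)
G_ols = mint_mapping(S, np.eye(n))
G_bu = np.hstack([np.zeros((n_b, n - n_b)), np.eye(n_b)])

traces = {
    "mint": error_trace(S, G_mint, W),
    "ols": error_trace(S, G_ols, W),
    "bu": error_trace(S, G_bu, W),
}
print(traces)
```

For any such $W_{h}$, the minimum trace value should not exceed the other two, which reflects the improvement "on average" rather than for every realization.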

Estimating $W_{h}$ presents difficulties, especially for complex hierarchies and forecast horizons $h>1$, due to the limited sample size determined by the number of top-level observations. Therefore, it may be practical to revert to simpler estimates as previously described. Additionally, Wickramasuriya et al., (2019) propose sample and shrinkage estimators by setting $W_{h}=k_{h}\hat{W}_{1}$ and $W_{h}=k_{h}(\lambda\text{diag}(\hat{W}_{1})+(1-\lambda)\hat{W}_{1}),\lambda\in(0,1)$, respectively, with appropriate consistency constants. The shrinkage estimator is particularly useful when $n>T$, which can result in a singular sample covariance matrix.

The authors of Wickramasuriya et al., (2019) also give a different type of estimator, referred to as structural scaling. It is proposed to set $W_{h}=k_{h}\text{diag}(S\mathbf{1}_{n_{b}})$, implying that each forecast error variance is scaled according to the number of bottom-level series aggregated into the corresponding series. Here, $\mathbf{1}_{n_{b}}$ is a vector with $n_{b}$ entries of one.
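The simplified estimators can be written down in a few lines. The sketch below uses made-up residuals with fewer observations than series, so that the sample covariance is singular, and a fixed shrinkage intensity $\lambda=0.5$; both are assumptions for illustration, and no data-driven choice of $\lambda$ is attempted here.

```python
import numpy as np

S = np.array([
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=float)
n, n_b = S.shape

# Made-up one-step base forecast residuals: T = 5 rows, n = 7 series.
rng = np.random.default_rng(1)
T = 5
E = rng.normal(size=(T, n))

# Sample estimator: singular here because T < n.
W1_hat = E.T @ E / T

# Shrinkage towards the diagonal with a hypothetical fixed lambda.
lam = 0.5
W_shrink = lam * np.diag(np.diag(W1_hat)) + (1 - lam) * W1_hat

# Structural scaling: row sums of S, i.e. number of bottom series per node.
W_struct = np.diag(S @ np.ones(n_b))

print(np.linalg.matrix_rank(W1_hat), np.linalg.eigvalsh(W_shrink).min())
print(np.diag(W_struct))
```

The shrinkage estimate is positive definite even though the sample covariance is rank deficient, and the structural weights are 4 for the top level, 2 for each state and 1 for each district.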

Overall, the minimum trace method addresses three key aspects. Firstly, it produces coherent forecasts, which is the most crucial factor. Secondly, as long as the base forecasts are unbiased, the reconciled forecasts will also be unbiased. Lastly, it enhances forecast performance by minimizing the forecast error variance across all series on average.

2.2 Temporal Hierarchical Forecast Reconciliation

While forecast reconciliation has not been developed with temporal hierarchies in mind, it can be applied to them naturally as discussed in Athanasopoulos et al., (2017). Temporal hierarchies allow for even more sophisticated methods for estimating the covariance matrix of the base forecast errors.

Let $y_{t}$ with $t=1,\dots,T$ be a univariate time series of interest of a certain frequency $m$ . A $k$ -aggregate, where $k$ is a factor of $m$ , is defined to be

\displaystyle y_{j}^{[k]}=\sum_{t=t^{\ast}+(j-1)k}^{t^{\ast}+jk-1}y_{t},\quad j=1,\dots,\lfloor T/k\rfloor,

(5)

where $t^{\ast}$ is the starting point of the aggregation to ensure non-overlapping aggregates. The resulting frequency is then $M_{k}=m/k$. To have a common index across all levels of aggregation, the authors set $i=1,\dots,\lfloor T/m\rfloor$ and

\displaystyle y_{M_{k}(i-1)+z}^{[k]}=y_{j}^{[k]},\quad z=1,\dots,M_{k},

(6)

such that $i$ controls the top-level steps and $z$ determines the steps within each aggregation period. That way we can write one time step of the hierarchy as the vector given by

\displaystyle\mathbf{y}_{i}=\left(y_{i}^{[m]},\dots,{\mathbf{y}_{i}^{[{k_{2}}]}}^{\prime},{\mathbf{y}_{i}^{[{k_{1}}]}}^{\prime}\right)^{\prime},

(7)

where $\mathbf{y}_{i}^{[k]}=\left(y_{M_{k}(i-1)+1}^{[k]},y_{M_{k}(i-1)+2}^{[k]},\dots,y_{M_{k}i}^{[k]}\right)^{\prime}$ denotes the stacked entries of the time series at aggregation level $k$. This implies that $\mathbf{y}_{i}=S\mathbf{y}_{i}^{[1]}$, where $S$ is an appropriate summing matrix as defined in general forecast reconciliation.

According to Athanasopoulos et al., (2017) we write the levels of aggregation in descending order as $\{k_{p},\dots,k_{2},1\}$ with $k_{p}=m$ . For a quarterly-biannual-annual aggregation scheme this yields $k\in\{4,2,1\}$ . A corresponding visualization is available in Figure 1.

Figure 1: Visualization of an annual-biannual-quarterly temporal hierarchy.
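To make the stacking of Eq. (7) concrete, the following sketch builds the temporal summing matrix for given aggregation levels and stacks a toy quarterly series into one vector per year. The function names are hypothetical, and the sketch assumes $t^{\ast}=1$ and drops any trailing incomplete period.

```python
import numpy as np


def temporal_summing_matrix(levels):
    """Summing matrix for aggregation levels in descending order, e.g. (4, 2, 1).

    The first entry is the top-level frequency m; each level k contributes
    m / k rows that sum k consecutive bottom-level observations.
    """
    m = levels[0]
    blocks = [np.kron(np.eye(m // k), np.ones((1, k))) for k in levels]
    return np.vstack(blocks)


def stack_hierarchy(y, levels):
    """Return one row per top-level period containing the stacked vector y_i."""
    m = levels[0]
    n_top = len(y) // m
    bottom = np.asarray(y, dtype=float)[: n_top * m].reshape(n_top, m)
    return bottom @ temporal_summing_matrix(levels).T


levels = (4, 2, 1)                 # annual, biannual, quarterly
S = temporal_summing_matrix(levels)
Y = stack_hierarchy(np.arange(1, 9), levels)  # two years of toy quarterly data

print(S)
print(Y)
```

For the quarterly example, the resulting $S$ coincides with the matrix in Eq. (1), and the first stacked vector is $(10, 3, 7, 1, 2, 3, 4)^{\prime}$.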

The fact that $\mathbf{y}_{i}=S\mathbf{y}_{i}^{[1]}$ suggests we can set up a very similar regression problem based on the base forecasts as in Eq. (2). The minimum trace approach then yields

\displaystyle\mathbf{\tilde{y}}_{h}=S(S^{\prime}W_{h}^{-1}S)^{-1}S^{\prime}W_{h}^{-1}\mathbf{\hat{y}}_{h},

(8)

where $\mathbf{\hat{y}}_{h}$ are the stacked base forecasts across the entire hierarchy, and $W_{h}=\text{Cov}(\mathbf{y}_{h}-\mathbf{\hat{y}}_{h})$ denotes the covariance matrix of the stacked base forecast errors. Specifically, this means that on each aggregation level $k$, we require forecasts of up to $M_{k}h$ steps ahead, which can already be challenging to obtain properly.

As in conventional forecast reconciliation, the estimation of $W_{h}$ can be difficult because the sample size is bounded by the number of observations on the top level of the aggregation hierarchy. Thus, the authors propose several simplified covariance estimators. One of them is similar to the scaled reconciliation of Hyndman et al., (2016) by setting $W_{h}=k_{h}\text{diag}(\hat{W}_{1})$, while structural scaling is also proposed as in Wickramasuriya et al., (2019).

Temporal aggregation allows for more refined methods to enhance the estimation of the covariance matrix. Nystrup et al., (2020) suggest modeling the autocorrelation structure of the forecasts, leading to four different estimators. The autocovariance scaling estimator estimates the full autocovariance matrix at each aggregation level, while the Markov scaling assumes a first-order Markov structure, estimating only lag $1$ correlations per aggregation level. Additionally, the authors propose using GLASSO to estimate the inverse cross-correlation matrix and a cross-correlation shrinkage estimator, similar to Wickramasuriya et al., (2019). It is worth noting that all correlation-based estimators can be combined with either variance scaling or structural scaling variances.

In a subsequent work by Nystrup et al., (2021), the authors explore dimension reduction. They propose an eigendecomposition of the cross-correlation matrix and construct a filtered precision matrix by selecting the first few eigenvectors and applying shrinkage to the eigenvalues. Such an estimator is especially useful when forecasting a very deep and complex hierarchy.

2.3 Temporal Aggregation

Temporal aggregation of series was first studied in the seminal work of Amemiya and Wu, (1972). A rather recent review of the most relevant advances in this field can be found in Silvestrini and Veredas, (2008). The models discussed in these works are mostly ARIMA-based, and we will briefly explain the essential ideas and results.

Consider a univariate time series $y_{t},t=1,\dots,T$ observed at some frequency. A $k$ -aggregate series is defined, equivalent to Eq. (5), by

\displaystyle y_{t}^{\ast}=\sum_{i=0}^{k-1}w_{i}y_{t-i}.

(9)

To obtain non-overlapping aggregates, a new time scale is introduced by setting $T=kt$, and thus $y^{\ast}_{T}=y^{\ast}_{kt}$ with $y^{\ast}_{T+1}=y^{\ast}_{k(t+1)}$. Hence, $y^{\ast}$ is a series at lower frequency because observations are only available every $k$ time steps.

The more general definition of Eq. (9) allows for different types of aggregation. The most common one is the so-called flow aggregation with $w_{i}=1$ for all $i$. This type of aggregation is just the sum in each aggregation period. Another type is stock aggregation, where one usually sets $w_{0}=1$ and $w_{i}=0$ for $i\geq 1$. Thus, the period’s aggregate is equal to the last observation in each period. As in most literature, we also focus on the flow type of aggregation.
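The difference between flow and stock aggregation is easy to see on a toy series. The helper below is a hypothetical sketch that assumes aggregation starts at the first observation and drops any trailing incomplete period.

```python
import numpy as np


def aggregate(y, k, kind="flow"):
    """Non-overlapping k-aggregation of a series.

    kind="flow":  sum of the k observations in each period (all weights 1).
    kind="stock": last observation of each period (weight 1 on lag 0 only).
    """
    y = np.asarray(y, dtype=float)
    n_periods = len(y) // k
    blocks = y[: n_periods * k].reshape(n_periods, k)
    if kind == "flow":
        return blocks.sum(axis=1)
    if kind == "stock":
        return blocks[:, -1]
    raise ValueError("kind must be 'flow' or 'stock'")


y = np.arange(1, 10)  # toy series 1, ..., 9; the ninth value is dropped for k = 4
flow = aggregate(y, 4, kind="flow")
stock = aggregate(y, 4, kind="stock")
print(flow, stock)
```

With $k=4$ the flow aggregates are the period sums $10$ and $26$, while the stock aggregates are the end-of-period values $4$ and $8$.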

Now assume that the higher frequency series $y$ seen as a random process is an $\text{ARIMA}(p,d,q)$ model. We are interested in the model specification of $y^{\ast}$ after aggregation. The theory gives us that $y^{\ast}$ is again an ARIMA model as discussed in Silvestrini and Veredas, (2008, Section 3.3). We have that

\displaystyle y^{\ast}\sim\text{ARIMA}(p,d,r),\quad r\leq\left\lfloor\frac{p(k-1)+(d+1)(k-1)+q}{k}\right\rfloor.

(10)

The autoregressive and integrated orders of the aggregated series remain unchanged, while the moving average order increases. The theory also provides a method to compute the exact parameters of the aggregated series. Specifically, the roots of the autoregressive polynomial of the AR component of the aggregated series are equal to the $k$ -th power of the AR roots of the disaggregated model. Thus, assuming stationarity, the AR effect in the aggregate model diminishes as the aggregation period increases. Simultaneously, the MA effect becomes more significant. However, calculating the MA coefficients is more complex. These coefficients can be determined by comparing the autocorrelation functions of the aggregated model and the transformed disaggregated model, leading to several potentially non-linear equations. The unknowns in these equations include the MA coefficients, the innovation variance, and a possible non-zero mean.
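The bound of Eq. (10) and the behaviour of the AR part can be evaluated directly. The sketch below computes the maximum moving average order for a few configurations and, for the special case of an $\text{AR}(1)$ model, the AR coefficient of the aggregate as the $k$-th power of the original coefficient. The function names are chosen for illustration only.

```python
def max_ma_order(p, d, q, k):
    """Upper bound on the MA order of a k-aggregated ARIMA(p, d, q), Eq. (10)."""
    return (p * (k - 1) + (d + 1) * (k - 1) + q) // k


def aggregated_ar1_coefficient(phi, k):
    """AR coefficient of a k-aggregated AR(1): the k-th power of phi."""
    return phi ** k


# AR(1) aggregated over k = 4 periods: at most ARMA(1, 1).
print(max_ma_order(p=1, d=0, q=0, k=4))

# A richer model, ARIMA(2, 1, 1) with k = 4.
print(max_ma_order(p=2, d=1, q=1, k=4))

# The AR effect shrinks quickly with the aggregation period.
for k in (1, 4, 12):
    print(k, aggregated_ar1_coefficient(0.8, k))
```

For $\phi=0.8$ the aggregated AR coefficient drops from $0.8$ to about $0.41$ for $k=4$ and to about $0.07$ for $k=12$, which illustrates the diminishing AR effect under stationarity.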

This theory has also been extended to more complex ARIMA models like ARIMAX or even SARIMA, where the results are very similar. There are even results for volatility models such as GARCH.

The reason why the aggregated MA order in Eq. (10) is only bounded above by the right-hand side is due to the possibility of polynomial term cancellation in the disaggregated model, which can result in much simpler models. An extreme example is provided in Ramírez et al., (2014), where the authors show that if the disaggregated model is an $\text{AR}(9)$ model with non-zero coefficients at lags $3,6,9$ , then the $3$ -aggregated series will simplify to an $\text{AR}(3)$ model. This simplification is reasonable because the disaggregated series already contains the essential aggregation information.

In the same work of Ramírez et al., (2014), the forecast performance of aggregation is also investigated. The authors argue that if the aggregated series exhibits a moving average part, then its forecast error can be reduced by performing a corresponding bottom-up forecast using the disaggregated series. This makes sense since aggregation leads to a loss of information. However, this is only the case if the moving average part is significant. If not, then the improvements are very small or even non-existent.

Since it might not be clear how such model aggregation works on paper, we provide a thorough calculation for the simple $\text{AR}(1)$ model in Appendix A.

3 Temporal Hierarchical Forecast Reconciliation in Temporally Aggregated Models

In this section, we will theoretically integrate the fields of temporal forecast reconciliation and temporally aggregated ARIMA models. To the best of our knowledge, this is the first time such an integration has been attempted. While Athanasopoulos et al., (2017) utilized the theory of temporally aggregated ARIMA models, their approach was primarily experimental. They examined the performance of temporal forecast reconciliation methods, such as variance scaling, and compared them to a simple bottom-up approach under varying levels of uncertainty. Specifically, they conducted experiments with fixed model orders and parameters, fixed orders alone, or automatically selected models based on model selection criteria. The authors found that temporal forecast reconciliation and bottom-up methods perform equally well in highly certain settings, but the performance of bottom-up methods declines when models are misspecified.

In general, the data-generating process has not been of much interest so far in the field of temporal forecast reconciliation because it has been developed as a post-hoc procedure to transform base forecasts coherently. In the theory of temporally aggregated models, the combination of forecasts of different levels to achieve coherent or even better forecasts has not been looked at.

Our contribution is as follows: Utilizing the theoretical model of aggregation, we will derive the theoretical covariance matrix of the base forecast errors, denoted as $W$, given in Lemma 1. This covariance matrix will then be employed to perform the minimum trace estimation manually. Through matrix algebra, we will demonstrate in Theorem 1 that the resulting mapping matrix $G$ corresponds to a bottom-up forecast. Consequently, we show that within the framework of aggregated ARIMA models, the optimal forecast reconciliation technique is indeed the bottom-up approach.

Building on the insights from Section 2.3, we aim to manually implement the minimum trace reconciliation method. To do this, we need the covariance matrix of the base forecast errors, which we can readily compute. To maintain simplicity, we will initially focus on the straightforward case of an $\text{AR}(1)$ model and subsequently discuss more complex models. The first result in Lemma 1 is about the covariance structure of the aggregated model. Its proof can be found in Appendix A.

Lemma 1.

The covariance matrix $W_{1}$ of $1$ -step forecast errors in a $k$ -aggregated $\text{AR}(1)$ model with parameter $\phi$ and innovation variance $\sigma^{2}$ is equal to

\displaystyle W_{1}=\begin{pmatrix}\sigma_{\ast}^{2}&\sigma^{2}\mathbf{1}_{k}^{\prime}\Phi\Phi^{\prime}\\ \sigma^{2}\Phi\Phi^{\prime}\mathbf{1}_{k}&\sigma^{2}\Phi\Phi^{\prime}\end{pmatrix}

(11)

where $\mathbf{1}_{k}$ denotes a vector of ones of length $k$, $\Phi$ is a lower triangular matrix given by

\displaystyle\Phi=\begin{pmatrix}1&0&0&\dots&0\\ \phi&1&\ddots&\ddots&\vdots\\ \vdots&\ddots&\ddots&\ddots&\vdots\\ \phi^{k-2}&\ddots&\ddots&\ddots&0\\ \phi^{k-1}&\phi^{k-2}&\dots&\phi&1\end{pmatrix},

(12)

and $\sigma_{\ast}^{2}$ denotes the innovation variance of the aggregated model.
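The bottom-level block $\sigma^{2}\Phi\Phi^{\prime}$ of Lemma 1 can be checked numerically. The sketch below builds $\Phi$ from Eq. (12) and compares $\sigma^{2}\Phi\Phi^{\prime}$ with an entrywise closed form. The closed form is our own illustration and rests on the assumption that the $j$-step forecast error of an $\text{AR}(1)$ model is $\sum_{l=1}^{j}\phi^{j-l}\varepsilon_{l}$ with uncorrelated innovations of variance $\sigma^{2}$ and $|\phi|<1$.

```python
import numpy as np


def phi_matrix(phi, k):
    """Lower triangular matrix of Eq. (12) with entries phi^(i-j) for i >= j."""
    idx = np.arange(k)
    lag = idx[:, None] - idx[None, :]  # row index minus column index
    return np.where(lag >= 0, float(phi) ** np.abs(lag), 0.0)


def bottom_error_cov(phi, sigma2, k):
    """Bottom-level block of Lemma 1: sigma^2 * Phi Phi'."""
    Phi = phi_matrix(phi, k)
    return sigma2 * Phi @ Phi.T


def bottom_error_cov_closed_form(phi, sigma2, k):
    """Entrywise covariance of the 1- to k-step AR(1) forecast errors.

    Cov(e_i, e_j) = sigma^2 * phi^|i-j| * (1 - phi^(2 min(i,j))) / (1 - phi^2),
    derived under the assumption stated in the lead-in (|phi| < 1).
    """
    steps = np.arange(1, k + 1)
    i, j = np.meshgrid(steps, steps, indexing="ij")
    lo = np.minimum(i, j)
    return sigma2 * float(phi) ** np.abs(i - j) * (1 - float(phi) ** (2 * lo)) / (1 - phi ** 2)


Phi = phi_matrix(0.8, 4)
cov_matrix_form = bottom_error_cov(0.8, 1.0, 4)
cov_closed_form = bottom_error_cov_closed_form(0.8, 1.0, 4)
print(np.allclose(cov_matrix_form, cov_closed_form))
```

Both constructions agree, so the matrix form in Lemma 1 is simply a compact way of writing the accumulated innovation variances over the $k$ steps of one aggregation period.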

Based on Lemma 1 we now manually compute the optimal unbiased reconciliation matrix, summarised in Theorem 1. The proof is available in Appendix A.

Theorem 1.

The minimum trace reconciliation method in a $k$ -aggregated $\text{AR}(1)$ model is equal to a bottom-up approach, implying that

\displaystyle SG^{\ast}=\begin{pmatrix}0&\mathbf{1}_{k}^{\prime}\\ \mathbf{0}_{k}&I_{k}\end{pmatrix},

where $\mathbf{0}_{k}$ is a vector of zeros of length $k$ and $G^{\ast}$ denotes the optimal mapping matrix from problem (2.1).

Theorem 1 indicates that the optimal unbiased reconciliation method for the aggregated $\text{AR}(1)$ model is the bottom-up approach. Consequently, the forecasts at the bottom level remain unchanged, with no potential for enhancing forecast accuracy. Conversely, the aggregated forecast is disregarded in any form of combination. This outcome elucidates why the bottom-up approach frequently demonstrates effectiveness in both simulation studies and real-world data applications, thus bolstering its practicality.
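Theorem 1 can also be verified numerically for given parameters. The following sketch assembles $W_{1}$ as in Lemma 1, computes the minimum trace mapping and compares $SG$ with the bottom-up transformation. The value used for $\sigma_{\ast}^{2}$ is a placeholder, not the true innovation variance of the aggregated model; in this sketch the outcome does not depend on it as long as $W_{1}$ remains invertible.

```python
import numpy as np


def phi_matrix(phi, k):
    """Lower triangular matrix of Eq. (12)."""
    idx = np.arange(k)
    lag = idx[:, None] - idx[None, :]
    return np.where(lag >= 0, float(phi) ** np.abs(lag), 0.0)


def lemma1_cov(phi, sigma2, sigma2_star, k):
    """Covariance matrix W_1 of Eq. (11) for a two-level hierarchy."""
    Phi = phi_matrix(phi, k)
    B = sigma2 * Phi @ Phi.T
    ones = np.ones(k)
    W = np.empty((k + 1, k + 1))
    W[0, 0] = sigma2_star
    W[0, 1:] = ones @ B
    W[1:, 0] = B @ ones
    W[1:, 1:] = B
    return W


k, phi, sigma2 = 4, 0.8, 1.0
S = np.vstack([np.ones((1, k)), np.eye(k)])

# Variance of the aggregated bottom-up forecast error.
Phi = phi_matrix(phi, k)
bu_var = sigma2 * np.ones(k) @ Phi @ Phi.T @ np.ones(k)

# Placeholder for sigma_*^2 (assumption): any value different from bu_var
# keeps W_1 invertible; it is NOT derived from the aggregated ARMA model.
sigma2_star = 1.2 * bu_var

W = lemma1_cov(phi, sigma2, sigma2_star, k)
W_inv_S = np.linalg.solve(W, S)
G = np.linalg.solve(S.T @ W_inv_S, W_inv_S.T)  # minimum trace mapping
SG = S @ G

SG_bottom_up = np.block([
    [np.zeros((1, 1)), np.ones((1, k))],
    [np.zeros((k, 1)), np.eye(k)],
])
print(np.allclose(SG, SG_bottom_up, atol=1e-8))
```

The first column of $SG$ is numerically zero, so the top-level base forecast receives no weight, in line with the statement of the theorem.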

Before moving on to the experimental part of this study, we aim to illustrate how this theorem works using a sample-based approach. In Figure 2, the average transformation matrix $SG$ for a two-level hierarchy is presented. To do this, we simulated 100 models and estimated the complete sample covariance matrix based on the simulations. The models used consist of an $\text{AR}(1)$ model with parameters $\phi=0.8,\sigma^{2}=1$ at the lower level, which is then combined into an $\text{ARMA}(1,1)$ model at the higher level of the hierarchy with $k\in\{4,1\}$ . The nodes of the hierarchy are shown on both axes, with $1-1$ representing the entry at the top level and $2-i$ representing the $i$ -th step of the lower level. This precisely specifies the transformation matrix as used in Theorem 1. The first row shows the effects of the base forecasts on the reconciled top-level forecast. It is evident that there is little impact from the top-level base forecast, with nearly equal weights close to 1 for the bottom level base forecasts. Similarly, the following $4$ rows demonstrate the weights for the reconciled bottom level forecasts, with a zero column followed by the identity matrix $I_{4}$ . This indicates that the reconciled bottom level forecasts closely match the bottom level base forecasts. In summary, the tendency for a bottom-up reconciliation approach is clear.

Figure 2: Full sample transformation matrix $SG$ for $n=100,\phi=0.8,h=1,k\in\{4,1\},\sigma^{2}=1$. The colors correspond to the mean value over $100$ repetitions. The standard errors are given in parentheses.

In Section 4, we further investigate this theorem experimentally to gain a deeper understanding. A natural extension of Theorem 1 is to increase the depth of the hierarchy. Figure 13 in Appendix B shows the transformation matrix $SG$ for a three-level hierarchy with $k\in\{4,2,1\}$, similar to Figure 2. While the results are less clear-cut, the tendency towards a bottom-up approach remains evident. Specifically, the reconciled first-level forecast is constructed using similar components from the lowest level, whereas the reconciled bottom level relies solely on base bottom level data. The standard errors, indicated in parentheses, show that the first three columns are close to zero, meaning that the forecasts for the first and second levels of the hierarchy do not carry much weight. In other words, the forecast for the first half-year is derived from the first two quarters, and similarly for the second half-year.

4 Experiments

In this section, we experimentally investigate different types of forecast reconciliation methods in the framework of temporally aggregated time series models and beyond.

We evaluate the results based on percentage errors, namely for aggregation parameter $k$ we obtain a relative mean squared error of

\displaystyle\text{rMSE}^{[k]}(\tilde{\mathbf{y}},\hat{\mathbf{y}})=\frac{\sum_{i}\left\|\tilde{\mathbf{y}}_{i}^{[k]}-\mathbf{y}_{i}^{[k]}\right\|_{2}^{2}}{\sum_{i}\left\|\hat{\mathbf{y}}_{i}^{[k]}-\mathbf{y}_{i}^{[k]}\right\|_{2}^{2}}-1,

where $\tilde{\mathbf{y}}_{i}^{[k]}$ denotes the $i$ -th vector of reconciled forecasts of aggregation level $k$ , $\hat{\mathbf{y}}_{i}^{[k]}$ is the $i$ -th vector of the base forecasts of aggregation level $k$ , and $\|\cdot\|_{2}^{2}$ is the squared Euclidean norm. We analyze both in-sample (training) reconciliation errors and out-of-sample (test) reconciliation errors to assess generalizability, aggregating the corresponding observations accordingly. Depending on the level of aggregation, we may encounter multi-step ahead forecasts. To simplify, we aggregate these multi-step forecasts, providing a single error measure for each aggregation level.

The test reconciliation forecasts are acquired through the following procedure. The reconciliation method employed is trained exclusively on the training data, meaning that the covariance matrix and the corresponding base ARIMA models are estimated solely based on the training data. Subsequently, forecasts for $h$ steps ahead are generated for the test data in a cumulative manner, effectively utilizing the test data for the base test forecasts.

MSE values are computed for each level of the hierarchy as well as on an overall level by taking the sum of MSEs across all levels. The reason we consider MSE instead of a different error measure is that the minimum trace reconciliation method exactly minimizes the sum of the error variances.

For a robustness check of the results, we also consider a relative mean absolute error and use it to calculate percentage errors. Namely,

\displaystyle\text{rMAE}^{[k]}(\tilde{\mathbf{y}},\hat{\mathbf{y}})=\frac{\sum_{i}\left\|\tilde{\mathbf{y}}_{i}^{[k]}-\mathbf{y}_{i}^{[k]}\right\|_{1}}{\sum_{i}\left\|\hat{\mathbf{y}}_{i}^{[k]}-\mathbf{y}_{i}^{[k]}\right\|_{1}}-1,

(13)

where $\|\cdot\|_{1}$ is the absolute-value norm. This error measure is inherently less sensitive to outliers. We have focused on reporting results for rMSE to keep things concise. The conclusions remain consistent even when considering rMAE or similar relative error measures.

Overall, if a percentage error is below $0$ , it indicates that the reconciled forecasts perform better, whereas errors above $0$ suggest the opposite. It is important to note that we are only examining relative errors, focusing on the performance of the temporally reconciled forecasts rather than the base forecasts. Our aim is to evaluate how different types of temporal forecast reconciliation methods perform.
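Both relative error measures are straightforward to compute for one aggregation level. The sketch below uses a hypothetical helper `relative_error` and made-up numbers; the inputs are arrays whose rows correspond to the vectors $\tilde{\mathbf{y}}_{i}^{[k]}$, $\hat{\mathbf{y}}_{i}^{[k]}$ and $\mathbf{y}_{i}^{[k]}$.

```python
import numpy as np


def relative_error(reconciled, base, actual, kind="mse"):
    """Relative error of reconciled versus base forecasts for one level.

    kind="mse": ratio of summed squared errors minus 1 (rMSE).
    kind="mae": ratio of summed absolute errors minus 1 (rMAE).
    Negative values mean the reconciled forecasts are more accurate.
    """
    reconciled, base, actual = (
        np.asarray(a, dtype=float) for a in (reconciled, base, actual)
    )
    if kind == "mse":
        num = np.sum((reconciled - actual) ** 2)
        den = np.sum((base - actual) ** 2)
    elif kind == "mae":
        num = np.sum(np.abs(reconciled - actual))
        den = np.sum(np.abs(base - actual))
    else:
        raise ValueError("kind must be 'mse' or 'mae'")
    return float(num / den - 1.0)


# Made-up example with two forecast vectors of length one.
actual = [[10.0], [12.0]]
base = [[12.0], [12.0]]        # absolute errors 2 and 0
reconciled = [[11.0], [12.0]]  # absolute errors 1 and 0

rmse = relative_error(reconciled, base, actual, kind="mse")
rmae = relative_error(reconciled, base, actual, kind="mae")
print(rmse, rmae)
```

Here the reconciled forecasts halve the only non-zero error, giving an rMSE of $-0.75$ and an rMAE of $-0.5$, both indicating an improvement over the base forecasts.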

4.1 Autoregressive Models of Order 1

In the first experiment, we want to demonstrate the implications of Theorem 1. We simulate stationary $\text{AR}(1)$ data on the bottom level of the hierarchy and aggregate them to obtain the remaining levels of the hierarchy. The parameters we vary are

• Sample size on the top level $n=20,50,100$,

• AR parameter $\phi=-0.9,\dots,0.9$,

• Innovation variance on the bottom level $\sigma^{2}=1,5$,

• Hierarchy size $k\in\{4,1\},\{5,1\},\{12,4,1\}$,

• Forecast horizon $h=1,2$, and

• Fixed order of the ARMA models to remove model uncertainty, which corresponds to Scenario $2$ of Athanasopoulos et al. (2017), or automated model selection (Scenario $3$).

For each setting we simulate $N=50$ time series and compute training and test rMSE values. The training data always consist of $75\%$ of the total data.
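One replication of this design can be sketched as follows. This is a minimal illustration, assuming that aggregation means summing non-overlapping blocks of bottom-level observations and that a burn-in period is discarded; the function names, the burn-in length, and the seed are hypothetical and not taken from the paper.

```python
import numpy as np

def simulate_ar1(phi, sigma2, n_obs, burn_in=200, rng=None):
    """Simulate a stationary AR(1) series y_t = phi * y_{t-1} + eps_t."""
    rng = np.random.default_rng(rng)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n_obs + burn_in)
    y = np.zeros(n_obs + burn_in)
    for t in range(1, n_obs + burn_in):
        y[t] = phi * y[t - 1] + eps[t]
    return y[burn_in:]  # drop the burn-in so the start value has no influence

def aggregate(y, m):
    """Sum non-overlapping blocks of m consecutive observations."""
    y = np.asarray(y, dtype=float)
    if len(y) % m != 0:
        raise ValueError("series length must be a multiple of m")
    return y.reshape(-1, m).sum(axis=1)

# One setting of the design: n = 20 top-level observations, hierarchy {4, 1}.
n_top, k = 20, 4
bottom = simulate_ar1(phi=0.5, sigma2=1.0, n_obs=n_top * k, rng=0)
top = aggregate(bottom, k)
n_train = int(0.75 * n_top)  # 75% of the top-level observations for training
```

For a deeper hierarchy such as $\{12,4,1\}$, the same `aggregate` helper would be applied to the bottom series with each aggregation width in turn.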

The covariance estimators we focus on in this simulation are

• OLS: $\hat{W}_{h}=k_{h}I$,

• Full Cov.: $\hat{W}_{h}=\frac{1}{0.75n}\sum_{i=1}^{0.75n}\left(\hat{\mathbf{e}}_{i}^{(h)}\right)\left(\hat{\mathbf{e}}_{i}^{(h)}\right)^{\prime}$, where $\hat{\mathbf{e}}_{i}^{(h)}$ denotes the $i$-th vector of $h$-step residuals of the base forecasts, and

• Spectral Scaling (Nystrup et al., 2021):

  1. Shrink the empirical cross-correlation matrix $R$ to $R_{\text{shrink}}=(1-\nu)R+\nu I$.

  2. Eigen-decompose this shrunk cross-correlation matrix by $R_{\text{shrink}}=V\Lambda_{\text{shrink}}V^{\prime}$ where $R=V\Lambda V^{\prime}$.

  3. Reconstruct the filtered precision matrix by $Q=(WAW^{\prime}+cI)^{-1}$ such that $W$ contains the first $n_{\text{eig}}$ columns of $V$ and $A=\text{diag}((1-\nu)\lambda_{1}+\nu-c,\dots,(1-\nu)\lambda_{n_{\text{eig}}}+\nu-c)$ with $c$ being the average of the remaining smallest shrunken eigenvalues.

  4. Set $\hat{W}_{h}^{-1}=D_{\text{var}}^{-1/2}QD_{\text{var}}^{-1/2}$ where $D_{\text{var}}$ corresponds to variance scaling.

  The two hyperparameters $\nu,n_{\text{eig}}$ are chosen in a time series cross-validation procedure. The authors do not follow this procedure and rather rely on an optimally chosen shrinkage parameter $\nu$ (Ledoit and Wolf, 2012) and a fixed number of chosen eigenvectors $n_{\text{eig}}$.
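The four spectral scaling steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of Nystrup et al. (2021): the residuals are centered by `np.cov`, $D_{\text{var}}$ is taken to be the diagonal matrix of residual variances, and $\nu$ and $n_{\text{eig}}$ are passed in by hand instead of being tuned. The function name `spectral_precision` is hypothetical.

```python
import numpy as np

def spectral_precision(residuals, nu, n_eig):
    """Filtered precision matrix following steps 1-4 of the spectral scaling recipe.

    residuals: array of shape (n_obs, m) with base forecast residuals.
    nu:        shrinkage intensity in (0, 1].
    n_eig:     number of leading eigenvectors kept, with 0 < n_eig < m.
    """
    E = np.asarray(residuals, dtype=float)
    m = E.shape[1]
    if not 0 < n_eig < m:
        raise ValueError("n_eig must satisfy 0 < n_eig < m")
    cov = np.cov(E, rowvar=False)
    sd = np.sqrt(np.diag(cov))
    R = cov / np.outer(sd, sd)                 # empirical cross-correlation matrix

    lam, V = np.linalg.eigh(R)                 # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]              # reorder to descending
    lam, V = lam[order], V[:, order]
    lam_shrink = (1.0 - nu) * lam + nu         # steps 1-2: eigenvalues of R_shrink

    c = lam_shrink[n_eig:].mean()              # average of the discarded eigenvalues
    V_top = V[:, :n_eig]                       # W in the text
    A = np.diag(lam_shrink[:n_eig] - c)
    Q = np.linalg.inv(V_top @ A @ V_top.T + c * np.eye(m))   # step 3

    D_inv_sqrt = np.diag(1.0 / sd)             # step 4: undo the correlation scaling
    return D_inv_sqrt @ Q @ D_inv_sqrt
```

A useful sanity check on the construction: when only one eigenvalue is discarded ($n_{\text{eig}}=m-1$), $c$ equals that eigenvalue and the result coincides with the inverse of the plainly shrunk covariance matrix, so the filtering only has an effect for smaller $n_{\text{eig}}$.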

Other estimators, including various shrinkage estimators and scaling variants, were initially considered in this simulation but produced results very similar to those listed. Additionally, the bottom-up approach was also examined.
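To make the role of $\hat{W}_{h}$ concrete, the sketch below contrasts a covariance-based reconciliation with bottom-up for a two-level hierarchy. It assumes the standard minimum trace projection $\tilde{\mathbf{y}}=S(S^{\prime}W^{-1}S)^{-1}S^{\prime}W^{-1}\hat{\mathbf{y}}$ with summing matrix $S$ and a forecast vector ordered top level first; the helper names and the base forecast values are hypothetical.

```python
import numpy as np

def summing_matrix(k):
    """Summing matrix of the two-level hierarchy {k, 1}: one total over k bottom values."""
    return np.vstack([np.ones((1, k)), np.eye(k)])

def mint_reconcile(y_hat, S, W_inv):
    """Minimum trace projection S (S' W^-1 S)^-1 S' W^-1 y_hat."""
    G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
    return S @ G @ y_hat

def bottom_up(y_hat, S):
    """Discard all aggregate base forecasts and sum up the bottom-level ones."""
    n_bottom = S.shape[1]
    return S @ y_hat[-n_bottom:]

S = summing_matrix(4)
# Hypothetical base forecasts: the top forecast (12) disagrees with the bottom sum (10).
y_hat = np.array([12.0, 2.0, 3.0, 1.0, 4.0])
y_ols = mint_reconcile(y_hat, S, np.eye(5))   # OLS: W proportional to the identity
y_bu = bottom_up(y_hat, S)
```

With OLS weights the incoherence of 2 is spread over all five forecasts, giving $(11.6, 2.4, 3.4, 1.4, 4.4)$, whereas bottom-up leaves the bottom forecasts untouched and returns $(10, 2, 3, 1, 4)$. Any other estimator from the list enters only through `W_inv`.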

4.1.1 One-Step Ahead

First, we compare the performance of the bottom-up approach with that of reconciliation based on the full covariance matrix. Figure 4 shows the difference of in-sample rMSE values for $h=1,k\in\{4,1\},\sigma^{2}=1$ with fixed model orders to remove model uncertainty. Both methods yield very similar improvements once the sample is large enough for the covariance matrix to be estimated reliably. The differences are driven by the top level of the hierarchy since most changes are to be expected there. Thus, the theoretical results also hold in this simulation setting. Figure 4 shows the test differences in rMSE. While the differences are larger than expected, the theoretical results still hold on the test sets, and we conclude that, in this setting, full covariance reconciliation performs equivalently to the bottom-up approach. Interestingly, most differences occur at larger values of $\phi$.

Table 1 presents the training and test rMSE values for the selected reconciliation methods and parameters, grouped by buckets of the AR parameter. This allows us to distinguish between high negative or positive correlation as well as almost random walks. We observe that most improvements occur at the top level of the hierarchy (Level 1), while at the bottom level (Level 2) the covariance-based methods achieve small in-sample gains but yield worse results out-of-sample. Overall, we notice similar improvements for the bottom-up approach compared to more sophisticated methods once the sample size is sufficiently large. Note that the highest improvements are observed for a large positive AR parameter, $\phi\in(0.5,0.9]$, across all methods.

Table 1: Mean rMSE per bucket of $\phi$ for $h=1,k\in\{4,1\},\sigma^{2}=1$ and fixed order of the used models. Columns prefixed with Train and Test give the training and test rMSE, respectively. The standard errors are given in parentheses.

Level	n	Recon. Type	Train [-0.9,-0.5]	Train (-0.5,0.5]	Train (0.5,0.9]	Test [-0.9,-0.5]	Test (-0.5,0.5]	Test (0.5,0.9]
Level 1	20	Bottom-Up	0.15 (0.02)	0.16 (0.01)	0.03 (0.02)	-0.08 (0.03)	-0.04 (0.02)	-0.11 (0.03)
Level 1	20	Full Cov.	-0.03 (0.01)	0.01 (0.01)	-0.10 (0.01)	0.11 (0.04)	0.09 (0.02)	-0.06 (0.03)
Level 1	20	Spectral	-0.03 (0.00)	0.01 (0.00)	-0.08 (0.01)	-0.02 (0.02)	-0.01 (0.01)	-0.04 (0.04)
Level 1	20	OLS	-0.01 (0.00)	0.01 (0.00)	-0.04 (0.00)	-0.06 (0.01)	-0.04 (0.00)	-0.08 (0.01)
Level 1	50	Bottom-Up	0.02 (0.01)	0.06 (0.00)	-0.08 (0.01)	-0.11 (0.01)	-0.07 (0.01)	-0.14 (0.01)
Level 1	50	Full Cov.	-0.05 (0.00)	-0.01 (0.00)	-0.12 (0.01)	-0.05 (0.01)	0.01 (0.01)	-0.13 (0.01)
Level 1	50	Spectral	-0.03 (0.00)	0.00 (0.00)	-0.11 (0.01)	-0.05 (0.01)	-0.02 (0.01)	-0.13 (0.01)
Level 1	50	OLS	-0.02 (0.00)	0.00 (0.00)	-0.05 (0.00)	-0.04 (0.00)	-0.03 (0.00)	-0.06 (0.00)
Level 1	100	Bottom-Up	-0.03 (0.00)	0.01 (0.00)	-0.11 (0.01)	-0.09 (0.01)	-0.04 (0.00)	-0.17 (0.01)
Level 1	100	Full Cov.	-0.05 (0.00)	-0.01 (0.00)	-0.13 (0.01)	-0.06 (0.01)	0.00 (0.00)	-0.15 (0.01)
Level 1	100	Spectral	-0.04 (0.00)	-0.01 (0.00)	-0.12 (0.01)	-0.05 (0.01)	-0.02 (0.00)	-0.15 (0.01)
Level 1	100	OLS	-0.02 (0.00)	0.00 (0.00)	-0.05 (0.00)	-0.04 (0.00)	-0.01 (0.00)	-0.06 (0.00)
Level 2	20	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	20	Full Cov.	-0.06 (0.01)	-0.06 (0.01)	-0.08 (0.01)	0.17 (0.02)	0.24 (0.03)	0.17 (0.03)
Level 2	20	Spectral	-0.04 (0.00)	-0.04 (0.00)	-0.05 (0.01)	0.07 (0.01)	0.07 (0.01)	0.12 (0.02)
Level 2	20	OLS	-0.01 (0.00)	-0.03 (0.00)	0.00 (0.01)	0.02 (0.00)	0.04 (0.01)	0.16 (0.03)
Level 2	50	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	50	Full Cov.	-0.02 (0.00)	-0.03 (0.00)	-0.03 (0.00)	0.05 (0.01)	0.09 (0.01)	0.04 (0.01)
Level 2	50	Spectral	-0.01 (0.00)	-0.02 (0.00)	-0.01 (0.00)	0.02 (0.01)	0.05 (0.01)	0.03 (0.01)
Level 2	50	OLS	0.00 (0.00)	-0.01 (0.00)	0.04 (0.01)	0.01 (0.00)	0.03 (0.01)	0.11 (0.01)
Level 2	100	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	100	Full Cov.	-0.01 (0.00)	-0.02 (0.00)	-0.02 (0.00)	0.02 (0.00)	0.02 (0.00)	0.02 (0.01)
Level 2	100	Spectral	-0.01 (0.00)	-0.01 (0.00)	0.00 (0.00)	0.01 (0.00)	0.01 (0.00)	0.02 (0.01)
Level 2	100	OLS	0.00 (0.00)	0.00 (0.00)	0.05 (0.00)	0.01 (0.00)	0.01 (0.00)	0.10 (0.01)
Overall	20	Bottom-Up	0.07 (0.01)	0.12 (0.01)	0.02 (0.01)	-0.08 (0.01)	-0.05 (0.01)	-0.12 (0.03)
Overall	20	Full Cov.	-0.04 (0.01)	-0.01 (0.01)	-0.11 (0.01)	0.10 (0.02)	0.10 (0.01)	-0.05 (0.02)
Overall	20	Spectral	-0.03 (0.00)	-0.01 (0.00)	-0.08 (0.01)	0.01 (0.01)	0.00 (0.01)	-0.04 (0.04)
Overall	20	OLS	-0.01 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.04 (0.00)	-0.03 (0.00)	-0.07 (0.00)
Overall	50	Bottom-Up	0.01 (0.00)	0.04 (0.00)	-0.08 (0.01)	-0.07 (0.01)	-0.06 (0.01)	-0.14 (0.01)
Overall	50	Full Cov.	-0.04 (0.00)	-0.01 (0.00)	-0.11 (0.01)	-0.01 (0.01)	0.02 (0.00)	-0.12 (0.01)
Overall	50	Spectral	-0.02 (0.00)	0.00 (0.00)	-0.10 (0.01)	-0.02 (0.00)	-0.01 (0.01)	-0.12 (0.01)
Overall	50	OLS	-0.01 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.02 (0.00)	-0.02 (0.00)	-0.05 (0.00)
Overall	100	Bottom-Up	-0.02 (0.00)	0.01 (0.00)	-0.10 (0.01)	-0.06 (0.01)	-0.03 (0.00)	-0.15 (0.01)
Overall	100	Full Cov.	-0.03 (0.00)	-0.01 (0.00)	-0.12 (0.01)	-0.03 (0.00)	0.00 (0.00)	-0.14 (0.01)
Overall	100	Spectral	-0.03 (0.00)	-0.01 (0.00)	-0.11 (0.00)	-0.02 (0.00)	-0.01 (0.00)	-0.14 (0.01)
Overall	100	OLS	-0.01 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.02 (0.00)	-0.01 (0.00)	-0.05 (0.00)

4.1.2 Deeper Hierarchy

Table 2: Mean rMSE per bucket of $\phi$ for $h=1,k\in\{12,4,1\},\sigma^{2}=1$ and fixed order of the used models. Columns prefixed with Train and Test give the training and test rMSE, respectively. The standard errors are given in parentheses.

Level	n	Recon. Type	Train [-0.9,-0.5]	Train (-0.5,0.5]	Train (0.5,0.9]	Test [-0.9,-0.5]	Test (-0.5,0.5]	Test (0.5,0.9]
Level 1	20	Bottom-Up	0.16 (0.02)	0.20 (0.01)	0.08 (0.02)	-0.05 (0.04)	-0.06 (0.02)	-0.13 (0.03)
Level 1	20	Full Cov.	-	-	-	-	-	-
Level 1	20	Spectral	-0.03 (0.01)	0.01 (0.00)	-0.07 (0.01)	-0.06 (0.01)	0.00 (0.01)	0.02 (0.05)
Level 1	20	OLS	0.00 (0.00)	0.01 (0.00)	-0.02 (0.00)	-0.05 (0.01)	-0.05 (0.00)	-0.08 (0.01)
Level 1	50	Bottom-Up	0.02 (0.01)	0.07 (0.00)	-0.04 (0.01)	-0.10 (0.01)	-0.05 (0.01)	-0.15 (0.01)
Level 1	50	Full Cov.	-	-	-	-	-	-
Level 1	50	Spectral	-0.04 (0.00)	0.00 (0.00)	-0.09 (0.01)	-0.04 (0.01)	0.00 (0.01)	-0.07 (0.02)
Level 1	50	OLS	-0.01 (0.00)	0.00 (0.00)	-0.03 (0.00)	-0.05 (0.00)	-0.02 (0.00)	-0.06 (0.00)
Level 1	100	Bottom-Up	-0.03 (0.00)	0.03 (0.00)	-0.08 (0.01)	-0.08 (0.01)	-0.04 (0.00)	-0.13 (0.01)
Level 1	100	Full Cov.	-	-	-	-	-	-
Level 1	100	Spectral	-0.04 (0.00)	0.00 (0.00)	-0.10 (0.01)	-0.04 (0.01)	-0.01 (0.00)	-0.10 (0.01)
Level 1	100	OLS	-0.02 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.03 (0.00)	-0.02 (0.00)	-0.05 (0.00)
Level 2	20	Bottom-Up	-0.02 (0.00)	0.01 (0.00)	-0.02 (0.00)	-0.05 (0.01)	-0.01 (0.00)	-0.04 (0.01)
Level 2	20	Full Cov.	-	-	-	-	-	-
Level 2	20	Spectral	-0.07 (0.00)	-0.05 (0.00)	-0.09 (0.01)	0.02 (0.01)	0.06 (0.01)	0.18 (0.05)
Level 2	20	OLS	-0.03 (0.00)	-0.03 (0.00)	-0.04 (0.01)	0.00 (0.01)	0.02 (0.01)	0.12 (0.02)
Level 2	50	Bottom-Up	-0.03 (0.00)	0.00 (0.00)	-0.02 (0.00)	-0.04 (0.00)	-0.01 (0.00)	-0.04 (0.00)
Level 2	50	Full Cov.	-	-	-	-	-	-
Level 2	50	Spectral	-0.05 (0.00)	-0.03 (0.00)	-0.05 (0.00)	0.02 (0.01)	0.03 (0.01)	0.03 (0.01)
Level 2	50	OLS	-0.02 (0.00)	-0.01 (0.00)	0.00 (0.00)	0.00 (0.00)	0.01 (0.00)	0.05 (0.01)
Level 2	100	Bottom-Up	-0.04 (0.00)	0.00 (0.00)	-0.03 (0.00)	-0.05 (0.00)	-0.01 (0.00)	-0.03 (0.00)
Level 2	100	Full Cov.	-	-	-	-	-	-
Level 2	100	Spectral	-0.04 (0.00)	-0.01 (0.00)	-0.04 (0.00)	-0.02 (0.00)	0.01 (0.00)	-0.01 (0.00)
Level 2	100	OLS	-0.02 (0.00)	-0.01 (0.00)	0.01 (0.00)	-0.01 (0.00)	0.01 (0.00)	0.04 (0.01)
Level 3	20	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 3	20	Full Cov.	-	-	-	-	-	-
Level 3	20	Spectral	-0.03 (0.00)	-0.04 (0.00)	-0.06 (0.01)	0.04 (0.01)	0.05 (0.01)	0.19 (0.04)
Level 3	20	OLS	0.00 (0.00)	-0.02 (0.00)	-0.01 (0.01)	0.01 (0.00)	0.01 (0.00)	0.14 (0.02)
Level 3	50	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 3	50	Full Cov.	-	-	-	-	-	-
Level 3	50	Spectral	-0.02 (0.00)	-0.02 (0.00)	-0.03 (0.00)	0.03 (0.01)	0.02 (0.00)	0.06 (0.01)
Level 3	50	OLS	0.00 (0.00)	-0.01 (0.00)	0.02 (0.00)	0.01 (0.00)	0.01 (0.00)	0.07 (0.01)
Level 3	100	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 3	100	Full Cov.	-	-	-	-	-	-
Level 3	100	Spectral	-0.01 (0.00)	-0.01 (0.00)	-0.01 (0.00)	0.01 (0.00)	0.01 (0.00)	0.02 (0.00)
Level 3	100	OLS	0.00 (0.00)	0.00 (0.00)	0.03 (0.00)	0.01 (0.00)	0.00 (0.00)	0.05 (0.01)
Overall	20	Bottom-Up	0.06 (0.01)	0.14 (0.01)	0.06 (0.02)	-0.09 (0.01)	-0.06 (0.01)	-0.14 (0.02)
Overall	20	Full Cov.	-	-	-	-	-	-
Overall	20	Spectral	-0.04 (0.00)	-0.01 (0.00)	-0.08 (0.01)	-0.03 (0.01)	0.00 (0.01)	0.01 (0.04)
Overall	20	OLS	-0.01 (0.00)	0.00 (0.00)	-0.03 (0.00)	-0.04 (0.00)	-0.04 (0.00)	-0.07 (0.01)
Overall	50	Bottom-Up	0.00 (0.00)	0.05 (0.00)	-0.04 (0.01)	-0.08 (0.01)	-0.04 (0.01)	-0.14 (0.01)
Overall	50	Full Cov.	-	-	-	-	-	-
Overall	50	Spectral	-0.04 (0.00)	-0.01 (0.00)	-0.09 (0.01)	-0.02 (0.01)	0.00 (0.01)	-0.06 (0.02)
Overall	50	OLS	-0.01 (0.00)	0.00 (0.00)	-0.03 (0.00)	-0.03 (0.00)	-0.02 (0.00)	-0.05 (0.00)
Overall	100	Bottom-Up	-0.02 (0.00)	0.02 (0.00)	-0.07 (0.01)	-0.06 (0.01)	-0.03 (0.00)	-0.12 (0.01)
Overall	100	Full Cov.	-	-	-	-	-	-
Overall	100	Spectral	-0.04 (0.00)	-0.01 (0.00)	-0.09 (0.01)	-0.03 (0.00)	-0.01 (0.00)	-0.10 (0.01)
Overall	100	OLS	-0.01 (0.00)	0.00 (0.00)	-0.03 (0.00)	-0.02 (0.00)	-0.01 (0.00)	-0.04 (0.00)

Table 2 displays the training and test errors for a three-level hierarchy using fixed-order models. Note that in this scenario the full covariance matrix cannot be used for reconciliation because the simple models produce a singular covariance matrix of the base forecast errors. This issue also arises with automatically selected base models. For the other methods, we observe similar improvements at the top level. Interestingly, the spectral method, which is based on dimension reduction, performs exceptionally well in-sample and yields better results than the bottom-up reconciliation method. Out-of-sample, this relationship reverses and the bottom-up approach generalizes better.

4.1.3 Multi-Step Ahead

Table 3: Mean rMSE per bucket of $\phi$ for $h=2,k\in\{4,1\},\sigma^{2}=1$ and fixed order of the used models. Columns prefixed with Train and Test give the training and test rMSE, respectively. The standard errors are given in parentheses.

Level	n	Recon. Type	Train [-0.9,-0.5]	Train (-0.5,0.5]	Train (0.5,0.9]	Test [-0.9,-0.5]	Test (-0.5,0.5]	Test (0.5,0.9]
Level 1	20	Bottom-Up	0.02 (0.01)	0.06 (0.00)	-0.11 (0.02)	-0.01 (0.02)	-0.02 (0.01)	0.02 (0.02)
Level 1	20	Full Cov.	0.00 (0.02)	0.02 (0.01)	-0.14 (0.03)	0.18 (0.05)	0.16 (0.03)	0.21 (0.05)
Level 1	20	Spectral	-0.01 (0.00)	0.00 (0.00)	-0.15 (0.02)	-0.01 (0.01)	0.00 (0.01)	0.15 (0.05)
Level 1	20	OLS	0.00 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.01 (0.00)	-0.01 (0.00)	-0.01 (0.00)
Level 1	50	Bottom-Up	-0.01 (0.00)	0.03 (0.00)	-0.14 (0.01)	-0.04 (0.01)	-0.02 (0.01)	-0.04 (0.01)
Level 1	50	Full Cov.	-0.04 (0.00)	-	-0.19 (0.01)	0.02 (0.01)	-	0.01 (0.01)
Level 1	50	Spectral	-0.02 (0.00)	0.00 (0.00)	-0.17 (0.01)	-0.02 (0.00)	-0.01 (0.01)	-0.02 (0.01)
Level 1	50	OLS	-0.01 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.01 (0.00)	-0.01 (0.00)	-0.02 (0.00)
Level 1	100	Bottom-Up	-0.02 (0.00)	0.01 (0.00)	-0.16 (0.01)	-0.02 (0.00)	-0.02 (0.00)	-0.02 (0.01)
Level 1	100	Full Cov.	-0.04 (0.00)	0.00 (0.00)	-0.22 (0.01)	0.01 (0.01)	0.01 (0.00)	0.02 (0.01)
Level 1	100	Spectral	-0.02 (0.00)	0.00 (0.00)	-0.20 (0.01)	-0.02 (0.00)	0.00 (0.00)	0.01 (0.01)
Level 1	100	OLS	-0.01 (0.00)	0.00 (0.00)	-0.05 (0.00)	-0.01 (0.00)	-0.01 (0.00)	-0.01 (0.00)
Level 2	20	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	20	Full Cov.	-0.08 (0.01)	-0.05 (0.01)	-0.06 (0.02)	0.18 (0.03)	0.23 (0.03)	0.13 (0.02)
Level 2	20	Spectral	-0.03 (0.00)	-0.02 (0.00)	-0.04 (0.01)	0.01 (0.00)	0.03 (0.01)	0.06 (0.01)
Level 2	20	OLS	0.00 (0.00)	-0.01 (0.00)	0.12 (0.02)	0.00 (0.00)	0.01 (0.00)	0.02 (0.01)
Level 2	50	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	50	Full Cov.	-0.09 (0.01)	-	-0.06 (0.00)	0.11 (0.01)	-	0.04 (0.01)
Level 2	50	Spectral	-0.03 (0.00)	-0.01 (0.00)	-0.04 (0.00)	0.02 (0.00)	0.01 (0.00)	0.02 (0.00)
Level 2	50	OLS	0.00 (0.00)	-0.01 (0.00)	0.12 (0.01)	0.00 (0.00)	0.01 (0.00)	0.03 (0.01)
Level 2	100	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	100	Full Cov.	-0.09 (0.00)	-0.01 (0.00)	-0.06 (0.00)	0.08 (0.01)	0.03 (0.00)	0.03 (0.00)
Level 2	100	Spectral	-0.04 (0.00)	0.00 (0.00)	-0.04 (0.00)	0.01 (0.00)	0.01 (0.00)	0.02 (0.00)
Level 2	100	OLS	0.00 (0.00)	0.00 (0.00)	0.14 (0.01)	0.00 (0.00)	0.00 (0.00)	0.01 (0.00)
Overall	20	Bottom-Up	0.01 (0.00)	0.04 (0.00)	-0.11 (0.02)	-0.01 (0.01)	-0.02 (0.01)	0.01 (0.02)
Overall	20	Full Cov.	-0.04 (0.01)	0.00 (0.01)	-0.14 (0.03)	0.12 (0.02)	0.16 (0.02)	0.18 (0.04)
Overall	20	Spectral	-0.02 (0.00)	0.00 (0.00)	-0.15 (0.02)	0.00 (0.00)	0.00 (0.01)	0.11 (0.04)
Overall	20	OLS	0.00 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.01 (0.00)	-0.01 (0.00)	-0.01 (0.00)
Overall	50	Bottom-Up	0.00 (0.00)	0.02 (0.00)	-0.13 (0.01)	-0.02 (0.00)	-0.02 (0.00)	-0.04 (0.01)
Overall	50	Full Cov.	-0.06 (0.00)	-	-0.19 (0.01)	0.06 (0.01)	-	0.01 (0.01)
Overall	50	Spectral	-0.03 (0.00)	0.00 (0.00)	-0.17 (0.01)	0.00 (0.00)	0.00 (0.00)	-0.02 (0.01)
Overall	50	OLS	0.00 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.01 (0.00)	-0.01 (0.00)	-0.01 (0.00)
Overall	100	Bottom-Up	-0.01 (0.00)	0.01 (0.00)	-0.15 (0.01)	-0.01 (0.00)	-0.01 (0.00)	-0.02 (0.01)
Overall	100	Full Cov.	-0.07 (0.00)	0.00 (0.00)	-0.21 (0.01)	0.04 (0.00)	0.01 (0.00)	0.02 (0.01)
Overall	100	Spectral	-0.03 (0.00)	0.00 (0.00)	-0.19 (0.01)	0.00 (0.00)	0.00 (0.00)	0.01 (0.01)
Overall	100	OLS	0.00 (0.00)	0.00 (0.00)	-0.04 (0.00)	0.00 (0.00)	0.00 (0.00)	-0.01 (0.00)

As we extend the forecast horizon, the results shift, with the bottom-up approach performing worse compared to using the full covariance matrix or even the reduced spectral-based one, as shown in Table 3. This trend holds in-sample; however, out-of-sample, the situation changes. The bottom-up method then produces the best test relative errors, as previously observed.

4.1.4 Odd Hierarchy Width

Table 4: Mean rMSE per bucket of $\phi$ for $h=1,k\in\{5,1\},\sigma^{2}=1$ and fixed order of the used models. Columns prefixed with Train and Test give the training and test rMSE, respectively. The standard errors are given in parentheses.

Level	n	Recon. Type	Train [-0.9,-0.5]	Train (-0.5,0.5]	Train (0.5,0.9]	Test [-0.9,-0.5]	Test (-0.5,0.5]	Test (0.5,0.9]
Level 1	20	Bottom-Up	0.05 (0.02)	0.17 (0.01)	0.02 (0.02)	-0.15 (0.03)	-0.09 (0.02)	-0.17 (0.02)
Level 1	20	Full Cov.	-0.10 (0.01)	0.02 (0.01)	-0.11 (0.01)	0.00 (0.04)	0.07 (0.02)	-0.07 (0.02)
Level 1	20	Spectral	-0.09 (0.01)	0.01 (0.00)	-0.08 (0.01)	-0.08 (0.02)	0.01 (0.01)	-0.11 (0.02)
Level 1	20	OLS	-0.03 (0.00)	0.00 (0.00)	-0.03 (0.00)	-0.07 (0.00)	-0.04 (0.00)	-0.07 (0.00)
Level 1	50	Bottom-Up	-0.07 (0.01)	0.06 (0.00)	-0.08 (0.01)	-0.16 (0.01)	-0.07 (0.01)	-0.15 (0.02)
Level 1	50	Full Cov.	-0.12 (0.01)	0.00 (0.01)	-0.12 (0.01)	-0.14 (0.01)	0.02 (0.01)	-0.12 (0.01)
Level 1	50	Spectral	-0.10 (0.01)	0.00 (0.00)	-0.10 (0.01)	-0.11 (0.01)	-0.02 (0.01)	-0.10 (0.02)
Level 1	50	OLS	-0.04 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.06 (0.00)	-0.02 (0.00)	-0.05 (0.00)
Level 1	100	Bottom-Up	-0.11 (0.01)	0.02 (0.00)	-0.12 (0.01)	-0.15 (0.01)	-0.05 (0.00)	-0.13 (0.01)
Level 1	100	Full Cov.	-0.13 (0.01)	-0.01 (0.00)	-0.13 (0.01)	-0.14 (0.01)	-0.01 (0.00)	-0.12 (0.01)
Level 1	100	Spectral	-0.11 (0.01)	0.00 (0.00)	-0.13 (0.01)	-0.12 (0.01)	-0.02 (0.00)	-0.12 (0.01)
Level 1	100	OLS	-0.04 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.05 (0.00)	-0.01 (0.00)	-0.05 (0.00)
Level 2	20	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	20	Full Cov.	-0.06 (0.00)	-0.06 (0.01)	-0.09 (0.01)	0.16 (0.02)	0.21 (0.02)	0.24 (0.03)
Level 2	20	Spectral	-0.03 (0.00)	-0.04 (0.00)	-0.05 (0.01)	0.07 (0.01)	0.11 (0.02)	0.16 (0.03)
Level 2	20	OLS	0.00 (0.00)	-0.03 (0.00)	0.01 (0.01)	0.02 (0.00)	0.05 (0.01)	0.21 (0.03)
Level 2	50	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	50	Full Cov.	-0.02 (0.00)	-0.03 (0.00)	-0.03 (0.00)	0.03 (0.00)	0.07 (0.01)	0.05 (0.01)
Level 2	50	Spectral	-0.01 (0.00)	-0.02 (0.00)	-0.01 (0.00)	0.01 (0.00)	0.03 (0.01)	0.06 (0.01)
Level 2	50	OLS	0.00 (0.00)	-0.01 (0.00)	0.04 (0.01)	0.01 (0.00)	0.02 (0.00)	0.12 (0.01)
Level 2	100	Bottom-Up	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)	0.00 (0.00)
Level 2	100	Full Cov.	-0.01 (0.00)	-0.02 (0.00)	-0.01 (0.00)	0.01 (0.00)	0.03 (0.00)	0.01 (0.00)
Level 2	100	Spectral	-0.01 (0.00)	-0.01 (0.00)	0.00 (0.00)	0.01 (0.00)	0.01 (0.00)	0.01 (0.00)
Level 2	100	OLS	0.00 (0.00)	0.00 (0.00)	0.06 (0.00)	0.01 (0.00)	0.01 (0.00)	0.08 (0.01)
Overall	20	Bottom-Up	0.02 (0.01)	0.13 (0.01)	0.02 (0.02)	-0.13 (0.02)	-0.10 (0.01)	-0.17 (0.02)
Overall	20	Full Cov.	-0.09 (0.01)	0.00 (0.01)	-0.11 (0.01)	0.02 (0.02)	0.07 (0.01)	-0.06 (0.02)
Overall	20	Spectral	-0.07 (0.01)	0.00 (0.00)	-0.08 (0.01)	-0.04 (0.01)	0.01 (0.01)	-0.10 (0.02)
Overall	20	OLS	-0.02 (0.00)	0.00 (0.00)	-0.03 (0.00)	-0.04 (0.00)	-0.03 (0.00)	-0.06 (0.00)
Overall	50	Bottom-Up	-0.05 (0.01)	0.04 (0.00)	-0.08 (0.01)	-0.11 (0.01)	-0.06 (0.01)	-0.15 (0.01)
Overall	50	Full Cov.	-0.08 (0.00)	-0.01 (0.01)	-0.12 (0.01)	-0.08 (0.01)	0.02 (0.01)	-0.11 (0.01)
Overall	50	Spectral	-0.07 (0.00)	0.00 (0.00)	-0.10 (0.01)	-0.07 (0.01)	-0.02 (0.00)	-0.09 (0.02)
Overall	50	OLS	-0.02 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.03 (0.00)	-0.02 (0.00)	-0.05 (0.00)
Overall	100	Bottom-Up	-0.07 (0.00)	0.02 (0.00)	-0.11 (0.01)	-0.10 (0.01)	-0.04 (0.00)	-0.13 (0.01)
Overall	100	Full Cov.	-0.08 (0.00)	-0.01 (0.00)	-0.13 (0.01)	-0.08 (0.01)	0.00 (0.00)	-0.12 (0.01)
Overall	100	Spectral	-0.07 (0.00)	0.00 (0.00)	-0.12 (0.01)	-0.07 (0.01)	-0.01 (0.00)	-0.11 (0.01)
Overall	100	OLS	-0.02 (0.00)	0.00 (0.00)	-0.04 (0.00)	-0.03 (0.00)	-0.01 (0.00)	-0.04 (0.00)

So far, we have only considered even hierarchy widths such as $\{4,1\}$ or $\{12,4,1\}$. These even aggregations result in a non-negative AR parameter at the top level, even if the bottom-level model is generated with a negative one. Table 4 shows the training and test relative errors for the odd-width hierarchy $\{5,1\}$. We observe that for a negative AR parameter, the overall improvements are much larger. In-sample, covariance-based methods still perform better for small sample sizes, with the difference becoming marginal for larger sample sizes. However, the bottom-up method yields better results on the test set.
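The sign effect can be made explicit with a short worked example. It assumes the standard temporal aggregation result that summing $k$ consecutive observations of an AR(1) process with parameter $\phi$ yields an ARMA(1,1) process whose AR parameter is $\phi^{k}$; the notation $\phi^{[k]}$ for this aggregated parameter is introduced here for illustration only.

$$\phi^{[k]}=\phi^{k},\qquad \phi=-0.8:\quad \phi^{[4]}=(-0.8)^{4}=0.4096>0,\qquad \phi^{[5]}=(-0.8)^{5}=-0.32768<0.$$

For even $k$ the sign of $\phi$ is lost at the top level, whereas for odd $k$ a negative bottom-level parameter remains negative after aggregation.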

4.2 ARMA Models of Higher Order

For more complex models such as $\text{ARMA}(2,2)$ and its aggregates, computing the covariance matrices of forecast errors becomes very tedious. Therefore, we focus on experimental evaluation for these cases to investigate if the implications of Theorem 1 still hold.

As the complexity of an ARMA model increases, identifying the parameter space that yields stationary models becomes non-trivial. It is particularly challenging to define stationary parameter combinations for $p,q>2$. To address this, we randomly draw stationary parameters using the partial autocorrelation function as described by Jones (1987).
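The idea behind this parametrization can be sketched as follows: any vector of partial autocorrelations in $(-1,1)$ maps to a stationary AR coefficient vector through the Durbin-Levinson recursion. The sketch below is an illustration only. In particular, the uniform sampling of the partial autocorrelations is a simplifying assumption and need not match the distribution used by Jones (1987), and all function names are hypothetical.

```python
import numpy as np

def pacf_to_coefs(pacf):
    """Map partial autocorrelations in (-1, 1) to AR coefficients (Durbin-Levinson)."""
    coefs = np.array([], dtype=float)
    for pk in pacf:
        # phi_j^(k) = phi_j^(k-1) - pk * phi_{k-j}^(k-1) for j < k, and phi_k^(k) = pk
        coefs = np.append(coefs - pk * coefs[::-1], pk)
    return coefs

def draw_stationary_ar(p, rng=None):
    """Draw AR(p) coefficients that are stationary by construction."""
    rng = np.random.default_rng(rng)
    pacf = rng.uniform(-1.0, 1.0, size=p)  # assumption: uniform draws
    return pacf_to_coefs(pacf)

def is_stationary(phi):
    """Check that all roots of z^p - phi_1 z^(p-1) - ... - phi_p lie inside the unit circle."""
    roots = np.roots(np.r_[1.0, -np.asarray(phi, dtype=float)])
    return bool(np.all(np.abs(roots) < 1.0))

phi = pacf_to_coefs([0.5, -0.3])  # AR(2) coefficients (0.65, -0.3)
```

Applying the same mapping to a second set of partial autocorrelations gives invertible MA coefficients, up to the sign convention used for the MA polynomial.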

For each combination of $p\in\{1,2\}$ and $q\in\{0,1,2\}$, we randomly draw $100$ sets of parameters $\{\phi_{1},\dots,\phi_{p},\theta_{1},\dots,\theta_{q}\}$. To mitigate the randomness of each realization, we further simulate $20$ time series for each of the $100$ random parameter sets.

Figure 5 shows the in-sample rMSE values for the full covariance estimator as well as the bottom-up approach for various sample sizes of the top level. The setting is $h=1,k\in\{4,1\}$ and $\sigma^{2}=1$ with fixed-order models. As in the $\text{AR}(1)$ case with varying AR parameters, we observe equivalent reconciliation performance at larger sample sizes for every $\text{ARMA}(p,q)$ model considered. For small sample sizes, bottom-up performs worse than the full covariance method, and this gap widens with increasing model complexity. We also observe that the full covariance method can produce better forecasts on the bottom level. This improvement also increases with the complexity of the bottom-level base model. Overall, the MA order does not seem as impactful as the AR order.

Figure 6 shows the test errors for the same setting. As in the simple $\text{AR}(1)$ case, the roles of bottom-up and the full covariance matrix estimator switch, and the bottom-up approach performs better the more complex the bottom-level base model is.

In this analysis, we aggregate over the whole space of stationary models of a certain order. Hence, we also take a more detailed look at the performance of base models with two parameters. Figure 7 shows the mean training rMSE differences between the full covariance-based reconciliation and the bottom-up approach for the randomly drawn stationary $\text{AR}(2)$ models. Based on this plot, there is no clear pattern in performance across the region of stationary parameters. Test errors are available in the Appendix in Figure 14.

Similarly, Figure 8 shows the training mean rMSE differences for $\text{ARMA}(1,1)$ models. Test errors are available in the Appendix in Figure 15.

5 Real Data Applications

5.1 A&E Emergency Service Demand

Following the data example of Athanasopoulos et al. (2017), we illustrate our approach on the Accident & Emergency Service Demand dataset, available from the thief package in R. In this dataset, a number of demand statistics of A&E departments are recorded on a weekly basis from 2010-11-07 to 2015-06-07.

Before any modeling, we perform some preprocessing. To ensure complete observations for the hierarchy, we remove the incomplete years $2010$ and $2015$, resulting in $208$ weeks of data. Next, we decompose the weekly time series of interest into seasonal, trend, and remainder components using the stl function in R, and remove the seasonal component. For interpretability, we also demean the resulting non-seasonal weekly time series.

We analyze the Total Attendances time series and aggregate it on a monthly basis, resulting in a small hierarchy with $52$ months of data. The training data consists of the first $41$ months, or $164$ weeks, with the remaining data designated as test data. As before, we focus on cumulative one-step-ahead forecasts at the top level of the hierarchy, which in this case are month-by-month forecasts. Using automated model selection, the chosen models are $\text{ARIMA}(0,0,0)$ and $\text{ARIMA}(1,1,1)$ for the monthly and weekly series, respectively.

To stay within the framework of temporally aggregated ARIMA models, we fix the orders of the used models accordingly. This yields an $\text{ARIMA}(1,1,2)$ model for the monthly time series. The resulting model on the top level gives an AICc value of $406.47$, which is only around $0.6\%$ worse than the automatically selected model, hence it still seems like an appropriate model. Table 5 shows the corresponding errors. We observe better generalizability of the bottom-up approach compared to using the full covariance matrix. The spectral method performs quite well out-of-sample, leading to results similar to those of the bottom-up approach. As before, each covariance-based reconciliation method produces worse forecasts than the base forecasts on the test set for the bottom-level time series.

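Table 5: Results for A&E Total Attendances in units of $\text{People}^{2}$ and Wool Production in units of $(100~\text{tonnes})^{2}$ with fixed-order models.

Dataset	Level	Training Base MSE	Test Base MSE	Recon. Type	Training Recon. MSE	Test Recon. MSE	Training rMSE	Test rMSE
A&E Total Attendances	Annual	1125.60	2166.15	Bottom-Up	1219.43	1981.83	0.08	-0.09
A&E Total Attendances	Annual	1125.60	2166.15	Full Cov.	1124.78	2222.40	0.00	0.03
A&E Total Attendances	Annual	1125.60	2166.15	Spectral	1169.64	2021.56	0.04	-0.07
A&E Total Attendances	Annual	1125.60	2166.15	OLS	1132.26	2112.09	0.01	-0.02
A&E Total Attendances	Quarterly	150.19	170.12	Bottom-Up	150.19	170.12	0.00	0.00
A&E Total Attendances	Quarterly	150.19	170.12	Full Cov.	148.26	186.73	-0.01	0.10
A&E Total Attendances	Quarterly	150.19	170.12	Spectral	148.44	171.55	-0.01	0.01
A&E Total Attendances	Quarterly	150.19	170.12	OLS	147.74	176.55	-0.02	0.04
A&E Total Attendances	Overall	1275.79	2336.26	Bottom-Up	1369.62	2151.95	0.07	-0.08
A&E Total Attendances	Overall	1275.79	2336.26	Full Cov.	1273.04	2409.12	0.00	0.03
A&E Total Attendances	Overall	1275.79	2336.26	Spectral	1318.08	2193.11	0.03	-0.06
A&E Total Attendances	Overall	1275.79	2336.26	OLS	1280.00	2288.64	0.00	-0.02
Wool Production	Annual	149.18	131.83	Bottom-Up	156.23	293.50	0.05	1.23
Wool Production	Annual	149.18	131.83	Full Cov.	119.19	330.31	-0.20	1.51
Wool Production	Annual	149.18	131.83	Spectral	134.30	200.81	-0.10	0.52
Wool Production	Annual	149.18	131.83	OLS	141.12	146.46	-0.05	0.11
Wool Production	Biannual	75.42	37.98	Bottom-Up	80.55	80.23	0.07	1.11
Wool Production	Biannual	75.42	37.98	Full Cov.	50.70	87.30	-0.33	1.30
Wool Production	Biannual	75.42	37.98	Spectral	59.84	53.83	-0.21	0.42
Wool Production	Biannual	75.42	37.98	OLS	66.77	40.65	-0.11	0.07
Wool Production	Quarterly	24.59	23.54	Bottom-Up	24.59	23.54	0.00	0.00
Wool Production	Quarterly	24.59	23.54	Full Cov.	16.43	24.89	-0.33	0.06
Wool Production	Quarterly	24.59	23.54	Spectral	19.11	16.60	-0.22	-0.29
Wool Production	Quarterly	24.59	23.54	OLS	21.15	13.64	-0.14	-0.42
Wool Production	Overall	249.19	193.35	Bottom-Up	261.37	397.27	0.05	1.05
Wool Production	Overall	249.19	193.35	Full Cov.	186.32	442.49	-0.25	1.29
Wool Production	Overall	249.19	193.35	Spectral	213.25	271.25	-0.14	0.40
Wool Production	Overall	249.19	193.35	OLS	229.04	200.75	-0.08	0.04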

Figure 9 shows the transformed time series as well as the base and reconciled forecasts, split by training and test set for the bottom-up and full covariance approach.

5.2 Wool Production

Another popular dataset is the woolyrnq dataset, available from the forecast package in R. It is about the quarterly production of woolen yarn in Australia, given in units of tonnes from March $1965$ to September $1994$ . We aggregate the data to biannual as well as annual frequency yielding a $3$ -level hierarchy with $k\in\{4,2,1\}$ . In order to have complete observations we remove the partially observed last year $1994$ . This then gives us $116$ quarters, $58$ half-years as well as $29$ years of data. As previously, we split the data into $80\%$ training data leading to $23$ training years.

In contrast to the A&E data, we do not perform any preprocessing besides de-meaning for interpretability purposes. A seasonality decomposition such as stl is not suitable for the annual time series, hence we do not perform it at all.

Table 5 presents the results for fixed order models. According to AICc, the most suitable model for the quarterly time series is an $\text{ARIMA}(3,1,2)$ model, which is already quite complex. The theory of aggregated ARIMA models then gives us $\text{ARIMA}(3,1,3)$ and $\text{ARIMA}(3,1,4)$ models for the biannual and annual time series, respectively. Despite the relatively small sample sizes for the biannual and annual data, these high-complexity models do not seem to suffer from overfitting. Using automated model selection, the corresponding models would be $\text{ARIMA}(0,1,0)$ and $\text{ARIMA}(1,1,1)$ , respectively, which produce very similar results. Therefore, we only present the results for the fixed-order case.
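The fixed orders quoted for the aggregated series can be reproduced with a small helper. It assumes the standard order result for temporally aggregated ARIMA processes, namely that aggregating an $\text{ARIMA}(p,d,q)$ series over $m$ periods gives an $\text{ARIMA}(p,d,r)$ series with $r=\lfloor((p+d+1)(m-1)+q)/m\rfloor$; the function name is hypothetical.

```python
def aggregated_arima_order(p, d, q, m):
    """Order (p, d, r) of an ARIMA(p, d, q) series after summing m consecutive periods.

    Assumes the standard temporal aggregation result, where the MA order becomes
    r = floor(((p + d + 1) * (m - 1) + q) / m).
    """
    r = ((p + d + 1) * (m - 1) + q) // m
    return (p, d, r)

# Wool production: quarterly ARIMA(3,1,2) aggregated to biannual (m=2) and annual (m=4).
biannual = aggregated_arima_order(3, 1, 2, 2)
annual = aggregated_arima_order(3, 1, 2, 4)
# A&E data: weekly ARIMA(1,1,1) aggregated to four-week months (m=4).
monthly = aggregated_arima_order(1, 1, 1, 4)
```

Under this assumption the helper returns $(3,1,3)$, $(3,1,4)$, and $(1,1,2)$, matching the orders used in Sections 5.2 and 5.1. The same rule maps a bottom-level $\text{AR}(1)$ model to an $\text{ARMA}(1,1)$ aggregate.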

Nevertheless, we observe similar effects as with the A&E data. The bottom-up approach performs worse on the training data compared to covariance-based reconciliation methods. On the test data, both the bottom-up approach and the full covariance method exhibit poor generalization, while the spectral and OLS methods perform better. Notably, the full covariance method generalizes even worse than the bottom-up approach, a consistent finding across all data examples and simulations.

Figure 10 shows the transformed time series as well as the base and reconciled forecasts, split by training and test set for the bottom-up and full covariance approach.

5.3 Additional Datasets

We run experiments on some additional datasets and give an overall summary of the results. Based on the forecasting literature, especially hierarchical forecast reconciliation, we select the following $5$ datasets.

• Energy (Panagiotelis et al., 2023): Daily electricity generation per source, available from the author’s GitHub repository (https://github.com/PuwasalaG/Probabilistic-Forecast-Reconciliation).

• Food (Neubauer and Filzmoser, 2024): Daily data from smart fridges with the goal of forecasting the demand for each fridge for the upcoming week in a one-step-ahead fashion.

• M3 (Makridakis and Hibon, 2000): Quarterly data of the M3 competition. The data was obtained from the R package Mcomp (Hyndman, 2018).

• Prison (Hyndman and Athanasopoulos, 2018): Quarterly data about Australian prison population per state.

• Tourism (Wickramasuriya et al., 2019; Girolimetto et al., 2023): Monthly data about visitor nights in Australian districts, taken from GitHub (https://github.com/daniGiro/ctprob).

This selection of datasets covers a wide range of frequencies and domains, summarised in Table 6. To ensure a non-singular covariance matrix estimate, which is required for the full covariance reconciliation method, we maintain a relatively low order of aggregation. Specifically, we aggregate the energy data into weekly data, the M3 data into annual data, and so on. For each time series, we hold out $20\%$ of the data as test data. Table 7 presents the training and test rMSE values for the selected reconciliation methods, summarized by trimmed means and corresponding standard errors. This aggregated presentation does not provide much insight into the underlying dynamics, but some patterns are visible. In-sample, the full covariance method performs well, but it does not generalize effectively. The bottom-up approach does not produce the best results on the training data and, contrary to the simulations, also yields sub-optimal forecasts on the test data. Comparing the two approaches, we observe that the full covariance method generalizes worse than the bottom-up method, confirming our simulation findings. Finally, the more sophisticated approach of utilizing the spectral decomposition performs well out-of-sample.

Table 6: Dataset properties. $N$ denotes the number of total time series in the dataset, and $n_{\text{top}},n_{\text{bottom}}$ give the range of the available lengths in the hierarchy given by $k$.

Dataset	$N$	$n_{\text{top}}$	$n_{\text{bottom}}$	$k\in$
Energy	23	51-51	357-357	$\{7,1\}$
Food	122	7-107	35-535	$\{5,1\}$
M3	756	8-18	32-72	$\{4,1\}$
Prison	8	12-12	48-48	$\{4,1\}$
Tourism	525	76-76	228-228	$\{3,1\}$

Table 7: $10\%$-trimmed overall means for $5$ datasets and selected reconciliation methods. Columns prefixed with Train and Test give the training and test rMSE, respectively. The standard errors are available in parentheses.

Dataset	Train Bottom-Up	Train Full Cov.	Train OLS	Train Spectral	Test Bottom-Up	Test Full Cov.	Test OLS	Test Spectral
Energy	-0.03 (0.02)	-0.06 (0.01)	-0.02 (0.00)	-0.06 (0.01)	-0.02 (0.05)	-0.02 (0.05)	-0.02 (0.00)	-0.04 (0.03)
Food	0.04 (0.01)	-0.03 (0.01)	-0.01 (0.00)	-0.01 (0.00)	0.04 (0.02)	0.01 (0.01)	-0.01 (0.00)	0.00 (0.01)
M3	-0.17 (0.02)	-0.28 (0.01)	-0.11 (0.00)	-0.27 (0.01)	-0.09 (0.03)	-0.13 (0.03)	-0.11 (0.01)	-0.19 (0.02)
Prison	-0.18 (0.12)	-0.12 (0.17)	-0.11 (0.02)	0.00 (0.18)	-0.40 (0.12)	-0.30 (0.14)	-0.14 (0.03)	-0.01 (0.20)
Tourism	0.03 (0.00)	-0.05 (0.00)	0.00 (0.00)	-0.01 (0.00)	0.01 (0.01)	-0.02 (0.01)	-0.01 (0.00)	-0.01 (0.00)

We conduct an accuracy ranking based on the multiple comparisons with the best (MCB) test, introduced by Koning et al. (2005), for each dataset, divided into training and test data. Figure 11 clearly demonstrates the statistically superior performance of the full covariance method compared to the bottom-up approach in-sample, while the performance difference becomes practically negligible on the test data, consistent with our theory and simulations.

Additionally, Figure 12 presents percentile plots comparing the four different approaches. These plots further illustrate that while the full covariance method performs well in-sample, its performance significantly deteriorates out-of-sample. Specifically, on the training data, more forecasts are improved by full covariance reconciliation, but this relationship largely reverses on the test data.

6 Conclusions

In this paper, we explored the theoretical implications of applying the minimum trace reconciliation method within the context of temporal hierarchies. By examining temporally aggregated ARMA models, we demonstrated that the optimal reconciliation method, when based on the true covariance matrix, is equivalent to a bottom-up approach. Our extensive simulation studies tested this theory across various scenarios involving different model complexities, hierarchy structures, and levels of uncertainty. The findings support our theory, indicating that the bottom-up method is a viable approach. This aligns with numerous literature findings where the bottom-up approach consistently produces useful results in suitable settings.

The simulation results also reveal that in-sample, covariance-based minimum trace reconciliation methods outperform the simple bottom-up approach. However, this relationship reverses out-of-sample, with the bottom-up approach generalizing better on the test data compared to the full covariance matrix across simulations and data examples. Further research is necessary to understand why this effect occurs so markedly. Additionally, other estimators were tested and showed improved performance over the full covariance matrix in certain settings, highlighting the potential for further research on new temporal hierarchical covariance estimators in the minimum trace approach.

Overall, our work contributes to the field of temporal forecast reconciliation by linking it to temporally aggregated ARMA models. We have theoretically established that the bottom-up approach is the optimal reconciliation method and reinforced this with comprehensive simulation studies and data illustrations. This supports the use of the bottom-up method in both theoretical and practical applications.

Computational details

The simulations and data examples were carried out in R 4.3.0. The corresponding source code of this paper in the form of an R package is available from GitHub at https://github.com/neubluk/FTATS. For convenience, all datasets except the M3 dataset are included in the package.

Declaration of Generative AI and AI-assisted technologies in the writing process

During the preparation of this work the authors used ChatGPT in order to improve readability and language. After using this tool/service, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.

Acknowledgments and Disclosure of Funding

We acknowledge support from the Austrian Research Promotion Agency (FFG), Basisprogramm project “Meal Demand Forecast” and Schrankerl GmbH for the cooperation and access to their data. We further acknowledge funding from the Austrian Science Fund (FWF) for the project “High-dimensional statistical learning: New methods to advance economic and sustainability policies” (ZK 35), jointly carried out by WU Vienna University of Economics and Business, Paris Lodron University Salzburg, TU Wien, and the Austrian Institute of Economic Research (WIFO).

Appendix A Calculations and Proofs

As in Silvestrini and Veredas, (2008), we illustrate this framework based on an $\text{AR}(1)$ model. Let $y_{t}\sim\text{AR}(1)$ be centered at $0$ with AR parameter $\phi\in(-1,1)$ and innovation variance $\sigma^{2}$. According to Eq. (10) we obtain $y_{T}^{\ast}\sim\text{ARMA}(1,1)$ for any $k>1$ with AR parameter $\beta=\phi^{k}$. The MA parameter $\eta$ as well as the noise variance $\sigma_{\ast}^{2}$ are computed as follows.

For lags $0,1$ we compute the autocovariances of $(1+\eta B)\epsilon^{\ast}_{T}$ with $B=L^{k}$ and $T(L)\epsilon_{t}$ with the aggregation polynomial $T(L)$ given by

	$\displaystyle T(L)$	$\displaystyle=\frac{1-\delta^{k}L^{k}}{1-\delta L}\frac{1-L^{k}}{1-L}$		(14)
		$\displaystyle=\sum_{i=0}^{k-1}\delta^{i}L^{i}\sum_{j=0}^{k-1}L^{j},$		(15)

with $\delta=\phi$ being the inverse root of the corresponding AR polynomial $1-\phi L$ and $L$ being the lag operator such that $Ly_{t}=y_{t-1}$.

Because the MA order is $1$, all autocovariances at lags greater than $1$ are zero. First note that

$\displaystyle T(L)\epsilon_{t}=(1,\phi,\dots,\phi^{k-1})\underbrace{\begin{pmatrix}1&\dots&\dots&\dots&1&0&\dots&\dots&0\\ 0&1&\dots&\dots&\vdots&1&0&\dots&0\\ \vdots&\ddots&\ddots&\ddots&\vdots&\vdots&\ddots&\ddots&\vdots\\ \vdots&\ddots&\ddots&\ddots&\vdots&\vdots&\ddots&\ddots&\vdots\\ 0&\dots&\dots&0&1&1&\dots&\dots&1\end{pmatrix}}_{=A}\begin{pmatrix}\epsilon_{t}\\ \vdots\\ \epsilon_{t-(2k-2)}\end{pmatrix},$

where the first $k$ columns of $A$ form a $k\times k$ block and the remaining $k-1$ columns form a $k\times(k-1)$ block.

Next, we set up the equations based on the auto-correlation functions to determine $\eta$ and $\sigma_{\ast}^{2}$ .

To this end, the variances are computed to be

	$\displaystyle\gamma^{\ast}(0)$	$\displaystyle=\text{Var}((1+\eta B)\epsilon^{\ast}_{T})$
		$\displaystyle=(1+\eta^{2})\sigma_{\ast}^{2},$		(16)

which must be equal to

$\displaystyle\gamma(0)$	$\displaystyle=\text{Var}(T(L)\epsilon_{t})$
	$\displaystyle=\sigma^{2}(1,\phi,\dots,\phi^{k-1})AA^{\prime}(1,\phi,\dots,\phi^{k-1})^{\prime}$
	$\displaystyle=\sigma^{2}\left(\sum_{j=0}^{k-1}\left(\sum_{i=0}^{j}\phi^{i}\right)^{2}+\sum_{j=1}^{k-1}\left(\sum_{i=j}^{k-1}\phi^{i}\right)^{2}\right).$	(17)

Similarly, the lag $1$ auto-covariances are

	$\displaystyle\gamma^{\ast}(1)$	$\displaystyle=\text{Cov}((1+\eta B)\epsilon^{\ast}_{T},(1+\eta B)\epsilon^{\ast}_{T-1})$
		$\displaystyle=\eta\sigma_{\ast}^{2},$		(18)

which must be equal to

$\displaystyle\gamma(1)$	$\displaystyle=\text{Cov}(T(L)\epsilon_{t},T(L)\epsilon_{t-k})$
	$\displaystyle=\sigma^{2}(1,\phi,\dots,\phi^{k-1})ACA^{\prime}(1,\phi,\dots,\phi^{k-1})^{\prime}$
	$\displaystyle=\sigma^{2}\left(\sum_{j=1}^{k-1}\left(\sum_{i=j}^{k-1}\phi^{i}\sum_{l=0}^{j-1}\phi^{l}\right)\right),$	(19)

where

	$\displaystyle C$	$\displaystyle=\frac{1}{\sigma^{2}}\text{Cov}\left((\epsilon_{t},\dots,\epsilon_{t-(2k-2)})^{\prime},(\epsilon_{t-k},\dots,\epsilon_{t-k-(2k-2)})^{\prime}\right)$
		$\displaystyle=\begin{pmatrix}0_{k\times(k-1)}&0_{k\times k}\\ I_{k-1}&0_{(k-1)\times k}\end{pmatrix}.$

Solving the system of equations $\gamma(0)=\gamma^{\ast}(0),\gamma(1)=\gamma^{\ast}(1)$ using Eqs. (16)-(19) yields

	$\displaystyle\sigma_{\ast}^{2}$	$\displaystyle=\sigma^{2}\frac{(1,\phi,\dots,\phi^{k-1})AA^{\prime}(1,\phi,\dots,\phi^{k-1})^{\prime}}{1+\eta^{2}}$
	$\displaystyle\eta$	$\displaystyle=(1+\eta^{2})\rho_{1},$

where $\rho_{1}=\frac{\gamma(1)}{\gamma(0)}=\frac{\gamma^{\ast}(1)}{\gamma^{\ast}(0)}$ denotes the auto-correlation value at lag $1$ .
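
As a numerical illustration of this system, the following Python sketch computes $\beta$, $\eta$ and $\sigma_{\ast}^{2}$ for a temporally aggregated AR(1). It is not taken from the accompanying R package, and the function names are hypothetical. The autocovariances are obtained directly from the coefficients of $T(L)$ rather than from closed-form sums. Of the two solutions of $\eta=(1+\eta^{2})\rho_{1}$, the sketch selects the invertible one with $|\eta|\leq 1$, which is an assumption about the intended root.

```python
import math

def aggregation_coefficients(phi, k):
    """Coefficients of T(L) = (sum_i phi^i L^i)(sum_j L^j) for lags 0..2k-2."""
    coeffs = [0.0] * (2 * k - 1)
    for i in range(k):
        for j in range(k):
            coeffs[i + j] += phi ** i
    return coeffs

def aggregated_ar1(phi, sigma2, k):
    """Return (beta, eta, sigma2_star) of the ARMA(1,1) aggregate of an AR(1)."""
    c = aggregation_coefficients(phi, k)
    # Lag-0 and lag-k autocovariances of T(L) eps_t on the disaggregated scale.
    gamma0 = sigma2 * sum(x * x for x in c)
    gamma1 = sigma2 * sum(c[m] * c[m + k] for m in range(k - 1))
    rho1 = gamma1 / gamma0
    if rho1 == 0.0:
        eta = 0.0
    else:
        # Invertible root of rho1 * eta^2 - eta + rho1 = 0 (assumed choice).
        eta = (1.0 - math.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)
    sigma2_star = gamma0 / (1.0 + eta ** 2)
    return phi ** k, eta, sigma2_star

beta, eta, sigma2_star = aggregated_ar1(phi=0.5, sigma2=1.0, k=2)
print(beta, eta, sigma2_star)
```

For $k=2$ and $\phi=0.5$ the coefficients of $T(L)$ are $(1, 1.5, 0.5)$, so $\gamma(0)=3.5\sigma^{2}$ and $\gamma(1)=0.5\sigma^{2}$, which gives $\rho_{1}=1/7$.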

Proof of Lemma 1.

First, we compute the $h$ -step forecasts of the disaggregated series for $h=1,\dots,k$ . For the $\text{AR}(1)$ process this can be done recursively and we obtain residuals given by

	$\displaystyle e_{t}^{(h)}=\sum_{i=0}^{h-1}\phi^{i}\epsilon_{t+h-i}.$		(20)

The corresponding pairwise covariances are quickly computed for $h_{1}\leq h_{2}$ by

	$\displaystyle\text{Cov}\left(e_{t}^{(h_{1})},e_{t}^{(h_{2})}\right)$	$\displaystyle=\sigma^{2}\sum_{l=0}^{h_{1}-1}\phi^{h_{2}-h_{1}+2l}$		(21)
		$\displaystyle=\sigma^{2}\phi^{h_{2}-h_{1}}\frac{1-\phi^{2h_{1}}}{1-\phi^{2}},$		(22)

hence for $\mathbf{e}_{t}=\left(e_{t}^{(1)},\dots,e_{t}^{(k)}\right)^{\prime}$ we obtain the covariance matrix on the bottom level $\text{Cov}(\mathbf{e}_{t})=\sigma^{2}\Phi\Phi^{\prime}$ .
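
The closed form in Eq. (22) can be checked numerically against the matrix product $\sigma^{2}\Phi\Phi^{\prime}$. The Python sketch below is only an illustration with hypothetical function names. It assumes that $\Phi$ is the lower-triangular $k\times k$ matrix with entry $\phi^{h-i}$ in row $h$ and column $i\leq h$, which is the form implied by Eq. (20) when the innovations are ordered as $(\epsilon_{t+1},\dots,\epsilon_{t+k})$.

```python
def phi_matrix(phi, k):
    """Assumed form of Phi: lower triangular, entry phi^(h-i) for column i <= row h."""
    return [[phi ** (h - i) if i <= h else 0.0 for i in range(k)] for h in range(k)]

def error_covariance(phi, sigma2, k):
    """sigma^2 * Phi Phi' for the 1..k step-ahead AR(1) forecast errors."""
    P = phi_matrix(phi, k)
    return [[sigma2 * sum(P[a][l] * P[b][l] for l in range(k)) for b in range(k)]
            for a in range(k)]

def closed_form(phi, sigma2, h1, h2):
    """Eq. (22), valid for 1 <= h1 <= h2."""
    return sigma2 * phi ** (h2 - h1) * (1 - phi ** (2 * h1)) / (1 - phi ** 2)

phi, sigma2, k = 0.6, 2.0, 4
cov = error_covariance(phi, sigma2, k)
# Largest absolute difference between the matrix entries and Eq. (22).
max_gap = max(abs(cov[h1 - 1][h2 - 1] - closed_form(phi, sigma2, h1, h2))
              for h1 in range(1, k + 1) for h2 in range(h1, k + 1))
print(max_gap)
```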

For $y^{\ast}_{T}$ we perform a $1$-step forecast, thus ${e^{\ast}_{T}}^{(1)}=\epsilon^{\ast}_{T+1}$ with $\text{Var}({e^{\ast}_{T}}^{(1)})=\sigma_{\ast}^{2}$. To compute $\text{Cov}({e^{\ast}_{T}}^{(1)},e_{tk}^{(h)})$, we proceed as follows. First, write $\epsilon^{\ast}_{T+1}=y^{\ast}_{T+1}-\beta y^{\ast}_{T}-\eta\epsilon^{\ast}_{T}$. Since $y^{\ast}_{T}$ and $\epsilon^{\ast}_{T}$ depend only on innovations up to time $tk$, only $y^{\ast}_{T+1}=\sum_{i=0}^{k-1}y_{tk+k-i}$ contributes, and for $T=tk$ and $j=1,\dots,k$ we have

$\displaystyle\text{Cov}(\epsilon^{\ast}_{T+1},\epsilon_{tk+j})$	$\displaystyle=\sum_{i=0}^{k-1}\text{Cov}(y_{tk+k-i},\epsilon_{tk+j})$	(24)
	$\displaystyle=\sum_{i=0}^{k-1}\sum_{l=0}^{tk+k-i}\phi^{l}\text{Cov}(\epsilon_{tk+k-i-l},\epsilon_{tk+j})$	(25)
	$\displaystyle=\sigma^{2}\sum_{i=0}^{k-j}\phi^{i}$	(26)
	$\displaystyle=\sigma^{2}\frac{1-\phi^{k-j+1}}{1-\phi},$	(27)

since $\text{Cov}(\epsilon_{tk+k-i-l},\epsilon_{tk+j})=\sigma^{2}$ if $l=k-i-j$ and $0$ otherwise. Together, we obtain the temporal cross-covariances of

	$\displaystyle\text{Cov}({e^{\ast}_{T}}^{(1)},e_{tk}^{(h)})$	$\displaystyle=\text{Cov}({e^{\ast}_{T}}^{(1)},\sum_{i=0}^{h-1}\phi^{i}\epsilon_{tk+h-i})$		(28)
		$\displaystyle=\frac{\sigma^{2}}{1-\phi}\left(\frac{1-\phi^{h}}{1-\phi}-\phi^{k-h+1}\frac{1-\phi^{2h}}{1-\phi^{2}}\right),$		(29)

hence the cross-covariance vector is given by

	$\displaystyle\text{Cov}(e^{\ast}_{T},\mathbf{e}_{tk})$	$\displaystyle=\sigma^{2}(1,\dots,1)\Phi\Phi^{\prime}.$		(30)

∎
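
As a sanity check on Lemma 1, the closed form in Eq. (29) should agree with the $h$-th column sum of the bottom-level covariance matrix, that is, with the sum of Eq. (21) over $h_{1}=1,\dots,k$. The following Python sketch compares the two numerically; the function names are hypothetical and the parameter values are arbitrary.

```python
def cross_cov_closed_form(phi, sigma2, k, h):
    """Eq. (29): covariance of the aggregate 1-step error with the h-step bottom error."""
    geometric = (1 - phi ** h) / (1 - phi)
    tail = phi ** (k - h + 1) * (1 - phi ** (2 * h)) / (1 - phi ** 2)
    return sigma2 / (1 - phi) * (geometric - tail)

def cross_cov_by_summation(phi, sigma2, k, h):
    """Sum of Eq. (21) over h1 = 1..k, i.e. a column sum of Cov(e_t)."""
    total = 0.0
    for h1 in range(1, k + 1):
        lo, hi = min(h1, h), max(h1, h)
        total += sum(phi ** (hi - lo + 2 * l) for l in range(lo))
    return sigma2 * total

phi, sigma2, k = 0.5, 1.0, 4
gaps = [abs(cross_cov_closed_form(phi, sigma2, k, h)
            - cross_cov_by_summation(phi, sigma2, k, h))
        for h in range(1, k + 1)]
print(max(gaps))
```

For $k=2$ and $h=1$ both expressions reduce to $\sigma^{2}(1+\phi)$.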

Proof of Theorem 1.

The minimizer of Eq. (2.1) is given by $G^{\ast}=(S^{\prime}W_{1}^{-1}S)^{-1}S^{\prime}W_{1}^{-1}$ . First, note that

	$\displaystyle W_{1}^{-1}S=\begin{pmatrix}\mathbf{0}_{k}^{\prime}\\ (\sigma^{2}\Phi\Phi^{\prime})^{-1}\end{pmatrix},$		(31)

due to $\text{Cov}(e^{\ast}_{T},\mathbf{e}_{tk})=\sigma^{2}\mathbf{1}_{k}^{\prime}\Phi\Phi^{\prime}$. Then the minimizing matrix is $G^{\ast}=(\mathbf{0}_{k}~I_{k})$ and hence

	$\displaystyle SG^{\ast}=\begin{pmatrix}0&\mathbf{1}_{k}^{\prime}\\ \mathbf{0}_{k}&I_{k}\end{pmatrix},$		(32)

which is exactly the bottom-up forecast for the aggregated series. ∎
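
The statement of Theorem 1 can also be checked numerically for a small hierarchy. The Python sketch below is an illustration only, not code from the accompanying R package. It builds $W_{1}$ for an AR(1) with a two-level hierarchy (one aggregate on top of $k$ bottom-level periods), evaluates $G^{\ast}=(S^{\prime}W_{1}^{-1}S)^{-1}S^{\prime}W_{1}^{-1}$ with a plain Gauss-Jordan inverse, and returns a matrix that should be close to $(\mathbf{0}_{k}~I_{k})$. All function names are hypothetical. The sketch assumes that $\Phi$ is lower triangular with entries $\phi^{h-i}$ and that $\sigma_{\ast}^{2}$ is obtained from the invertible MA root.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def sigma2_star(phi, sigma2, k):
    """Innovation variance of the aggregated ARMA(1,1), invertible root assumed."""
    c = [0.0] * (2 * k - 1)
    for i in range(k):
        for j in range(k):
            c[i + j] += phi ** i
    g0 = sum(x * x for x in c)
    g1 = sum(c[m] * c[m + k] for m in range(k - 1))
    rho = g1 / g0
    eta = 0.0 if rho == 0 else (1 - math.sqrt(1 - 4 * rho * rho)) / (2 * rho)
    return sigma2 * g0 / (1 + eta * eta)

def mint_g(phi, sigma2, k):
    """G* = (S' W1^{-1} S)^{-1} S' W1^{-1} for the aggregate-plus-bottom hierarchy."""
    Phi = [[phi ** (h - i) if i <= h else 0.0 for i in range(k)] for h in range(k)]
    Sigma = [[sigma2 * x for x in row] for row in matmul(Phi, transpose(Phi))]
    cross = [sum(col) for col in zip(*Sigma)]  # sigma^2 * 1' Phi Phi'
    W = [[sigma2_star(phi, sigma2, k)] + cross]
    W += [[cross[h]] + Sigma[h] for h in range(k)]
    S = [[1.0] * k] + [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    St_Winv = matmul(transpose(S), inverse(W))
    return matmul(inverse(matmul(St_Winv, S)), St_Winv)

G = mint_g(phi=0.5, sigma2=1.0, k=3)
for row in G:
    print([round(x, 10) for x in row])
```

Up to floating-point error, the first column of `G` vanishes and the remaining columns form the identity, so the reconciled forecasts only use the bottom-level base forecasts.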

Appendix B Additional Plots

References

Amemiya and Wu, (1972) Amemiya, T. and Wu, R. Y. (1972). The effect of aggregation on prediction in the autoregressive model. Journal of the American Statistical Association, 67(339):628–632.
Athanasopoulos et al., (2024) Athanasopoulos, G., Hyndman, R. J., Kourentzes, N., and Panagiotelis, A. (2024). Forecast reconciliation: A review. International Journal of Forecasting, 40(2):430–456.
Athanasopoulos et al., (2017) Athanasopoulos, G., Hyndman, R. J., Kourentzes, N., and Petropoulos, F. (2017). Forecasting with temporal hierarchies. European Journal of Operational Research, 262(1):60–74.
Girolimetto et al., (2023) Girolimetto, D., Athanasopoulos, G., Di Fonzo, T., and Hyndman, R. J. (2023). Cross-temporal probabilistic forecast reconciliation: Methodological and practical issues. International Journal of Forecasting.
Hyndman, (2018) Hyndman, R. (2018). Mcomp: Data from the M-Competitions. R package version 2.8.
Hyndman et al., (2011) Hyndman, R. J., Ahmed, R. A., Athanasopoulos, G., and Shang, H. L. (2011). Optimal combination forecasts for hierarchical time series. Computational Statistics & Data Analysis, 55(9):2579–2589.
Hyndman and Athanasopoulos, (2018) Hyndman, R. J. and Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
Hyndman et al., (2016) Hyndman, R. J., Lee, A. J., and Wang, E. (2016). Fast computation of reconciled forecasts for hierarchical and grouped time series. Computational Statistics & Data Analysis, 97:16–32.
Jones, (1987) Jones, M. C. (1987). Randomly choosing parameters from the stationarity and invertibility region of autoregressive-moving average models. Journal of the Royal Statistical Society. Series C (Applied Statistics), 36(2):134–138.
Koning et al., (2005) Koning, A. J., Franses, P. H., Hibon, M., and Stekler, H. (2005). The M3 competition: Statistical tests of the results. International Journal of Forecasting, 21(3):397–409.
Ledoit and Wolf, (2012) Ledoit, O. and Wolf, M. (2012). Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics, 40(2):1024–1060.
Makridakis and Hibon, (2000) Makridakis, S. and Hibon, M. (2000). The M3-Competition: results, conclusions and implications. International Journal of Forecasting, 16(4):451–476.
Neubauer and Filzmoser, (2024) Neubauer, L. and Filzmoser, P. (2024). Improving forecasts for heterogeneous time series by “averaging”, with application to food demand forecasts. International Journal of Forecasting.
Nystrup et al., (2021) Nystrup, P., Lindström, E., Møller, J. K., and Madsen, H. (2021). Dimensionality reduction in forecasting with temporal hierarchies. International Journal of Forecasting, 37(3):1127–1146.
Nystrup et al., (2020) Nystrup, P., Lindström, E., Pinson, P., and Madsen, H. (2020). Temporal hierarchies with autocorrelation for load forecasting. European Journal of Operational Research, 280(3):876–888.
Panagiotelis et al., (2021) Panagiotelis, A., Athanasopoulos, G., Gamakumara, P., and Hyndman, R. J. (2021). Forecast reconciliation: A geometric view with new insights on bias correction. International Journal of Forecasting, 37(1):343–359.
Panagiotelis et al., (2023) Panagiotelis, A., Gamakumara, P., Athanasopoulos, G., and Hyndman, R. J. (2023). Probabilistic forecast reconciliation: Properties, evaluation and score optimisation. European Journal of Operational Research, 306(2):693–706.
Ramírez et al., (2014) Ramírez, O. A., Mullen, J., and Collart, A. J. (2014). Insights into the appropriate level of disaggregation for efficient time series model forecasting. Journal of Applied Statistics, 41:2298–2311.
Silvestrini and Veredas, (2008) Silvestrini, A. and Veredas, D. (2008). Temporal aggregation of univariate and multivariate time series models: A survey. Journal of Economic Surveys, 22(3):458–497.
Wickramasuriya et al., (2019) Wickramasuriya, S. L., Athanasopoulos, G., and Hyndman, R. J. (2019). Optimal forecast reconciliation for hierarchical and grouped time series through trace minimization. Journal of the American Statistical Association, 114(526):804–819.