Proper Scoring Rules for Multivariate Probabilistic Forecasts based on Aggregation and Transformation
Abstract
Proper scoring rules are an essential tool to assess the predictive performance of probabilistic forecasts. However, propriety alone does not ensure an informative characterization of predictive performance and it is recommended to compare forecasts using multiple scoring rules. With that in mind, interpretable scoring rules providing complementary information are necessary. We formalize a framework based on aggregation and transformation to build interpretable multivariate proper scoring rules. Aggregation-and-transformation-based scoring rules are able to target specific features of the probabilistic forecasts, which improves the characterization of the predictive performance. This framework is illustrated through examples taken from the literature and studied using numerical experiments showcasing its benefits. In particular, it is shown that it can help bridge the gap between proper scoring rules and spatial verification tools.
1 Introduction
Probabilistic forecasting makes it possible to issue forecasts that carry information about the prediction uncertainty. It has become an essential tool in numerous applied fields such as weather and climate prediction (Vannitsem et al., 2021; Palmer, 2012), earthquake forecasting (Jordan et al., 2011; Schorlemmer et al., 2018), electricity price forecasting (Nowotarski and Weron, 2018) or renewable energies (Pinson, 2013; Gneiting et al., 2023), among others. Moreover, it is slowly reaching fields further from "usual" forecasting, such as epidemiological predictions (Bosse et al., 2023) or breast cancer recurrence prediction (Al Masry et al., 2023). In weather forecasting, probabilistic forecasts often take the form of ensemble forecasts in which the dispersion among members captures forecast uncertainty.
The development of probabilistic forecasts has induced the need for appropriate verification methods. Forecast verification fulfills two main purposes: quantifying how good a forecast is given the available observations and allowing one to rank different forecasts according to their predictive performance. Scoring rules provide a single value to compare forecasts with observations. Propriety is a property of scoring rules that encourages forecasters to follow their true beliefs and that prevents hedging. Proper scoring rules allow calibration and sharpness to be assessed simultaneously (Winkler, 1977; Winkler et al., 1996). Calibration is the statistical compatibility between forecasts and observations. Sharpness is the uncertainty of the forecast itself. Propriety is a necessary property of good scoring rules, but it does not guarantee that a scoring rule provides an informative characterization of predictive performance. In univariate and multivariate settings, numerous studies have shown that no scoring rule has it all, and thus, different scoring rules should be used to get a better understanding of the predictive performance of forecasts (see, e.g., Scheuerer and Hamill 2015; Taillardat 2021; Bjerregård et al. 2021). With that in mind, Scheuerer and Hamill (2015) "strongly recommend that several different scores be always considered before drawing conclusions." This amplifies the need for numerous complementary proper scoring rules that are well-understood to facilitate forecast verification. In that direction, Dorninger et al. (2018) state that "gaining an in-depth understanding of forecast performance depends on grasping the full meaning of the verification results."
Interpretability of proper scoring rules can arise from being induced by a consistent scoring function for a functional (e.g., the squared error is induced by a scoring function consistent for the mean; Gneiting 2011), knowing what aspects of the forecast the scoring rule discriminates (e.g., the Dawid-Sebastiani score only discriminates forecasts through their mean and variance; Dawid and Sebastiani 1999) or knowing the limitations of a certain proper scoring rule (e.g., the variogram score is incapable of discriminating two forecasts that only differ by a constant bias; Scheuerer and Hamill 2015). In this context, interpretable proper scoring rules become verification methods of choice as the ranking of forecasts they produce can be more informative than the ranking of a more complex but less interpretable scoring rule. Section 2 provides an in-depth explanation of this in the case of univariate scoring rules. It is worth noting that interpretability of a scoring rule can also arise from its decomposition into meaningful terms (see, e.g., Bröcker 2009). This type of interpretability can be used complementarily to the framework proposed in this article.
Scheuerer and Hamill (2015) proposed the variogram score to target the verification of the dependence structure. The variogram score of order $p$ ($p > 0$) is defined as
$$\mathrm{VS}_p(F, \boldsymbol{y}) = \sum_{i,j=1}^{d} w_{ij} \left( \mathbb{E}_F\left[|X_i - X_j|^p\right] - |y_i - y_j|^p \right)^2,$$
where $X_i$ is the $i$-th component of the random vector $\boldsymbol{X} \in \mathbb{R}^d$ following $F$, the $w_{ij}$ are nonnegative weights and $\boldsymbol{y} \in \mathbb{R}^d$ is an observation. The construction of the variogram score relies on two main principles. First, the variogram score is the weighted sum of scoring rules acting on the distribution of $(X_i, X_j)$ and on paired components of the observations $(y_i, y_j)$. This aggregation principle allows the combination of proper scoring rules and summarizes them into a proper scoring rule acting on the whole distribution $F$ and observations $\boldsymbol{y}$. Second, the scoring rules composing the weighted sum can be seen as a standard proper scoring rule applied to transformations of both forecasts and observations. Let us denote $\gamma_{ij}^p : \boldsymbol{x} \mapsto |x_i - x_j|^p$ the transformation related to the variogram of order $p$, then the variogram score can be rewritten as
$$\mathrm{VS}_p(F, \boldsymbol{y}) = \sum_{i,j=1}^{d} w_{ij}\, \mathrm{SE}\left(\gamma_{ij}^p(F), \gamma_{ij}^p(\boldsymbol{y})\right),$$
where $\mathrm{SE}$ is the univariate squared error and $\gamma_{ij}^p(F)$ is the distribution of $\gamma_{ij}^p(\boldsymbol{X})$ for $\boldsymbol{X}$ following $F$. This second principle is the transformation principle, which makes it possible to build transformation-based proper scoring rules that can benefit from interpretability arising from a transformation (here, the variogram transformation $\gamma_{ij}^p$) and the simplicity and interpretability of the proper scoring rule they rely on (here, the squared error).
We review the univariate and multivariate proper scoring rules through the lens of interpretability and by mentioning their known benefits and limitations. We formalize these two principles of aggregation and transformation to construct interpretable proper scoring rules for multivariate forecasts. To illustrate the use of these principles, we provide examples of transformation-and-aggregation-based scoring rules from both the literature on probabilistic forecast verification and quantities of interest. We conduct a simulation study to empirically demonstrate how transformation-and-aggregation-based scoring rules can be used. Additionally, we show how the aggregation and transformation principles can help bridge the gap between the proper scoring rules framework and the spatial verification tools (Gilleland et al., 2009; Dorninger et al., 2018).
The remainder of this article is organized as follows. Section 2 gives a general review of verification methods for univariate and multivariate forecasts. Section 3 introduces the framework of proper scoring rules based on transformation and aggregation for multivariate forecasts. Section 4 provides examples of transformation-and-aggregation-based scoring rules, including examples from the literature. Then, Section 5 showcases through different simulation setups how the framework proposed in this article can help build interpretable proper scoring rules. Finally, Section 6 provides a summary as well as a discussion on the verification of multivariate forecasts. Throughout the article, we focus on spatial forecasts for simplicity. However, the points made remain valid for any multivariate forecasts, including temporal forecasts or spatio-temporal forecasts.
2 Overview of verification tools for univariate and multivariate forecasts
This section presents the zoology of available verification tools and briefly summarizes their benefits and limitations. First, we define scoring rules and their key properties. Then, we recall univariate scoring rules, starting with ones derived from scoring functions used in point forecasting. Finally, we provide an overview of verification tools for multivariate forecasts.
2.1 Calibration, sharpness, and propriety
Gneiting et al. (2007) proposed a paradigm for the evaluation of probabilistic forecasts: "maximizing the sharpness of the predictive distributions subject to calibration". Calibration is the statistical compatibility between the forecast and the observations. Sharpness is the concentration of the forecast and is a property of the forecast itself. In other words, the paradigm aims at minimizing the uncertainty of the forecast given that the forecast is statistically consistent with the observations. Tsyplakov (2011) states that the notion of calibration in the paradigm is too vague but it holds if the definition of calibration is refined. This principle for the evaluation of probabilistic forecasts has reached a consensus in the field of probabilistic forecasting (see, e.g., Gneiting and Katzfuss 2014; Thorarinsdottir and Schuhen 2018). The paradigm proposed in Gneiting et al. (2007) is not the first mention of the link between sharpness and calibration: for example, Murphy and Winkler (1987) mentioned the relation between refinement (i.e., sharpness) and calibration.
For univariate forecasts, multiple definitions of calibration are available depending on the setting. The most used definition is probabilistic calibration and, broadly speaking, consists of computing the rank of observations among samples of the forecast and checking for uniformity with respect to observations. If the forecast is calibrated, observations should not be distinguishable from forecast samples, and thus, the distribution of their ranks should be uniform. Probabilistic calibration can be assessed by probability integral transform (PIT) histograms (Dawid, 1984) or rank histograms (Anderson, 1996; Talagrand et al., 1997) for ensemble forecasts when observations are stationary (i.e., their distribution is the same across time). The shape of the PIT or rank histogram gives information about the type of (potential) miscalibration: a triangular-shaped histogram suggests that the probabilistic forecast has a systematic bias, a $\cup$-shaped histogram suggests that the probabilistic forecast is under-dispersed and a $\cap$-shaped histogram suggests that the probabilistic forecast is over-dispersed. Moreover, probabilistic calibration implies that rank histograms should be uniform but uniformity is not sufficient. For example, rank histograms should also be uniform conditionally on different forecast scenarios (e.g., conditionally on the value of the observations available when the forecast is issued). Additionally, under certain hypotheses, calibration tools have been developed to consider real-world limitations such as serial dependence (Bröcker and Ben Bouallègue, 2020). Statistical tests have been developed to check the uniformity of rank histograms (Jolliffe and Primo, 2008). Readers interested in a more in-depth understanding of univariate forecast calibration are encouraged to consult Tsyplakov (2013, 2020).
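As a rough illustration of the rank histogram described above, the following sketch counts the rank of each observation within its ensemble. The function name `rank_histogram`, the synthetic Gaussian data and the ensemble size are assumptions made for this example only; ties are not randomized, which is harmless for continuous variables.

```python
import numpy as np

def rank_histogram(ensembles, observations):
    """Count how often the observation takes each rank within the ensemble.

    ensembles: array of shape (n_cases, n_members)
    observations: array of shape (n_cases,)
    Returns an array of n_members + 1 counts (ranks 0, ..., n_members).
    """
    ensembles = np.asarray(ensembles, dtype=float)
    observations = np.asarray(observations, dtype=float)
    # Rank of the observation = number of members strictly below it.
    ranks = np.sum(ensembles < observations[:, None], axis=1)
    return np.bincount(ranks, minlength=ensembles.shape[1] + 1)

# Synthetic data (an assumption for illustration): standard normal observations.
rng = np.random.default_rng(0)
n_cases, n_members = 20_000, 9
obs = rng.normal(size=n_cases)
calibrated = rng.normal(size=(n_cases, n_members))                 # same law as obs
underdispersed = rng.normal(scale=0.5, size=(n_cases, n_members))  # too narrow

hist_calibrated = rank_histogram(calibrated, obs)          # roughly flat
hist_underdispersed = rank_histogram(underdispersed, obs)  # outer bins dominate
```

In this synthetic setting, the calibrated ensemble gives a roughly flat histogram, whereas the under-dispersed ensemble places most observations in the two extreme ranks.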
For multivariate forecasts, a popular approach relies on a similar principle: first, multivariate forecast samples are transformed into univariate quantities using so-called pre-rank functions and then the calibration is assessed by techniques used in the univariate case (see, e.g., Gneiting et al. 2008). Pre-rank functions may be interpretable and allow targeting the calibration of specific aspects of the forecast such as the dependence structure. Readers interested in the calibration of multivariate forecasts can refer to Allen et al. (2024) for a comprehensive review of multivariate calibration.
A scoring rule $S$ assigns a real-valued quantity $S(F, \boldsymbol{y})$ to a forecast-observation pair $(F, \boldsymbol{y})$, where $F \in \mathcal{F}$ is a probabilistic forecast and $\boldsymbol{y} \in \mathbb{R}^d$ is an observation. In the negative-oriented convention, a scoring rule $S$ is proper relative to the class $\mathcal{F}$ if
$$\mathbb{E}_G[S(G, \boldsymbol{Y})] \leq \mathbb{E}_G[S(F, \boldsymbol{Y})] \tag{1}$$
for all $F, G \in \mathcal{F}$, where $\mathbb{E}_G[\cdots]$ is the expectation with respect to $\boldsymbol{Y} \sim G$. In simple terms, a scoring rule is proper relative to a class of distributions if its expected value is minimal when the true distribution is predicted, for any distribution within the class. Forecasts minimizing the expected scoring rule are said to be efficient and the other forecasts are said to be sub-efficient. Moreover, the scoring rule $S$ is strictly proper relative to the class $\mathcal{F}$ if the equality in (1) holds if and only if $F = G$. This ensures the characterization of the ideal forecast (i.e., there is a unique efficient forecast and it is the true distribution). Moreover, proper scoring rules are powerful tools as they allow the assessment of calibration and sharpness simultaneously (Winkler, 1977; Winkler et al., 1996). Sharpness can be assessed individually using the entropy associated with proper scoring rules, defined by $e_S(F) = \mathbb{E}_F[S(F, \boldsymbol{Y})]$. The sharper the forecast, the smaller its entropy. Strictly proper scoring rules can also be used to infer the parameters of a parametric probabilistic forecast (see, e.g., Gneiting et al. 2005; Pacchiardi et al. 2024).
Regardless of all the interesting properties of proper scoring rules, it is worth noting that they have some limitations. Proper scoring rules may have multiple efficient forecasts (i.e., associated with their minimal expected value) and, in the general setting, no guarantee is given on their relevance. Moreover, strict propriety ensures that the efficient forecast is unique and that it is the ideal forecast (i.e., the true distribution). However, no guarantee is available for forecasts within the vicinity of the minimum in the general case. This is particularly problematic since, in practice, the unavailability of the ideal distribution makes it impossible to know if the minimum expected score is achieved. In the case of calibrated forecasts, the expected scoring rule is the entropy of the forecast and the ranking of forecasts is thus linked to the information carried by the forecast (see Corollary 4, Holzmann and Eulert 2014 for the complete result). These limitations may explain the plurality of scoring rules depending on application fields.
2.2 Univariate scoring rules
We recall classical univariate scoring rules to explain key concepts. Some univariate scoring rules will be useful for the multivariate scoring rules construction framework proposed in Section 3. Let $\mathcal{P}(\mathbb{R})$ denote the class of Borel probability measures on $\mathbb{R}$. We consider a probabilistic forecast $F \in \mathcal{P}(\mathbb{R})$ in the form of its cumulative distribution function (cdf) and an observation $y \in \mathbb{R}$. When the probabilistic forecast $F$ has a probability density function (pdf), it will be denoted $f$.
The simplest scoring rules can be derived from scoring functions used to assess point forecasts. The squared error (SE) is the most popular and is known through its averaged value (the mean squared error; MSE) or the square root of its average (the root mean squared error; RMSE) which has the advantage of being expressed in the same units as the observations. As a scoring rule, the SE is expressed as
$$\mathrm{SE}(F, y) = (\mu_F - y)^2, \tag{2}$$
where $\mu_F$ denotes the mean of the predicted distribution $F$. The SE solely discriminates the mean of the forecast (see Appendix A); efficient forecasts for SE are the ones matching the mean of the true distribution. The SE is proper relative to $\mathcal{P}_2(\mathbb{R})$, the class of Borel probability measures on $\mathbb{R}$ with a finite second moment (i.e., finite variance). Note that the SE cannot be strictly proper as the equality of means does not imply the equality of distributions.
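A small Monte Carlo sketch can illustrate that the expected SE, with the SE taken as the squared difference between the forecast mean and the observation, is minimized by forecasts whose mean matches the mean of the true distribution. The gamma distribution standing in for the truth, the sample size and the grid of candidate means are assumptions chosen for illustration.

```python
import numpy as np

def squared_error(forecast_mean, y):
    """SE of a forecast that enters only through its mean."""
    return (forecast_mean - y) ** 2

# Assumed "true" distribution: gamma with mean 3.0 and variance 4.5.
rng = np.random.default_rng(1)
y = rng.gamma(shape=2.0, scale=1.5, size=200_000)

# Expected SE estimated by the sample average, for a grid of forecast means.
candidate_means = np.linspace(0.0, 6.0, 61)
expected_se = np.array([squared_error(m, y).mean() for m in candidate_means])

best_mean = candidate_means[np.argmin(expected_se)]  # close to the true mean
min_expected_se = expected_se.min()                  # close to the true variance
```

Any forecast with the right mean attains this minimum, whatever its variance or shape, which is why the SE is proper but not strictly proper.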
Another well-known scoring rule is the absolute error (AE) defined by
$$\mathrm{AE}(F, y) = |\mathrm{med}(F) - y|, \tag{3}$$
where $\mathrm{med}(F)$ is the median of the predicted distribution $F$. The mean absolute error (MAE), the average of the absolute error, is the most commonly seen form of the AE and it is also expressed in the same units as the observations. Efficient forecasts are forecasts that have a median equal to the median of the true distribution. The AE is proper relative to $\mathcal{P}_1(\mathbb{R})$ but not strictly proper. Similarly, the quantile score (QS), also known as the pinball loss, is a scoring rule focusing on quantiles of level $\alpha$ defined by
$$\mathrm{QS}_\alpha(F, y) = \left(\mathbb{1}_{y \leq F^{-1}(\alpha)} - \alpha\right)\left(F^{-1}(\alpha) - y\right), \tag{4}$$
where $0 < \alpha < 1$ is a probability level and $F^{-1}(\alpha)$ is the predicted quantile of level $\alpha$. The case $\alpha = 0.5$ corresponds to the AE up to a factor $2$. The QS of level $\alpha$ is proper relative to $\mathcal{P}_1(\mathbb{R})$ but not strictly proper since efficient forecasts are ones correctly predicting the quantile of level $\alpha$ (see, e.g., Friederichs and Hense 2008).
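The quantile score can be sketched in a few lines, assuming the usual pinball-loss form (1{y <= q} - alpha)(q - y) for a predicted quantile q of level alpha. The function names are hypothetical, and the ensemble variant simply plugs in the empirical quantile returned by `numpy.quantile`, which is one possible estimator among others.

```python
import numpy as np

def quantile_score(q_alpha, y, alpha):
    """Pinball loss (1{y <= q} - alpha) * (q - y) for a predicted alpha-quantile q."""
    q_alpha = np.asarray(q_alpha, dtype=float)
    y = np.asarray(y, dtype=float)
    return ((y <= q_alpha).astype(float) - alpha) * (q_alpha - y)

def ensemble_quantile_score(members, y, alpha):
    """QS with the predicted quantile estimated from ensemble members."""
    q = np.quantile(np.asarray(members, dtype=float), alpha)
    return quantile_score(q, y, alpha)

# For alpha = 0.5, twice the QS is the absolute error of the median.
qs_median = quantile_score(2.0, 5.0, 0.5)        # 0.5 * |2 - 5|
qs_upper = quantile_score(2.0, 1.0, 0.9)         # (1 - 0.9) * (2 - 1)
qs_ens = ensemble_quantile_score([1, 2, 3, 4, 5], 5.0, 0.5)
```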
Another summary statistic of interest is the exceedance of a threshold $t \in \mathbb{R}$. The Brier score (BS; Brier 1950) was initially introduced for binary predictions but can also be used to discriminate forecasts based on the exceedance of a threshold $t$. For probabilistic forecasts, the BS is defined as
$$\mathrm{BS}_t(F, y) = \left(1 - F(t) - \mathbb{1}_{y > t}\right)^2 = \left(F(t) - \mathbb{1}_{y \leq t}\right)^2, \tag{5}$$
where $1 - F(t)$ is the predicted probability that the threshold $t$ is exceeded. The BS is proper relative to $\mathcal{P}(\mathbb{R})$ but not strictly proper. Binary events (e.g., exceedance of thresholds) are relevant in weather forecasting as they are used, for example, in operational settings for decision-making.
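For an ensemble forecast, the Brier score of a threshold exceedance can be sketched by estimating the exceedance probability with the fraction of members above the threshold. This plug-in estimate and the function name `brier_score_ensemble` are assumptions of the example.

```python
import numpy as np

def brier_score_ensemble(members, y, threshold):
    """Squared difference between predicted and observed threshold exceedance."""
    members = np.asarray(members, dtype=float)
    p_exceed = np.mean(members > threshold)   # estimate of 1 - F(t)
    return (p_exceed - float(y > threshold)) ** 2

# Half of the members exceed t = 2.5, so the score is 0.25 whatever happens.
bs_event = brier_score_ensemble([1, 2, 3, 4], y=3.0, threshold=2.5)
bs_no_event = brier_score_ensemble([1, 2, 3, 4], y=0.0, threshold=2.5)
# All members exceed the threshold and the event occurs: perfect score.
bs_perfect = brier_score_ensemble([3, 4], y=5.0, threshold=2.5)
```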
All the scoring rules presented above are proper but not strictly proper since they only discriminate forecasts through specific summary statistics instead of the whole distribution. Nonetheless, they are still used as they allow forecasters to verify specific characteristics of the forecast: the mean, the median, the quantile of level $\alpha$ or the exceedance of a threshold $t$. The simplicity of these scoring rules makes them interpretable, thus making them essential verification tools.
Some univariate scoring rules contain a summary statistic: for example, the formulas of the QS (4) and the BS (5) contain the quantile of level $\alpha$ and the exceedance of a threshold $t$, respectively. They can be seen as a scoring function applied to a summary statistic. This duality can be understood through the link between scoring functions and scoring rules through consistent functionals as presented in Gneiting (2011) or Section 2.2 in Lerch et al. (2017).
Other summary statistics can be of interest depending on applications. Nonetheless, it is worth noting that misspecifications of numerous summary statistics cannot be discriminated because of their non-elicitability. Non-elicitability of a transformation implies that no proper scoring rule can be constructed such that efficient forecasts are forecasts where the transformation is equal to the one of the true distribution. For example, the variance is known to be non-elicitable; however, it is jointly elicitable with the mean (see, e.g., Brehmer 2017). Readers interested in details regarding elicitable, non-elicitable and jointly elicitable transformations may refer to Gneiting (2011), Brehmer and Strokorb (2019) and references therein.
A strictly proper scoring rule should discriminate the whole distribution and not only specific summary statistics. The continuous ranked probability score (CRPS; Matheson and Winkler 1976) is the most popular univariate scoring rule in weather forecasting applications and can be written in the following equivalent forms
\begin{align}
\mathrm{CRPS}(F, y) &= \mathbb{E}_F|X - y| - \frac{1}{2}\mathbb{E}_F|X - X'| \tag{6}\\
&= \int_{\mathbb{R}} \mathrm{BS}_z(F, y)\,\mathrm{d}z \tag{7}\\
&= 2\int_0^1 \mathrm{QS}_\alpha(F, y)\,\mathrm{d}\alpha, \tag{8}
\end{align}
where $y \in \mathbb{R}$ and $X$ and $X'$ are independent random variables following $F$, with a finite first moment. Equations (7) and (8) show that the CRPS is linked with the BS and the QS. Broadly speaking, as the QS discriminates a quantile associated with a specific level, integrating the QS across all levels discriminates the quantile function that fully characterizes univariate distributions. Similarly, integrating the BS across all thresholds discriminates the cumulative distribution function that also fully characterizes univariate distributions. The CRPS is a strictly proper scoring rule relative to $\mathcal{P}_1(\mathbb{R})$, the class of Borel probability measures on $\mathbb{R}$ with a finite first moment. In addition, Equation (6) indicates the CRPS values have the same units as observations. In the case of deterministic forecasts, the CRPS reduces to the absolute error, in its scoring function form (Hersbach, 2000). The use of the CRPS for ensemble forecasts is straightforward using expectations as in (6). Ferro et al. (2008) and Zamo and Naveau (2017) studied estimators of the CRPS for ensemble forecasts.
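The kernel form (6) lends itself to a simple plug-in estimator for ensembles, in which both expectations are replaced by averages over members. The sketch below treats the ensemble as an empirical distribution, which is only one of the estimators discussed in the literature cited above, and checks it numerically against the threshold-integrated Brier score of form (7). Function names and the integration grid are assumptions.

```python
import numpy as np

def crps_ensemble(members, y):
    """Plug-in CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    members = np.asarray(members, dtype=float)
    term_obs = np.mean(np.abs(members - y))
    term_spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return term_obs - 0.5 * term_spread

def crps_via_brier(members, y, grid):
    """Riemann-sum approximation of the integral of BS_z over thresholds z."""
    members = np.asarray(members, dtype=float)
    # Empirical cdf of the ensemble evaluated at each grid point.
    cdf = np.mean(members[None, :] <= grid[:, None], axis=1)
    brier = (cdf - (y <= grid).astype(float)) ** 2
    return np.sum(brier) * (grid[1] - grid[0])

crps_kernel = crps_ensemble([0.0, 1.0], 0.5)
crps_integral = crps_via_brier([0.0, 1.0], 0.5, np.linspace(-1.0, 2.0, 3001))
# A single-member (deterministic) forecast gives the absolute error.
crps_deterministic = crps_ensemble([2.0], 5.0)
```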
In addition to scoring rules based on scoring functions, some scoring rules use the moments of the probabilistic forecast $F$. The SE (2) depends on the forecast only through its mean $\mu_F$. The Dawid-Sebastiani score (DSS; Dawid and Sebastiani 1999) is a scoring rule depending on the forecast only through its first two central moments. The DSS is expressed as
$$\mathrm{DSS}(F, y) = 2\log(\sigma_F) + \frac{(\mu_F - y)^2}{\sigma_F^2}, \tag{9}$$
where $\mu_F$ and $\sigma_F^2$ are the mean and the variance of the distribution $F$. The DSS is proper relative to $\mathcal{P}_2(\mathbb{R})$ but not strictly proper, since efficient forecasts only need to correctly predict the first two central moments (see Appendix A). Dawid and Sebastiani (1999) proposed a more general class of proper scoring rules but the DSS, as defined in (9), can be seen as a special case of the logarithmic score (up to an additive constant), introduced further down.
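The following sketch computes the DSS from a predicted mean and variance, assuming the form log(variance) + (mean - y)^2 / variance, and compares it with the logarithmic score of a Gaussian forecast with the same two moments. Under this normalization, the two differ by a factor 2 and the additive constant log(2*pi). The function names and the use of the unbiased sample variance for ensembles are assumptions.

```python
import numpy as np

def dawid_sebastiani(mean, var, y):
    """DSS assumed here as 2*log(sigma) + (mu - y)^2 / sigma^2."""
    return np.log(var) + (mean - y) ** 2 / var

def gaussian_log_score(mean, var, y):
    """Negative log-density of N(mean, var) evaluated at y."""
    return 0.5 * np.log(2 * np.pi * var) + (y - mean) ** 2 / (2 * var)

def dss_ensemble(members, y):
    """DSS with moments estimated from ensemble members (unbiased variance)."""
    members = np.asarray(members, dtype=float)
    return dawid_sebastiani(members.mean(), members.var(ddof=1), y)

dss_value = dawid_sebastiani(1.0, 2.0, 0.3)
dss_from_logs = 2 * gaussian_log_score(1.0, 2.0, 0.3) - np.log(2 * np.pi)
dss_ens = dss_ensemble([-1.0, 0.0, 1.0], 0.0)   # mean 0, variance 1, y = 0
```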
Another scoring rule relying on the central moments of the probabilistic forecast up to order three is the error-spread score (ESS; Christensen et al. 2014). The ESS is defined as
$$\mathrm{ESS}(F, y) = \left(\sigma_F^2 - (\mu_F - y)^2 - (\mu_F - y)\,\sigma_F\,\gamma_F\right)^2, \tag{10}$$
where $\mu_F$, $\sigma_F^2$ and $\gamma_F$ are the mean, the variance and the skewness of the probabilistic forecast $F$. The ESS is proper relative to $\mathcal{P}_4(\mathbb{R})$. As for the other scoring rules only based on moments of the forecast presented above, the expected ESS compares the probabilistic forecast $F$ with the true distribution only via their first four moments (see Appendix A). Scoring rules based on central moments of higher order could be built following the process described in Christensen et al. (2014). Such scoring rules would benefit from the interpretability induced by their construction and from their ease of application to ensemble forecasts. However, they would also inherit the limitation of being only proper.
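A minimal ensemble version of the error-spread score is sketched below, assuming the form (s^2 - e^2 - e*s*g)^2 with e the error of the ensemble mean, s^2 the ensemble variance and g the ensemble skewness. The use of population moments (no small-sample correction) and the function name are assumptions of this example.

```python
import numpy as np

def error_spread_score(members, y):
    """ESS from the mean, variance and skewness of the ensemble members."""
    members = np.asarray(members, dtype=float)
    m = members.mean()
    s2 = members.var()                      # population variance (ddof=0)
    s = np.sqrt(s2)
    g = np.mean(((members - m) / s) ** 3)   # skewness
    e = m - y                               # error of the ensemble mean
    return (s2 - e**2 - e * s * g) ** 2

# Symmetric two-member ensemble: mean 0, variance 1, skewness 0,
# so the score reduces to (1 - y^2)^2.
ess_matched = error_spread_score([-1.0, 1.0], 1.0)   # squared error equals variance
ess_centered = error_spread_score([-1.0, 1.0], 0.0)
ess_far = error_spread_score([-1.0, 1.0], 2.0)
```

The score vanishes when the squared error of the mean matches the ensemble variance (for a symmetric ensemble), which reflects the error-spread relationship the score is built on.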
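When the probabilistic forecast $F$ has a pdf $f$, scoring rules of a different type can be defined. Let $\mathcal{L}_\alpha(\mathbb{R})$ denote the class of probability measures on $\mathbb{R}$ that are absolutely continuous with respect to $\mu$ (usually taken as the Lebesgue measure) and have $\mu$-density $f$ such that
$$\|f\|_\alpha = \left(\int_{\mathbb{R}} f(x)^\alpha\, \mu(\mathrm{d}x)\right)^{1/\alpha} < \infty.$$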
The most popular scoring rule based on the pdf is the logarithmic score (also known as ignorance score; Good 1952; Roulston and Smith 2002). The logarithmic score is defined as
$$\mathrm{LogS}(F, y) = -\log(f(y)), \tag{11}$$
for $y$ such that $f(y) > 0$. In its formulation, the logarithmic score is different from the scoring rules seen previously. Good (1952) proposed the logarithmic score knowing its link with the theory of information: its entropy is the Shannon entropy (Shannon, 1948) and its expectation is related to the Kullback-Leibler divergence (Kullback and Leibler, 1951) (see Appendix A). The logarithmic score is strictly proper relative to the class $\mathcal{L}_1(\mathbb{R})$. Moreover, inference via minimization of the expected logarithmic score is equivalent to maximum likelihood estimation (see, e.g., Dawid et al. 2015). The logarithmic score belongs to the family of local scoring rules, which are scoring rules only depending on $y$, $f(y)$ and its derivatives up to a finite order. Another local scoring rule is the Hyvärinen score (also known as the gradient scoring rule; Hyvärinen 2005) and it is defined as
$$\mathrm{HS}(F, y) = 2\frac{f''(y)}{f(y)} - \frac{f'(y)^2}{f(y)^2},$$
for $y$ such that $f(y) > 0$. The Hyvärinen score is proper relative to the subclass of $\mathcal{L}_1(\mathbb{R})$ such that the density $f$ exists, is twice continuously differentiable and satisfies $f'(x)/f(x) \to 0$ as $|x| \to \infty$. It is worth noticing that the Hyvärinen score can be computed even if $f$ is only known up to a scale factor (e.g., up to a normalizing constant). This property allows circumventing the use of Monte Carlo methods or approximations of the normalizing constant when it is unavailable or hard to compute. This is a property of local proper scoring rules except for the logarithmic score (Parry et al., 2012). Readers eager to learn more about local proper scoring rules may refer to Parry et al. (2012) and Ehm and Gneiting (2012).
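The invariance of the Hyvärinen score to the normalizing constant can be checked numerically. The sketch below assumes the form 2 f''(y)/f(y) - (f'(y)/f(y))^2 and approximates the derivatives by central finite differences; the unnormalized Gaussian density, the step size `h` and the function names are illustrative assumptions.

```python
import numpy as np

def hyvarinen_score(density, y, h=1e-3):
    """HS via central finite differences of a (possibly unnormalized) density."""
    f0 = density(y)
    f1 = (density(y + h) - density(y - h)) / (2 * h)        # first derivative
    f2 = (density(y + h) - 2 * f0 + density(y - h)) / h**2  # second derivative
    return 2 * f2 / f0 - (f1 / f0) ** 2

def unnormalized_gaussian(x, mu=1.0, sigma=2.0, scale=1.0):
    """Gaussian kernel multiplied by an arbitrary positive constant."""
    return scale * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

y = 0.5
hs_unit = hyvarinen_score(unnormalized_gaussian, y)
hs_scaled = hyvarinen_score(lambda x: unnormalized_gaussian(x, scale=123.0), y)
# Closed form for a Gaussian: -2 / sigma^2 + (y - mu)^2 / sigma^4.
hs_exact = -2 / 2.0**2 + (y - 1.0) ** 2 / 2.0**4
```

Both numerical values agree with the closed form, whatever the multiplicative constant, since the constant cancels in every ratio.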
The logarithmic score and the Hyvärinen score do not allow $f(y)$ to be zero. To overcome this limitation, scoring rules expressed in terms of the $\mathcal{L}_\alpha$-norm have been proposed. The quadratic score is defined as
$$\mathrm{QuadS}(F, y) = \|f\|_2^2 - 2f(y),$$
where $\|f\|_2^2 = \int_{\mathbb{R}} f(x)^2\, \mu(\mathrm{d}x)$. The quadratic score is strictly proper relative to the class $\mathcal{L}_2(\mathbb{R})$.
The pseudospherical score is defined as
$$\mathrm{PseudoS}(F, y) = -\frac{f(y)^{\alpha - 1}}{\|f\|_\alpha^{\alpha - 1}},$$
with $\alpha > 1$. For $\alpha = 2$, it reduces to the spherical score (see, e.g., Jose 2007). The pseudospherical score is strictly proper relative to the class $\mathcal{L}_\alpha(\mathbb{R})$. The four scoring rules presented above have been criticized as they do not encourage a high probability in the vicinity of the observation (Gneiting and Raftery, 2007). In particular, as the logarithmic score is more sensitive to outliers, probabilistic forecasts inferred by its minimization may be overdispersive (Gneiting et al., 2005). Moreover, the pdf is not always available, for example in the case of ensemble forecasts.
Readers may refer to the various reviews of scoring rules available (see, e.g., Bröcker and Smith 2007; Gneiting and Raftery 2007; Gneiting and Katzfuss 2014; Thorarinsdottir and Schuhen 2018; Alexander et al. 2022). Formulas of the expected scoring rules presented are available in Appendix A.
Strictly proper scoring rules can be seen as more powerful than proper scoring rules. This is theoretically true when the interest is in identifying the ideal forecast (i.e., the true distribution). Regardless, in practice, scoring rules are also used to rank probabilistic forecasts and with that in mind, a given ranking of forecasts in terms of the expectation of a strictly proper scoring rule (such as the CRPS) is harder to interpret than a ranking in terms of the expectation of a proper but more interpretable scoring rule (such as the SE). The SE is known to discriminate the mean, and thus, a better rank in terms of expected SE implies a better prediction of the mean of the true distribution. Conversely, a better ranking in terms of CRPS implies a better prediction of the whole distribution, but it might not be useful as is, and other verification tools will be needed to know what caused this ranking. When forecasts are not calibrated, there seems to be a trade-off between interpretability and discriminatory power and this becomes more prominent in a multivariate setting. However, simpler interpretable tools and discriminatory-powerful tools can be used complementarily. The framework proposed in Section 3 aims at helping the construction of interpretable proper scoring rules.
2.3 Multivariate scoring rules
In a multivariate setting, forecasters cannot solely use univariate scoring rules as they are not able to discriminate forecasts beyond their $1$-dimensional marginals. Univariate scoring rules cannot discriminate the dependence structure between the univariate margins. Multivariate forecasts can be applied in different setups: spatial forecasts, temporal forecasts, multivariable forecasts or any combination of these categories (e.g., spatio-temporal forecasts of multiple variables). Considering weather forecasting, a spatial forecast could aim at predicting temperatures across multiple locations. A temporal forecast could be focused on predicting rainfall at multiple lead times at a given location. A multivariable forecast could predict both eastward and northward components of the wind. In the following, we consider a multivariate probabilistic forecast $F \in \mathcal{P}(\mathbb{R}^d)$ and an observation $\boldsymbol{y} \in \mathbb{R}^d$.
Even if there is no natural ordering in the multivariate case, the notions of median and quantile can be adapted using level sets, and then scoring rules using these quantities can be constructed (see, e.g., Meng et al. 2023). Nonetheless, as the mean is well-defined, the squared error (SE) can be defined in the multivariate setting:
$$\mathrm{SE}(F, \boldsymbol{y}) = \|\mu_F - \boldsymbol{y}\|_2^2, \tag{12}$$
where $\mu_F$ is the mean vector of the distribution $F$. Similar to the univariate case, the SE is proper relative to $\mathcal{P}_2(\mathbb{R}^d)$. Moments are well-defined in the multivariate case allowing the multivariate version of the Dawid-Sebastiani score to be defined. The Dawid-Sebastiani score (DSS) was proposed in Dawid and Sebastiani (1999) as
$$\mathrm{DSS}(F, \boldsymbol{y}) = \log(\det \Sigma_F) + (\mu_F - \boldsymbol{y})^{\top} \Sigma_F^{-1} (\mu_F - \boldsymbol{y}),$$
where $\mu_F$ and $\Sigma_F$ are the mean vector and the covariance matrix of the distribution $F$. The DSS is proper relative to $\mathcal{P}_2(\mathbb{R}^d)$ and it becomes strictly proper relative to any convex class of probability measures characterized by their first two moments (Gneiting and Raftery, 2007). The second term in the DSS is the squared Mahalanobis distance between $\boldsymbol{y}$ and $\mu_F$.
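The multivariate DSS can be sketched as the log-determinant of the predicted covariance matrix plus the squared Mahalanobis distance of the observation from the predicted mean. The function name is hypothetical, and the ensemble usage relies on the sample mean and sample covariance, which requires more members than dimensions for the covariance to be invertible.

```python
import numpy as np

def dawid_sebastiani_mv(mean, cov, y):
    """log det(cov) + (y - mean)^T cov^{-1} (y - mean)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = np.asarray(y, dtype=float) - mean
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("cov must be positive definite")
    # Solve a linear system rather than inverting the covariance explicitly.
    maha2 = diff @ np.linalg.solve(cov, diff)
    return logdet + maha2

dss_identity = dawid_sebastiani_mv([0, 0], np.eye(2), [3, 4])          # 0 + 25
dss_diag = dawid_sebastiani_mv([0, 0], np.diag([4.0, 1.0]), [2, 0])    # log 4 + 1

# Ensemble usage with synthetic members (50 members, dimension 3).
rng = np.random.default_rng(2)
members = rng.normal(size=(50, 3))
dss_ens = dawid_sebastiani_mv(members.mean(axis=0),
                              np.cov(members, rowvar=False),
                              np.zeros(3))
```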
To define a strictly proper scoring rule for multivariate forecasts, Gneiting and Raftery (2007) proposed the energy score (ES) as a generalization of the CRPS to the multivariate case. The ES is defined by
$$\mathrm{ES}_\alpha(F, \boldsymbol{y}) = \mathbb{E}_F\|\boldsymbol{X} - \boldsymbol{y}\|_2^\alpha - \frac{1}{2}\mathbb{E}_F\|\boldsymbol{X} - \boldsymbol{X}'\|_2^\alpha, \tag{13}$$
where $\alpha \in (0, 2)$ and $F \in \mathcal{P}_\alpha(\mathbb{R}^d)$, the class of Borel probability measures on $\mathbb{R}^d$ such that the moment of order $\alpha$ is finite. The definition of the ES is related to the kernel form of the CRPS (6), to which the ES reduces for $d = 1$ and $\alpha = 1$. As pointed out in Gneiting and Raftery (2007), in the limiting case $\alpha = 2$, the ES becomes the SE (12). The ES is strictly proper relative to $\mathcal{P}_\alpha(\mathbb{R}^d)$ (Székely, 2003; Gneiting and Raftery, 2007) and is suited for ensemble forecasts (Gneiting et al., 2008). Moreover, the parameter $\alpha$ gives some flexibility: a small value of $\alpha$ can be chosen and still lead to a strictly proper scoring rule, for example, when higher-order moments are ill-defined. The discrimination ability of the ES has been examined in numerous studies (see, e.g., Pinson and Girard 2012; Pinson and Tastu 2013; Scheuerer and Hamill 2015). Pinson and Girard (2012) studied the ability of the ES to discriminate among rival sets of scenarios (i.e., forecasts) of wind power generation. In the case of bivariate Gaussian processes, Pinson and Tastu (2013) illustrated that the ES appears to be more sensitive to misspecifications of the mean rather than misspecifications of the variance or dependence structure. The lack of sensitivity to misspecifications of the dependence structure has been confirmed in Scheuerer and Hamill (2015) using multivariate Gaussian random vectors of higher dimension. Moreover, the discriminatory power of the ES deteriorates in higher dimensions (Pinson and Tastu, 2013).
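For ensemble forecasts, the energy score can be sketched with a plug-in estimator that replaces both expectations in its kernel form, E||X - y||^alpha - 0.5 E||X - X'||^alpha, by averages over members. The function name and the toy ensembles are assumptions; the two checks reproduce the special cases mentioned above (CRPS for dimension one and exponent one, squared error of the mean in the limiting case of exponent two).

```python
import numpy as np

def energy_score(members, y, alpha=1.0):
    """Plug-in energy score for an ensemble of shape (n_members, d)."""
    members = np.asarray(members, dtype=float)
    y = np.asarray(y, dtype=float)
    # Distances between each member and the observation.
    to_obs = np.linalg.norm(members - y, axis=1) ** alpha
    # Distances between all pairs of members (including identical pairs).
    pairwise = np.linalg.norm(
        members[:, None, :] - members[None, :, :], axis=2
    ) ** alpha
    return to_obs.mean() - 0.5 * pairwise.mean()

# d = 1 and alpha = 1: same value as the plug-in CRPS of the ensemble {0, 1}.
es_univariate = energy_score([[0.0], [1.0]], [0.5], alpha=1.0)

# Limiting case alpha = 2: squared error of the ensemble mean (1, 0) vs (1, 1).
es_limit = energy_score([[0.0, 0.0], [2.0, 0.0]], [1.0, 1.0], alpha=2.0)
```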
To overcome the discriminatory limitation of the ES, Scheuerer and Hamill (2015) proposed the variogram score (VS), a score targeting the verification of the dependence structure. The VS of order $p$ is defined as
$$\mathrm{VS}_p(F, \boldsymbol{y}) = \sum_{i,j=1}^{d} w_{ij} \left( \mathbb{E}_F\left[|X_i - X_j|^p\right] - |y_i - y_j|^p \right)^2, \tag{14}$$
where $X_i$ is the $i$-th component of the random vector $\boldsymbol{X}$ following $F$, $w_{ij}$ are nonnegative weights and $p > 0$. The variogram score capitalizes on the variogram, used in spatial statistics to assess the dependence structure. The VS cannot detect an equal bias across all components. The VS of order $p$ is proper relative to the class of Borel probability measures on $\mathbb{R}^d$ such that the $2p$-th moments of all univariate margins are finite. The weights $w_{ij}$ can be selected to emphasize or depreciate certain pair interactions. For example, in a spatial context, it can be expected that the dependence between pairs decays with the distance: choosing the weights proportional to the inverse of the distance between locations can increase the signal-to-noise ratio and improve the discriminatory power of the VS (Scheuerer and Hamill, 2015).
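The variogram score can be sketched for ensembles by estimating each expected pairwise difference with an average over members and then aggregating the squared errors with the weights. The function name, the one-dimensional coordinates and the synthetic ensemble are assumptions; the weights follow the inverse-distance choice mentioned above, with zero weight on the diagonal.

```python
import numpy as np

def variogram_score(members, y, weights=None, p=0.5):
    """Weighted sum over pairs (i, j) of (E|X_i - X_j|^p - |y_i - y_j|^p)^2."""
    members = np.asarray(members, dtype=float)   # shape (n_members, d)
    y = np.asarray(y, dtype=float)               # shape (d,)
    d = y.shape[0]
    if weights is None:
        weights = np.ones((d, d))
    # Ensemble estimate of E|X_i - X_j|^p for every pair of components.
    forecast_vario = np.mean(
        np.abs(members[:, :, None] - members[:, None, :]) ** p, axis=0
    )
    observed_vario = np.abs(y[:, None] - y[None, :]) ** p
    return np.sum(weights * (forecast_vario - observed_vario) ** 2)

# Inverse-distance weights for three locations on a line (assumed coordinates).
coords = np.array([0.0, 1.0, 3.0])
dist = np.abs(coords[:, None] - coords[None, :])
weights = np.zeros_like(dist)
off_diag = dist > 0
weights[off_diag] = 1.0 / dist[off_diag]

rng = np.random.default_rng(3)
members = rng.normal(size=(20, 3))
y = rng.normal(size=3)

vs = variogram_score(members, y, weights)
# Adding the same bias to every component leaves the score unchanged.
vs_biased = variogram_score(members + 5.0, y, weights)
```

The last two values coincide because pairwise differences cancel any bias common to all components, which is the limitation of the VS noted in the text.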
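When the pdf $f$ of the probabilistic forecast $F$ is available, multivariate versions of the univariate scoring rules based on the pdf are available. The multivariate versions of the scoring rules have the same properties and limitations as their univariate counterpart. The logarithmic score (11) has a natural multivariate version:
$$\mathrm{LogS}(F, \boldsymbol{y}) = -\log(f(\boldsymbol{y})),$$
for $\boldsymbol{y}$ such that $f(\boldsymbol{y}) > 0$. The logarithmic score is strictly proper relative to the class $\mathcal{L}_1(\mathbb{R}^d)$.
The Hyvärinen score (HS; Hyvärinen 2005) was initially proposed in its multivariate form
$$\mathrm{HS}(F, \boldsymbol{y}) = 2\frac{\Delta f(\boldsymbol{y})}{f(\boldsymbol{y})} - \frac{\|\nabla f(\boldsymbol{y})\|^2}{f(\boldsymbol{y})^2},$$
for $\boldsymbol{y}$ such that $f(\boldsymbol{y}) > 0$, where $\Delta$ is the Laplace operator (i.e., the sum of the second-order partial derivatives) and $\nabla$ is the gradient operator (i.e., vector of the first-order partial derivatives). In the multivariate setting, the HS can also be computed if the predicted pdf is known up to a normalizing constant. The HS is proper relative to the subclass of $\mathcal{L}_1(\mathbb{R}^d)$ such that the density $f$ exists, is twice continuously differentiable and satisfies $\|\nabla f(\boldsymbol{x})\|/f(\boldsymbol{x}) \to 0$ as $\|\boldsymbol{x}\| \to \infty$.
The quadratic score and pseudospherical score are directly suited to the multivariate setting:
$$\mathrm{QuadS}(F, \boldsymbol{y}) = \|f\|_2^2 - 2f(\boldsymbol{y}), \qquad \mathrm{PseudoS}(F, \boldsymbol{y}) = -\frac{f(\boldsymbol{y})^{\alpha - 1}}{\|f\|_\alpha^{\alpha - 1}},$$
where $\|f\|_\alpha = \left(\int_{\mathbb{R}^d} f(\boldsymbol{x})^\alpha\, \mathrm{d}\boldsymbol{x}\right)^{1/\alpha}$. The quadratic score is strictly proper relative to the class $\mathcal{L}_2(\mathbb{R}^d)$. The pseudospherical score is strictly proper relative to the class $\mathcal{L}_\alpha(\mathbb{R}^d)$.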
Additionally, other multivariate scoring rules have been proposed among which the marginal-copula score (Ziel and Berk, 2019) or wavelet-based scoring rules (see, e.g., Buschow et al. 2019). These scoring rules will be briefly mentioned in Section 4 in light of the proper scoring rule construction framework proposed in this article. Appendix B provides formulas for the expected multivariate scoring rules presented above.
2.4 Spatial verification tools
Spatial forecasts are a very important group of multivariate forecasts as they are involved in various applications (e.g., weather or renewable energy forecasting). Spatial fields are often characterized by high dimensionality and potentially strong correlations between neighboring locations. These characteristics make the verification of spatial forecasts very demanding, for example when it comes to discriminating misspecified dependence structures. In the case of spatial forecasts, it is known that traditional verification methods (e.g., gridpoint-by-gridpoint verification) may result in a double penalty. The double-penalty effect was pinpointed in Ebert (2008) and refers to the fact that if a forecast presents a spatial (or temporal) shift with respect to observations, the error is penalized twice: once where the event was observed and again where the forecast predicted it. In particular, high-resolution forecasts are penalized more than less realistic blurry forecasts. The double-penalty effect may also affect spatio-temporal forecasts in general.
In parallel with the development of scoring rules, various application-focused spatial verification methods have been developed to evaluate weather forecasts. The efforts toward improving spatial verification methods have been guided by two projects: the intercomparison project (ICP; Gilleland et al. 2009) and its second phase, called Mesoscale Verification Intercomparison over Complex Terrain (MesoVICT; Dorninger et al. 2018). These projects resulted in the comparison of spatial verification methods with a particular focus on understanding their limitations and clarifying their interpretability. Only a few links exist between the approaches studied in these projects (and the work they induced) and the proper scoring rules framework. In particular, Casati et al. (2022) noted "a lack of representation of novel spatial verification methods for ensemble prediction systems". In general, there is a clear lack of methods focusing on the spatial verification of probabilistic forecasts. To help bridge the gap between the two communities, we recall the approaches of spatial verification tools in light of the scoring rule framework introduced above.
One of the goals of the ICP was to provide insights on how to develop methods robust to the double-penalty effect. In particular, Gilleland et al. (2009) proposed a classification of spatial verification tools, later updated in Dorninger et al. (2018), resulting in a five-category classification. The classes differ in the computing principle they rely on. Not all spatial verification tools mentioned in these studies can be applied to probabilistic forecasts; some of them can only be applied to deterministic forecasts. In the following description of the classes, we try to focus on methods suited to probabilistic forecasts or at least the special case of ensemble forecasts.
Neighborhood-based methods consist of applying a smoothing filter to the forecast and observation fields to prevent the double-penalty effect. The smoothing filter can take various forms (e.g., a minimum, a maximum, a mean, or a Gaussian filter) and be applied over a given neighborhood. For example, Stein and Stoop (2022) proposed a neighborhood-based CRPS for ensemble forecasts gathering forecasts and observations made within the neighborhood of the location considered. The use of a neighborhood prevents the double-penalty effect from taking place at scales smaller than that of the neighborhood. In this general definition, neighborhood-based methods can lead to proper scoring rules, in particular, see the notion of patches in Section 4.
Scale-separation techniques denote methods for which the verification is obtained after comparing forecast and observation fields across different scales. The scale-separation process can be seen as several single-bandpass spatial filters (e.g., projection onto a wavelet basis, as in wavelet-based scoring rules; Buschow et al. 2019). However, in order to obtain proper scoring rules, the comparison of the scale-specific characteristics needs to be performed using a proper scoring rule. Section 4 provides a discussion on wavelet-based scoring rules and their propriety.
Object-based methods rely on the identification of objects of interest and the comparison of the objects obtained in the forecast and observation fields. Object identification is application-dependent and can take the form of objects that forecasters are familiar with (e.g., storm cells for precipitation forecasts). A well-known verification tool within this class is the structure-amplitude-location (SAL; Wernli et al. 2008) method which has been generalized to ensemble forecasts in Radanovics et al. (2018). The three components of the ensemble SAL do not lead to proper scoring rules. They rely on the mean of the forecast within scoring functions inconsistent with the mean. Thus, the ideal forecast does not minimize the expected value. Nonetheless, the three components of the SAL method could be adapted to use proper scoring rules sensitive to the misspecification of the same features.
Field-deformation techniques consist of deforming the forecast field into the observation field (the similarity between the fields can be ensured by a metric of interest). The distortion field associated with the morphing of the forecast field into the observation field becomes a measure of the predictive performance of the forecast (see, e.g., Han and Szunyogh 2018).
Distance measures are computed between binary images of the forecast and observation fields, such as exceedances of a threshold of interest. These methods are inspired by developments in image processing (e.g., Baddeley’s delta measure; Gilleland 2011).
These five categories are partially overlapping as it can be argued that some methods belong to multiple categories (e.g., some distance measures techniques can be seen as a mix of field-deformation and object-based). They define different principles that can be used to build verification tools that are not subject to the double-penalty effect. The reader may refer to Dorninger et al. (2018) and references therein for details on the classification and the spatial verification methods not used thereafter. The frontier between the aforementioned spatial verification methods and the proper scoring rules framework is porous with, for example, wavelet-based scoring rules belonging to both. It appears that numerous spatial verification methods seek interpretability and we believe that this is not incompatible with the use of proper scoring rules. We propose the following framework to facilitate the construction of interpretable proper scoring rules.
3 A framework for interpretable proper scoring rules
We define a framework to design proper scoring rules for multivariate forecasts. Its definition is motivated by remarks on the multivariate forecasts literature and operational use. There seems to be a growing consensus around the fact that no single verification method has it all (see, e.g., Bjerregård et al. 2021). Most of the studies comparing forecast verification methods highlight that verification procedures should not be reduced to the use of a single method and that each procedure needs to be well suited to the context (see, e.g., Scheuerer and Hamill 2015; Thorarinsdottir and Schuhen 2018). Moreover, from a more theoretical point of view, (strict) propriety does not ensure discrimination ability and different (strictly) proper scoring rules can lead to different rankings of sub-efficient forecasts.
Standard verification procedures gradually increase the complexity of the quantities verified. Procedures often start by verifying simple quantities such as quantiles, mean, or binary events (e.g., prediction of dry/wet events for precipitation). If multiple forecasts have a satisfying performance for these quantities, marginal distributions of the multivariate forecast can be verified using univariate scoring rules. Finally, multivariate-related quantities, such as the dependence structure, can be verified through multivariate scoring rules. Forecasters rely on multiple verification methods to evaluate a forecast and ideally, the verification method should be interpretable by targeting specific aspects of the distribution or thanks to the forecaster’s experience. This type of verification procedure allows the forecaster to understand what characterizes the predictive performance of a forecast instead of directly looking at a strictly proper scoring rule giving an encapsulated summary of the predictive performance.
Various multivariate forecast calibration methods rely on the calibration of univariate quantities obtained by dimension reduction techniques. As the general principle of multivariate calibration leans on studying the calibration of quantities obtained by pre-rank functions, Allen et al. (2024) argue that calibration procedures should not rely on a single pre-rank function and should instead use multiple simple pre-rank functions and leverage the interpretability of the PIT/rank histograms associated. A similar principle can be applied to increase the interpretability of verification methods based on scoring rules.
As general multivariate strictly proper scoring rules fail to discriminate forecasts with respect to arbitrary misspecifications and may lead to different rankings of sub-efficient forecasts, multivariate verification could benefit from using multiple proper scoring rules targeting specific aspects of the forecasts. Thereby, forecasters know which aspects of the observations are well-predicted by the forecast and can update their forecast or select the best forecast among others in the light of this better understanding of the forecast. To facilitate the construction of interpretable proper scoring rules, we define a framework based on two principles: transformation and aggregation.
The transformation principle consists of transforming both forecast and observation before applying a scoring rule. Heinrich-Mertsching et al. (2021) introduced this general principle in the context of point processes. In particular, they present scoring rules based on summary statistics targeting the clustering behavior or the intensity of the processes. In a more general context, the use of transformations has been scattered across the literature for several years (see Section 4). Proposition 1 shows how transformations can be used to construct proper scoring rules.
Proposition 1.
Let $\mathcal{F}$ be a class of Borel probability measures on $\mathbb{R}^d$ and let $F \in \mathcal{F}$ be a forecast and $\boldsymbol{y} \in \mathbb{R}^d$ an observation. Let $T: \mathbb{R}^d \to \mathbb{R}^k$ be a transformation and let $S$ be a scoring rule on $\mathbb{R}^k$ that is proper relative to $T(\mathcal{F}) = \{T(F) : F \in \mathcal{F}\}$, where $T(F)$ denotes the distribution of $T(\boldsymbol{X})$ for $\boldsymbol{X} \sim F$. Then, the scoring rule
$$S_T(F, \boldsymbol{y}) = S(T(F), T(\boldsymbol{y}))$$
is proper relative to $\mathcal{F}$. If $S$ is strictly proper relative to $T(\mathcal{F})$ and $T$ is injective, then the resulting scoring rule $S_T$ is strictly proper relative to $\mathcal{F}$.
To gain interpretability, it is natural to have dimension-reducing transformations (i.e., $k < d$), which generally leads to $T$ not being injective and $S_T$ not being strictly proper. Nonetheless, as expressed previously, interpretability is important and it can mostly be leveraged if the transformation simplifies the multivariate quantities. Particularly, it is generally preferred to choose $k = 1$ to make the quantity easier to interpret and focus on specific information contained in the forecast or the observation. Straightforward transformations can be projections on a $k$-dimensional margin or a summary statistic relevant to the forecast type such as the total over a domain in the case of precipitation. Simple transformations may be preferred for their interpretability and their potential lack of discriminatory power can be made up for via the use of multiple simpler transformations. Numerous examples of transformations are presented, discussed, and linked to the literature in Section 4. The proof of Proposition 1 is provided in Appendix C.1.
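The transformation principle can be sketched for an ensemble forecast: the transformation is applied to every member and to the observation, and a univariate scoring rule is then evaluated on the transformed values. The sketch below is illustrative only. The helper names `crps_ensemble` and `transformed_score` are hypothetical, the CRPS is computed with the usual ensemble estimator, and the domain total is used as an example of a non-injective transformation.

```python
import numpy as np

def crps_ensemble(x, y):
    """CRPS of a univariate ensemble x against a scalar observation y."""
    x = np.asarray(x, dtype=float)
    spread = np.mean(np.abs(x[:, None] - x[None, :]))
    return float(np.mean(np.abs(x - y)) - 0.5 * spread)

def transformed_score(score, transform, ens, y):
    """S_T(F, y) = S(T(F), T(y)), with F represented by ensemble members."""
    t_ens = np.array([transform(member) for member in ens])
    return score(t_ens, transform(y))

rng = np.random.default_rng(1)
ens = rng.gamma(2.0, 1.0, size=(20, 10))  # 20 members, 10 grid points
y = rng.gamma(2.0, 1.0, size=10)

# CRPS of the total over the domain (k = 1, not injective).
score_total = transformed_score(crps_ensemble, np.sum, ens, y)
# Permuting the observed field does not change its total, hence not the score.
score_permuted = transformed_score(crps_ensemble, np.sum, ens, y[::-1])
```

The equality of `score_total` and `score_permuted` (up to rounding) shows concretely why a dimension-reducing transformation yields a proper but not strictly proper scoring rule: distinct fields sharing the same total cannot be told apart.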
The second principle is the aggregation of scoring rules. Aggregation can be used on scoring rules in order to combine them and obtain a single scoring rule summarizing the evaluation. It can be used to operate on scoring rules acting on different spaces, times or locations. Note that Dawid and Musio (2014) introduced the notion of composite score which is related to the aggregation principle but is closer to the combined application of both principles. Proposition 2 presents a general aggregation principle to build proper scoring rules. This principle has been known since proper scoring rules have been introduced.
Proposition 2.
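Let $\{S_i\}_{1 \le i \le m}$ be a set of proper scoring rules relative to $\mathcal{F}$. Let $\{w_i\}_{1 \le i \le m}$ be nonnegative weights. Then, the scoring rule
$$S_{\boldsymbol{w}}(F, \boldsymbol{y}) = \sum_{i=1}^{m} w_i S_i(F, \boldsymbol{y})$$
is proper relative to $\mathcal{F}$. If at least one scoring rule $S_i$ is strictly proper relative to $\mathcal{F}$ and $w_i > 0$, the aggregated scoring rule is strictly proper relative to $\mathcal{F}$.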
It is worth noting that Proposition 2 does not specify any strict condition for the scoring rules used. For example, the scoring rules aggregated do not need to be the same or do not need to be expressed in the same units. Aggregated scoring rules can be used to summarize the evaluation of univariate probabilistic forecasts (e.g., aggregation of CRPS at different locations) or to summarize complementary scoring rules (e.g., aggregation of Brier score and a threshold-weighted CRPS). Unless stated otherwise, for simplicity, we will restrict ourselves to cases where the aggregated scoring rules are of the same type. Bolin and Wallin (2023) showed that the aggregation of scoring rules can lead to unintuitive behaviors. For the aggregation of univariate scoring rules, they showed that scoring rules do not necessarily have the same dependence on the scale of the forecasted phenomenon: this leads to scoring rules putting more (or less) emphasis on the forecasts with larger scales. They define and propose local scale-invariant scoring rules to make scale-agnostic scoring rules. When performing aggregation, it is important to be aware of potential preferences or biases of the scoring rules.
We only consider aggregation of proper scoring rules through a weighted sum. To conserve (strict) propriety of scoring rules, aggregations can take, more generally, the form of (strictly) isotonic transformations, such as a multiplicative structure when positive scoring rules are considered (Ziel and Berk, 2019).
The two principles of Proposition 1 and Proposition 2 can be used simultaneously to create proper scoring rules based on both transformations and aggregation as presented in Corollary 1.
Corollary 1.
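Let $\{T_i\}_{1 \le i \le m}$ be a set of transformations from $\mathbb{R}^d$ to $\mathbb{R}^k$. Let $\{S_i\}_{1 \le i \le m}$ be a set of proper scoring rules where $S_i$ is proper relative to $T_i(\mathcal{F})$, for all $1 \le i \le m$. Let $\{w_i\}_{1 \le i \le m}$ be nonnegative weights. Then, the scoring rule
$$S(F, \boldsymbol{y}) = \sum_{i=1}^{m} w_i S_i(T_i(F), T_i(\boldsymbol{y}))$$
is proper relative to $\mathcal{F}$.
Strict propriety relative to $\mathcal{F}$ of the resulting scoring rule is obtained as soon as there exists $1 \le i \le m$ such that $S_i$ is strictly proper relative to $T_i(\mathcal{F})$, $T_i$ is injective and $w_i > 0$. The result of Corollary 1 can be extended to transformations with images in different dimensions and paired with different scoring rules (see Appendix D).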
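The combination of both principles amounts to a weighted sum of transformation-based scores. A minimal sketch for ensemble forecasts follows. The names `se_ensemble`, `crps_ensemble` and `aggregated_score` are hypothetical, and the choice of components (squared error of the field mean, CRPS of the field maximum) and weights is arbitrary and only meant to show the mechanics.

```python
import numpy as np

def se_ensemble(x, y):
    """Squared error between the ensemble mean and the observation."""
    return float((np.mean(x) - y) ** 2)

def crps_ensemble(x, y):
    """CRPS of a univariate ensemble x against a scalar observation y."""
    x = np.asarray(x, dtype=float)
    spread = np.mean(np.abs(x[:, None] - x[None, :]))
    return float(np.mean(np.abs(x - y)) - 0.5 * spread)

def aggregated_score(components, ens, y):
    """Weighted sum of S_i(T_i(F), T_i(y)) over (weight, transform, score) triples."""
    total = 0.0
    for weight, transform, score in components:
        if weight < 0:
            raise ValueError("weights must be nonnegative to preserve propriety")
        t_ens = np.array([transform(member) for member in ens])
        total += weight * score(t_ens, transform(y))
    return total

rng = np.random.default_rng(2)
ens = rng.normal(size=(25, 8))
y = rng.normal(size=8)
components = [(1.0, np.mean, se_ensemble), (0.5, np.max, crps_ensemble)]
score = aggregated_score(components, ens, y)
```

Keeping the components as explicit triples makes it easy to report each term separately before summing, which preserves the interpretability that motivates the framework.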
As we will see in the examples developed in the following section, numerous scoring rules used in the literature are based on these two principles of aggregation and transformation.
Decomposition of kernel scoring rules.
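We briefly discuss the link between the transformation and aggregation principles for scoring rules and the specific class of kernel scoring rules. A kernel on $\mathbb{R}^d$ is a measurable function $\rho: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfying the following two properties:
- (symmetry) $\rho(\boldsymbol{x}_1, \boldsymbol{x}_2) = \rho(\boldsymbol{x}_2, \boldsymbol{x}_1)$ for all $\boldsymbol{x}_1, \boldsymbol{x}_2 \in \mathbb{R}^d$;
- (non-negativity) $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j \rho(\boldsymbol{x}_i, \boldsymbol{x}_j) \ge 0$ for all $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n \in \mathbb{R}^d$ and $a_1, \ldots, a_n \in \mathbb{R}$, for all $n \in \mathbb{N}$.
The kernel scoring rule $S_\rho$ associated with the kernel $\rho$ is defined on the space of predictive distributions
$$\mathcal{P}_\rho = \left\{ F \in \mathcal{P}(\mathbb{R}^d) : \int_{\mathbb{R}^d} \sqrt{\rho(\boldsymbol{x}, \boldsymbol{x})} \, F(\mathrm{d}\boldsymbol{x}) < +\infty \right\}$$
by
$$S_\rho(F, \boldsymbol{y}) = \frac{1}{2} \mathbb{E}_F\left[\rho(\boldsymbol{X}, \boldsymbol{X}')\right] - \mathbb{E}_F\left[\rho(\boldsymbol{X}, \boldsymbol{y})\right] + \frac{1}{2} \rho(\boldsymbol{y}, \boldsymbol{y}), \tag{15}$$
where $\boldsymbol{X}$ and $\boldsymbol{X}'$ are independent random variables following $F$. Importantly, $S_\rho$ is proper on $\mathcal{P}_\rho$ and, for an ensemble forecast with members $\boldsymbol{x}_1, \ldots, \boldsymbol{x}_M$, it takes the simple form
$$S_\rho(F, \boldsymbol{y}) = \frac{1}{2M^2} \sum_{i=1}^{M} \sum_{j=1}^{M} \rho(\boldsymbol{x}_i, \boldsymbol{x}_j) - \frac{1}{M} \sum_{i=1}^{M} \rho(\boldsymbol{x}_i, \boldsymbol{y}) + \frac{1}{2} \rho(\boldsymbol{y}, \boldsymbol{y}), \tag{16}$$
making kernel scoring rules particularly useful for ensemble forecasts.
The CRPS is surely the most widely used kernel scoring rule. Equation (6) shows that it is a kernel scoring rule associated with the kernel $\rho(x_1, x_2) = |x_1| + |x_2| - |x_1 - x_2|$ (the function $(x_1, x_2) \mapsto |x_1 - x_2|$ is conditionally semi-definite negative so that $\rho$ is non-negative). For more details on kernel scoring rules, the reader should refer to Gneiting et al. (2005) or Steinwart and Ziegel (2021).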
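The ensemble form of a kernel scoring rule is straightforward to evaluate. The sketch below assumes the convention $S_\rho(F, y) = \tfrac{1}{2}\mathbb{E}[\rho(X, X')] - \mathbb{E}[\rho(X, y)] + \tfrac{1}{2}\rho(y, y)$ and the kernel $\rho(x_1, x_2) = |x_1| + |x_2| - |x_1 - x_2|$ for the CRPS; both are assumptions of this example, as are the function names. It checks numerically that the kernel form coincides with the usual ensemble CRPS estimator in dimension one.

```python
import numpy as np

def kernel_score(ens, y, rho):
    """Kernel scoring rule for a univariate ensemble (assumed convention, see lead-in)."""
    ens = np.asarray(ens, dtype=float)
    k_xx = rho(ens[:, None], ens[None, :])  # (M, M) matrix of rho(x_i, x_j)
    k_xy = rho(ens, y)                      # (M,) vector of rho(x_i, y)
    return float(0.5 * np.mean(k_xx) - np.mean(k_xy) + 0.5 * rho(y, y))

def rho_crps(a, b):
    """Kernel assumed here to generate the CRPS."""
    return np.abs(a) + np.abs(b) - np.abs(a - b)

def crps_ensemble(x, y):
    """Usual ensemble estimator of the CRPS."""
    x = np.asarray(x, dtype=float)
    spread = np.mean(np.abs(x[:, None] - x[None, :]))
    return float(np.mean(np.abs(x - y)) - 0.5 * spread)

rng = np.random.default_rng(3)
ens = rng.normal(size=40)
y = 0.7
kernel_form = kernel_score(ens, y, rho_crps)
direct_form = crps_ensemble(ens, y)
```

Under these assumptions `kernel_form` and `direct_form` agree up to rounding, because the $|x_i|$ and $|y|$ terms introduced by the kernel cancel in the three-term sum.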
The following proposition reveals that a kernel scoring rule can always be expressed as an aggregation of squared errors (SEs) between transformations of the forecast-observation pair.
Proposition 3.
Let $S_\rho$ be the kernel scoring rule associated with the kernel $\rho$. Then there exists a sequence of transformations $T_l: \mathbb{R}^d \to \mathbb{R}$, $l \ge 1$, such that
$$S_\rho(F, \boldsymbol{y}) = \frac{1}{2} \sum_{l \ge 1} \mathrm{SE}(T_l(F), T_l(\boldsymbol{y})),$$
for all predictive distributions $F \in \mathcal{P}_\rho$ and observations $\boldsymbol{y} \in \mathbb{R}^d$.
In particular, the series on the right-hand side is always finite. The proof is provided in Appendix C.2 and relies on the reproducing kernel Hilbert space (RKHS) representation of kernel scoring rules. In particular, we will see that the sequence $(T_l)_{l \ge 1}$ can be chosen as an orthonormal basis of the RKHS associated with the kernel $\rho$.
This representation of kernel scoring rules can be useful to understand more deeply the comparison of the predictive forecast $F$ and observation $\boldsymbol{y}$. While the definition (15) is quite abstract, the series representation can be rewritten
$$S_\rho(F, \boldsymbol{y}) = \frac{1}{2} \sum_{l \ge 1} \left( \mathbb{E}_F\left[T_l(\boldsymbol{X})\right] - T_l(\boldsymbol{y}) \right)^2,$$
with $\boldsymbol{X}$ a random variable following $F$. In other words, for $l \ge 1$, the observed value $T_l(\boldsymbol{y})$ is compared to the predicted value $T_l(\boldsymbol{X})$ under the predictive distribution $F$ using the SE; then all these contributions are aggregated in a series forming the kernel scoring rule.
To give more intuition, we study two important cases in dimension . The details of the computations are provided in Appendix C.3. For the Gaussian kernel scoring rule associated with the kernel
some computations yield the series representation
so that this score compares the probabilistic forecast and the observation through the transforms
For the CRPS, a possible series representation is obtained thanks to the following wavelet basis of functions: let (plateau function) and (triangle function) and consider the collection of functions
where is a position parameter and a scale parameter. Then, the CRPS can be written as
We can see that the CRPS compares forecast and observation through the SE after applying the plateau and triangle transformations for multiple positions and scales and then aggregates all the contributions.
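The series representation can be checked numerically in the Gaussian case. The sketch below makes several assumptions that may differ from the article's parameterization: a unit-bandwidth kernel $\rho(x_1, x_2) = \exp(-(x_1 - x_2)^2 / 2)$ in dimension one, the convention $S_\rho = \tfrac{1}{2}\mathbb{E}[\rho(X, X')] - \mathbb{E}[\rho(X, y)] + \tfrac{1}{2}\rho(y, y)$, and the transforms $T_l(x) = x^l e^{-x^2/2} / \sqrt{l!}$ for $l \ge 0$, which follow from expanding $e^{x_1 x_2}$ as a power series. The series is truncated at `n_terms`, a hypothetical parameter.

```python
import math
import numpy as np

def gaussian_kernel(a, b):
    """Unit-bandwidth Gaussian kernel (assumed parameterization)."""
    return np.exp(-0.5 * (a - b) ** 2)

def features(x, n_terms):
    """T_l(x) = x**l * exp(-x**2 / 2) / sqrt(l!) for l = 0, ..., n_terms - 1."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    ls = np.arange(n_terms)
    norm = np.sqrt([float(math.factorial(int(l))) for l in ls])
    return x[:, None] ** ls[None, :] * np.exp(-0.5 * x[:, None] ** 2) / norm

rng = np.random.default_rng(4)
ens = rng.uniform(-2.0, 2.0, size=30)  # bounded members keep the truncation error tiny
y = 0.5
n_terms = 60

# Kernel form: 0.5 * mean rho(x_i, x_j) - mean rho(x_i, y) + 0.5 * rho(y, y).
kernel_form = (0.5 * np.mean(gaussian_kernel(ens[:, None], ens[None, :]))
               - np.mean(gaussian_kernel(ens, y)) + 0.5)

# Series form: half the sum of squared errors between transformed forecast and observation.
series_form = 0.5 * np.sum((features(ens, n_terms).mean(axis=0) - features(y, n_terms)[0]) ** 2)
```

With these choices the two quantities agree to numerical precision, which illustrates how a kernel score aggregates squared errors of transformed forecast-observation pairs.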
4 Applications of the transformation and aggregation principles
4.1 Projections
Certainly, the most direct type of transformation is the projection of forecasts and observations onto their $1$-dimensional marginals. We denote by $T_i$ the projection on the $i$-th component such that $T_i(\boldsymbol{x}) = x_i$, for all $\boldsymbol{x} \in \mathbb{R}^d$. This allows the forecaster to assess the predictive performance of a forecast for a specific univariate marginal independently of the other variables. If $S$ is a univariate scoring rule proper relative to $T_i(\mathcal{F})$, then Proposition 1 leads to $S_{T_i}$ being proper relative to $\mathcal{F}$. This "new" scoring rule can be useful if a given marginal is of particular interest (e.g., location of high interest in a spatial forecast). However, it can be more interesting to aggregate such scoring rules across all $1$-dimensional marginals. This leads to the following scoring rule
$$S_{\boldsymbol{w}}(F, \boldsymbol{y}) = \sum_{i=1}^{d} w_i S_{T_i}(F, \boldsymbol{y}),$$
where $S_{T_i}(F, \boldsymbol{y})$ is $S(T_i(F), T_i(\boldsymbol{y}))$. This setting is popular for assessing the performance of multivariate forecasts and we briefly present examples from the literature falling under this setting. Aggregation of CRPS (6) across locations and/or lead times is common practice for plots or comparison tables with uniform weights (Gneiting et al., 2005; Taillardat et al., 2016; Rasp and Lerch, 2018; Schulz and Lerch, 2022; Lerch and Polsterer, 2022; Hu et al., 2023) or with more complex schemes such as weights proportional to the cosine of the latitude (Ben Bouallègue et al., 2024b). The SE (2) and AE (3) can be aggregated to obtain RMSE and MAE, respectively (Delle Monache et al., 2013; Gneiting et al., 2005; Lerch and Polsterer, 2022; Pathak et al., 2022). Bremnes (2019) aggregated QSs (4) across stations and different quantile levels of interest with uniform weights. Note that the multivariate SE (12) can be rewritten as the sum of univariate SE across $1$-marginals: $\mathrm{SE}(F, \boldsymbol{y}) = \sum_{i=1}^{d} \mathrm{SE}(T_i(F), T_i(\boldsymbol{y}))$.
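Aggregating a univariate score across one-dimensional marginals can be sketched as follows for an ensemble on a small set of locations. The function names, the five latitudes and the normalization of the weights to sum to one are assumptions made for the example; the cosine-of-latitude weighting is only one of the schemes mentioned above.

```python
import numpy as np

def marginal_crps(ens, y):
    """CRPS at each location; ens has shape (M, d), y has shape (d,)."""
    ens = np.asarray(ens, dtype=float)
    y = np.asarray(y, dtype=float)
    spread = np.mean(np.abs(ens[:, None, :] - ens[None, :, :]), axis=(0, 1))
    return np.mean(np.abs(ens - y), axis=0) - 0.5 * spread

def aggregated_crps(ens, y, weights):
    """Weighted sum of the marginal CRPS values with nonnegative weights."""
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be nonnegative")
    return float(np.sum(weights * marginal_crps(ens, y)))

rng = np.random.default_rng(5)
ens = rng.normal(size=(30, 5))
y = rng.normal(size=5)

lat = np.array([-60.0, -30.0, 0.0, 30.0, 60.0])  # hypothetical latitudes in degrees
w_coslat = np.cos(np.deg2rad(lat))
w_coslat = w_coslat / w_coslat.sum()

score_uniform = aggregated_crps(ens, y, np.full(5, 0.2))
score_coslat = aggregated_crps(ens, y, w_coslat)
```

Because each marginal score ignores the other components, such an aggregate says nothing about the dependence structure, which is what the pairwise and patch-based constructions address.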
The second simplest choice is the $2$-dimensional case, which allows one to focus on pair dependency. We denote by $T_{(i,j)}$ the projection on the $i$-th and $j$-th components (i.e., the $(i,j)$ pair of components) such that $T_{(i,j)}(\boldsymbol{x}) = (x_i, x_j)$. In this setting, $S$ has to be a bivariate proper scoring rule to construct a proper scoring rule $S_{T_{(i,j)}}$. The aggregation of such scoring rules becomes
$$S_{\boldsymbol{w}}(F, \boldsymbol{y}) = \sum_{1 \le i < j \le d} w_{ij} S_{T_{(i,j)}}(F, \boldsymbol{y}).$$
As suggested in Scheuerer and Hamill (2015) for the VS (14), the weights $w_{ij}$ can be chosen appropriately to optimize the signal-to-noise ratio. For example, in a spatial setting where the dependence between locations is believed to decrease with the distance separating them, the weights can be chosen to be proportional to the inverse of the distance. This bivariate setting is less used in the literature; we present two articles using or mentioning scoring rules within this scope. In a general multivariate setting, Ziel and Berk (2019) suggest the use of a marginal-copula scoring rule where the copula score is the bivariate copula energy score (i.e., the aggregation of the energy scores across all the regularized pairs). To focus on the verification of the temporal dependence of spatio-temporal forecasts, Ben Bouallègue et al. (2024b) use the bivariate energy score over consecutive lead times.
In a more general setup, we consider projections on $k$-dimensional marginals. In order to reduce the number of transformation-based scores to aggregate, it is standard to focus on localized marginals (e.g., belonging to patches of a given spatial size). Denote by $\mathcal{P}$ a set of valid patches (for some criterion or of a given size) and by $\{S_P\}_{P \in \mathcal{P}}$ the set of transformation-based scores associated with the projections on the patches $P \in \mathcal{P}$. Given a multivariate scoring rule $S$ that is proper relative to the projected classes, we can construct the following aggregated score:
$$S_{\mathcal{P}, \boldsymbol{w}}(F, \boldsymbol{y}) = \sum_{P \in \mathcal{P}} w_P S_P(F, \boldsymbol{y}).$$
This construction can be used to create a scoring rule only considering the dependence of localized components, given that the patches are defined in that sense. The use of patches has similar benefits to the weighting of pairs given a belief on their correlations: it yields a better signal-to-noise ratio and improves the discrimination of the resulting scoring rule. For example, Pacchiardi et al. (2024) introduced patched energy scores as scoring rules to minimize in order to train a generative neural network. The patched energy scores rely on the energy score evaluated on square patches spaced by a given stride. Even though spatial patches may be more intuitive, it is possible to use temporal or spatio-temporal patches. Patch-based scoring rules appear as a natural member of the neighborhood-based methods of the spatial verification classification mentioned in Section 2.4. Given that the patches are correctly chosen (e.g., of a size appropriate to the problem at hand), patch-based scoring rules are not subject to the double-penalty effect.
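A patch-based aggregation of the energy score can be sketched for gridded ensemble forecasts. This is an illustrative sketch and not the definition used by Pacchiardi et al. (2024): the function names, the uniform weights (a plain average over patches) and the handling of grid borders (patches that do not fit are skipped) are assumptions of the example.

```python
import numpy as np

def energy_score(ens, y):
    """Ensemble energy score; ens has shape (M, k), y has shape (k,)."""
    ens = np.asarray(ens, dtype=float)
    y = np.asarray(y, dtype=float)
    to_obs = np.mean(np.linalg.norm(ens - y, axis=1))
    spread = np.mean(np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=2))
    return float(to_obs - 0.5 * spread)

def patched_energy_score(ens, y, size, stride):
    """Average energy score over square patches; ens is (M, H, W), y is (H, W)."""
    n_members, height, width = ens.shape
    scores = []
    for r in range(0, height - size + 1, stride):
        for c in range(0, width - size + 1, stride):
            # Projection of every member and of the observation onto the patch.
            patch_ens = ens[:, r:r + size, c:c + size].reshape(n_members, -1)
            patch_obs = y[r:r + size, c:c + size].ravel()
            scores.append(energy_score(patch_ens, patch_obs))
    return float(np.mean(scores))

rng = np.random.default_rng(6)
ens = rng.normal(size=(10, 6, 6))
y = rng.normal(size=(6, 6))
score_patches = patched_energy_score(ens, y, size=2, stride=2)
score_full = patched_energy_score(ens, y, size=6, stride=1)  # single patch covering the grid
```

With `size` equal to the grid size, the construction reduces to the energy score of the whole field; smaller patches restrict the comparison to local dependence.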
As shown by the small number of examples available in the literature, aggregation (and plain use) of scoring rules based on projections in dimension $k > 1$ is not standard practice, probably because such projections may lack interpretability. Instead, to assess the multivariate aspects of a forecast, scoring rules relying on summary statistics are often favored.
4.2 Summary statistics
Summary statistics are a central tool of statisticians’ toolboxes as they provide interpretable and understandable quantities that can be linked to the behavior of the phenomenon studied. Moreover, their interpretability can be enhanced by the forecaster’s experience and this can be leveraged when constructing scoring rules based on them. Summary statistics are commonly present during the verification procedure and this can be extended by the use of new scoring rules derived from any summary statistic of interest. For example, numerous summary statistics can come in handy when studying precipitation over a region covered by gridded observations and forecasts. Firstly, it is common practice to focus on binary events such as the exceedance of a threshold (e.g., the presence or absence of precipitation). This can be studied by using the BS (5) on all $1$-dimensional marginals as mentioned in the previous subsection but also in a multivariate manner through the fraction of threshold exceedances (FTE) over patches as presented below. Regarding precipitation, it is standard to be interested in the prediction of total precipitation over a region or a time period. This transformation of the field can be leveraged to construct a scoring rule. Finally, it is important to verify that the spatial structure of the forecast matches the spatial structure of observations. The spatial structure can be (partially) summarized by the variogram or by wavelet transformations. The predictive performance for the spatial structure can be assessed by their associated scoring rules: the VS of order $p$ (14) and the wavelet-based score (Buschow et al., 2019). Other summary statistics can be of interest depending on the phenomenon studied; Heinrich-Mertsching et al. (2021) present summary statistics specific to point processes focusing on clustering and intensity.
The most well-known summary statistic is certainly the mean. In spatial statistics, it can be used to avoid double penalization when we are less interested in the exact location of the forecast but rather in a regional prediction. The transformation associated with the mean is
$$T_{\mathrm{mean}, P}(\boldsymbol{X}) = \frac{1}{|P|} \sum_{i \in P} X_i, \tag{17}$$
where $P$ denotes a patch and $|P|$ its dimension. Proposition 1 ensures that this transformation can be used to construct proper scoring rules. The scoring rule involved in the construction has to be univariate; however, the choice depends on the general properties preferred. For example, the SE would focus on the mean of the transformed quantity, whereas the AE would target its median. It is worth noting that the total can be derived from the mean transformation by removing the prefactor $1/|P|$:
$$T_{\mathrm{total}, P}(\boldsymbol{X}) = \sum_{i \in P} X_i.$$
In the case of precipitation, the total is more used than the mean since the total precipitation over a river basin can be decisive in evaluating flood risk. For example, one could construct an adapted version of the amplitude component of the SAL method (Wernli et al., 2008; Radanovics et al., 2018) using the SE if the mean total precipitation is of interest. Gneiting (2011) presents other links between the quantity of interest and the scoring rule associated. Similarly, the transformations associated with the minimum and the maximum over a patch $P$ can be obtained:
$$T_{\min, P}(\boldsymbol{X}) = \min_{i \in P} X_i, \qquad T_{\max, P}(\boldsymbol{X}) = \max_{i \in P} X_i.$$
The maximum or minimum can be useful when considering extreme events. It can help understand if the severity of an event is well-captured. For example, as minimum and maximum temperatures affect crop yields (see, e.g., Agnolucci et al. 2020), it can be of particular interest that a weather forecast within an agricultural model correctly predicts the minimum and maximum temperatures. After studying the mean, it is natural to think of the moments of higher order. We can define the transformation associated with the variance over a patch $P$ as
$$T_{\mathrm{var}, P}(\boldsymbol{X}) = \frac{1}{|P|} \sum_{i \in P} \left( X_i - T_{\mathrm{mean}, P}(\boldsymbol{X}) \right)^2.$$
The variance transformation can provide information on the fluctuations over a patch and be used to assess the quality of the local variability of the forecast. In a more general setup, it can be of interest to use a transformation related to the moment of order $n$ and the transformation associated follows naturally
$$T_{n, P}(\boldsymbol{X}) = \frac{1}{|P|} \sum_{i \in P} X_i^n.$$
More application-oriented transformations are the central or standardized moments (e.g., skewness or kurtosis). Their transformations can be obtained directly from estimators. As underlined in Heinrich-Mertsching et al. (2021), since Proposition 1 applies to any transformation, there is no condition on having an unbiased estimator to obtain proper scoring rules.
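The summary-statistic transformations over a patch can be sketched as a small dictionary of functions, each paired here with the CRPS. The names `PATCH_TRANSFORMS`, `patch_scores` and `crps_ensemble` are hypothetical, the patch is given as an array of component indices, and the variance uses the $1/|P|$ normalization (NumPy's default), which is an assumption since unbiasedness is not required for propriety.

```python
import numpy as np

def crps_ensemble(x, y):
    """CRPS of a univariate ensemble x against a scalar observation y."""
    x = np.asarray(x, dtype=float)
    spread = np.mean(np.abs(x[:, None] - x[None, :]))
    return float(np.mean(np.abs(x - y)) - 0.5 * spread)

# Summary statistics over a patch, each mapping a vector to a scalar.
PATCH_TRANSFORMS = {
    "mean": np.mean,
    "total": np.sum,
    "min": np.min,
    "max": np.max,
    "var": np.var,  # 1/|P| normalization
}

def patch_scores(ens, y, patch):
    """CRPS of each summary statistic computed over the components in `patch`."""
    scores = {}
    for name, transform in PATCH_TRANSFORMS.items():
        t_ens = np.array([transform(member[patch]) for member in ens])
        scores[name] = crps_ensemble(t_ens, transform(y[patch]))
    return scores

rng = np.random.default_rng(7)
ens = rng.gamma(2.0, 1.0, size=(20, 12))
y = rng.gamma(2.0, 1.0, size=12)
patch = np.arange(4, 10)  # hypothetical patch of six neighboring components
scores = patch_scores(ens, y, patch)
```

Reporting the scores separately, rather than only their sum, indicates whether a forecast fails on the regional amount, the extremes or the local variability.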
Threshold exceedance plays an important role in decision making such as weather alerts. For example, MeteoSwiss’ heat warning levels are based on the exceedance of daily mean temperature over three consecutive days (Allen et al., 2023a). They can be defined by the simultaneous exceedance of a certain threshold $t$, and the fraction of threshold exceedances (FTE) is the associated summary statistic:
$$T_{\mathrm{FTE}, P, t}(\boldsymbol{X}) = \frac{1}{|P|} \sum_{i \in P} \mathbb{1}\{X_i > t\}. \tag{18}$$
FTEs can be used as an extension of univariate threshold exceedances and they prevent the double-penalty effect. FTEs may be used to target compound events (e.g., the simultaneous exceedances of a threshold at multiple locations of interest). Roberts and Lean (2008) used an FTE-based SE over different sizes of neighborhoods (patches) to verify at which scale forecasts become skillful. To assess extreme precipitation forecasts, Rivoire et al. (2023) introduce scores for extremes with temporal and spatial aggregation separately. Extreme events are defined as values higher than the seasonal quantile. In the subseasonal-to-seasonal range, the temporal patches are 7-day windows centered on the extreme event and the spatial patches are square boxes of 150 km × 150 km centered on the extreme event. The final scores are transformed BS (5) with a threshold of one event predicted across the patch.
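A toy example illustrates how an FTE-based squared error avoids the double penalty. The setup is hypothetical: a 4 × 4 grid treated as a single patch, one observed rain cell, and an ensemble whose members all place the cell one grid point to the right. The function names and the strict inequality used for exceedance are assumptions of this sketch.

```python
import numpy as np

def fte(field, threshold):
    """Fraction of grid points of `field` exceeding `threshold`."""
    return float(np.mean(field > threshold))

def fte_squared_error(ens, obs, threshold):
    """SE between the ensemble-mean FTE and the observed FTE over one patch."""
    forecast_fte = np.mean([fte(member, threshold) for member in ens])
    return float((forecast_fte - fte(obs, threshold)) ** 2)

obs = np.zeros((4, 4))
obs[1, 1] = 5.0                       # observed rain cell
member = np.zeros((4, 4))
member[1, 2] = 5.0                    # same cell, shifted by one grid point
ens = np.stack([member, member, member])

# Patch-level verification: the shift is invisible to the FTE.
score_fte = fte_squared_error(ens, obs, threshold=1.0)

# Gridpoint-wise Brier scores summed over the grid: the shift is penalized twice.
prob_fc = np.mean(ens > 1.0, axis=0)
event_obs = (obs > 1.0).astype(float)
score_gridpoint = float(np.sum((prob_fc - event_obs) ** 2))
```

Here `score_fte` is 0.0 while `score_gridpoint` is 2.0: one unit for the missed cell and one for the false alarm. The price of this robustness is that displacement errors smaller than the patch are no longer detected.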
Correctly predicting the dependence structure is crucial in multivariate forecasting. Variograms are summary statistics representing the dependence structure. The variogram of order $p$ of the pair $(i,j)$ corresponds to the following transformation:
$$T_{\gamma, ij}^{p}(\boldsymbol{X}) = |X_i - X_j|^p.$$
As mentioned in the Introduction, using both the transformation and aggregation principles, we can recover the VS of order $p$ (14) introduced in Scheuerer and Hamill (2015):
$$\mathrm{VS}_p(F, \boldsymbol{y}) = \sum_{i,j=1}^{d} w_{ij} \, \mathrm{SE}\left(T_{\gamma, ij}^{p}(F), T_{\gamma, ij}^{p}(\boldsymbol{y})\right).$$
Along with the well-known VS of order $p$, Scheuerer and Hamill (2015) introduced alternatives where the scoring rule applied on the transformation is the CRPS (6) or the AE (3) instead of the SE (2). As mentioned previously, under the intrinsic hypothesis of Matheron (1963) (i.e., pairwise differences only depend on the distance between locations), the weights can be selected to obtain an optimal signal-to-noise ratio. Moreover, the weights could be selected to investigate a specific scale by giving a non-zero weight to pairs separated by a given distance.
In the case of spatial forecasts over a grid of size , a spatial version of the variogram transformation is available :
where are the coordinates of grid points. Under the intrinsic hypothesis of Matheron (1963), the variogram between grid points separated by the vector can be estimated by :
where . This directed variogram can be used to target the verification of the anisotropy of the dependence structure. The isotropy transformation associated to the distance can be defined by
(19) |
This transformation is the isotropy pre-rank function proposed in Allen et al. (2024). The isotropy transformation considers the orthogonal directions formed by the abscissa and ordinate axes and evaluates how the variogram changes between these directions. The transformation leads to negative or zero quantities with values close to zero characterizing isotropy and negative values corresponding to the anisotropy of the variograms in the directions and at the scale involved.
4.3 Other transformations
Transformations other than projections or summary statistics can be used to target forecast characteristics. For example, a transformation in the form of a change of coordinates or a change of scale (e.g., a logarithmic scale) can be used to obtain proper scoring rules. We highlight two families of scoring rules that can be seen as transformation-based scoring rules: wavelet-based scoring rules and threshold-weighted scoring rules.
Generally speaking, wavelet-based scoring rules are built thanks to a projection of forecast and observation fields onto a wavelet basis. Based on the wavelet coefficients, dimension reduction might be performed to target specific characteristics such as the dependence structure or the location. The resulting coefficients of the forecast fields are compared to the coefficients of the observation fields using scoring rules (e.g., squared error (SE) or energy score (ES)). Wavelet transformations are (complex) transformations, and thus, the scoring rules associated fall within the scope of Proposition 1. In particular, Buschow et al. (2019) used a dimension reduction procedure yielding a mean spectrum and a scale spectrum and used scoring rules to compare forecast and observation spectra. For example, the ES of the mean spectrum is used and shows good discrimination ability when the scale structure is misspecified.
Note that Buschow et al. (2019) proposed two other wavelet-based scoring rules: one based on the earth mover’s distance (EMD) of the scale histograms and one based on the distance between the scale histograms’ centers of mass. The EMD-based scoring rules are not proper since the EMD is not a proper scoring rule (Thorarinsdottir et al., 2013), and the so-called distance between centers of mass is not a distance but rather a difference of position, which leads to an improper scoring rule. However, the ES-based scoring rules are proper and could be derived from scale histograms. Despite their apparent complexity, wavelet transformations make it possible to target interpretable characteristics such as the location (Buschow, 2022), the scale structure (Buschow et al., 2019; Buschow and Friederichs, 2020) or the anisotropy (Buschow and Friederichs, 2021). The transformations proposed for the deterministic forecast setting in most of these articles could serve as foundations for future work aiming to propose wavelet-based proper scoring rules targeting the location, the scale structure or the anisotropy.
As showcased in Heinrich-Mertsching et al. (2021) for a specific example and hinted at in Allen et al. (2024), transformations can also be used to emphasize certain outputs. Threshold weighting is one of the three main types of weighting that preserve the propriety of scoring rules. Its name comes from the fact that it corresponds to a weighting over different thresholds in the case of CRPS (7) (Gneiting, 2011). Recall that given a conditionally negative definite kernel , the associated kernel scoring rule (15) is proper relative to . Many popular scoring rules are kernel scores such as the BS (5), the CRPS (6), the ES (13) and the VS (14). By definition (Allen et al., 2023b, Definition 4), threshold-weighted kernel scores are constructed as
where is the chaining function capturing how the emphasis is put on certain outputs. With this explicit definition, it is clear that threshold-weighted kernel scores are covered by the framework of Proposition 1. Note that Proposition 4 in Allen et al. (2023b) states that strict propriety of the kernel scoring rule is preserved by the chaining function if and only if is injective. Weighted scoring rules make it possible to emphasize particular outcomes: when studying extreme events, it is often of particular interest to focus on values larger than a given threshold and this can be achieved using the chaining function . Threshold-weighted scoring rules have been used in verification procedures in the literature; we illustrate their use through three different studies. Lerch and Thorarinsdottir (2013) aggregated the threshold-weighted CRPS (twCRPS) across stations to compare the upper tail performance of different daily maximum wind speed forecasts. Chapman et al. (2022) aggregated the twCRPS across locations to study the improvement of statistical postprocessing techniques, the importance of predictors and the influence of the size of the training set on the performance. Allen et al. (2023a) used threshold-weighted versions of the CRPS, the ES, and the VS to compare the predictive performance of forecasts regarding heatwave severity; the scoring rules were aggregated across stations. Readers may refer to Allen et al. (2023a) and Allen et al. (2023b) for insightful reviews of weighted scoring rules in both univariate and multivariate settings.
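As an illustration of the threshold-weighted construction, the following minimal sketch computes an ensemble estimate of the twCRPS by applying a chaining function to both the ensemble members and the observation before evaluating the CRPS in its kernel form. The chaining function `max(x, threshold)` is an assumption made here for illustration (the symbol used in the text did not survive extraction), and the function names `kernel_crps` and `tw_crps` are hypothetical.

```python
import numpy as np

def kernel_crps(ens, y):
    """Ensemble estimate of the CRPS in kernel form: E|X - y| - 0.5 E|X - X'|."""
    ens = np.asarray(ens, dtype=float)
    term_obs = np.mean(np.abs(ens - y))
    term_spread = np.mean(np.abs(ens[:, None] - ens[None, :]))
    return term_obs - 0.5 * term_spread

def tw_crps(ens, y, threshold):
    """Threshold-weighted CRPS: the CRPS evaluated after the chaining function."""
    # Assumed chaining function: values below the threshold are mapped to it,
    # so only behaviour above the threshold contributes to the score.
    chain = lambda x: np.maximum(x, threshold)
    return kernel_crps(chain(np.asarray(ens, dtype=float)), chain(y))

rng = np.random.default_rng(0)
ens = rng.normal(size=200)  # synthetic ensemble, for illustration only
print(tw_crps(ens, y=0.3, threshold=1.0))
```

With this choice, a forecast and an observation that both stay below the threshold receive a score of zero, which reflects the fact that this chaining function is not injective and therefore does not preserve strict propriety.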
5 Simulation study
This section provides simulated examples to showcase the different uses of the framework introduced in Section 3 to construct interpretable proper scoring rules for multivariate forecasts. Four examples are developed. Firstly, a setup where the emphasis is put on 1-marginal verification is proposed. This setup serves as a means of recalling and showing the limitations of strictly proper scoring rules and the benefits of interpretable scoring rules in a concrete setting. Secondly, a standard multivariate setup is studied where popular multivariate scoring rules (i.e., VS and ES) are compared to a multivariate scoring rule aggregated over patches and an aggregation-and-transformation-based scoring rule in their discrimination ability regarding the dependence structure. Thirdly, a setup with anisotropy in both observations and forecasts is introduced. The anisotropic score is constructed based on the transformation principle with the goal of discriminating differences of anisotropy in the dependence structure between forecasts and observations. Fourthly, we propose a setup to test the sensitivity of scoring rules to the double-penalty effect and we introduce scoring rules that can be built to be resilient to some manifestations of the double-penalty effect.
In these four numerical experiments, the spatial field is observed and predicted on a regular grid . Observations are realizations of a Gaussian random field with zero mean and power-exponential covariance defined as
(20) |
The parameters are taken equal to , and .
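For readers who wish to reproduce a comparable setting, the sketch below samples a zero-mean Gaussian random field on a small regular grid. It assumes the standard power-exponential form `sigma**2 * exp(-(d / lam)**beta)` for the covariance; Eq. (20) and the parameter values did not survive extraction, so the functional form, the names `sigma`, `lam`, `beta`, and the default values used here are placeholders rather than the authors' settings.

```python
import numpy as np

def power_exponential_cov(coords, sigma=1.0, lam=3.0, beta=1.0):
    """Assumed covariance: sigma^2 * exp(-(distance / lam)^beta), with 0 < beta <= 2."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma**2 * np.exp(-((d / lam) ** beta))

def sample_field(n_side, n_samples, rng, **cov_params):
    """Draw zero-mean Gaussian fields on an n_side x n_side regular grid."""
    xs, ys = np.meshgrid(np.arange(n_side), np.arange(n_side), indexing="ij")
    coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
    cov = power_exponential_cov(coords, **cov_params)
    # A small jitter on the diagonal keeps the Cholesky factorization stable.
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(len(coords)))
    z = rng.standard_normal((n_samples, len(coords)))
    return (z @ chol.T).reshape(n_samples, n_side, n_side)

fields = sample_field(10, 5, np.random.default_rng(0))
print(fields.shape)
```

Misspecified forecasts of the kind studied in this section can then be obtained by changing `lam` (range) or `beta` (smoothness) while keeping the other parameters fixed.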
In each numerical experiment, we compare a few predictive distributions, including the distribution generating observations and other ones deviating from the generative distributions in a specific way. These different predictive distributions are evaluated with different scoring rules and the aim is to illustrate the discriminatory ability of the different scoring rules.
The simulation study uses 500 observations of the random field . The scoring rules are computed using exact formulas when possible (see Appendix E), and, when exact formulas are not available, they are computed based on a sample of size 100 (i.e., ensemble forecasts) of the probabilistic forecast. Estimated expectations over the 500 observations are computed and the experiment is repeated 10 times. The corresponding results are represented by boxplots. The units of the scoring rules are rescaled by the average expected score of the true distribution (i.e., the ideal forecast). The statistical significance of the ranking between forecasts is tested using a Diebold-Mariano test (Diebold and Mariano, 1995). When deemed necessary, statistical significance is mentioned for a confidence level of 95%.
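The significance test mentioned above can be sketched as follows. This is a simplified variant of the Diebold-Mariano test that assumes independent score differences (no autocorrelation correction) and a normal approximation of the test statistic; it is not necessarily the exact variant used for the experiments, and the function name `diebold_mariano` is hypothetical.

```python
import math
import numpy as np

def diebold_mariano(scores_a, scores_b):
    """Two-sided test of equal expected score, assuming independent differences."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = d.size
    # Standardized mean score difference, asymptotically standard normal.
    stat = d.mean() / (d.std(ddof=1) / math.sqrt(n))
    # Two-sided p-value: 2 * (1 - Phi(|stat|)) = erfc(|stat| / sqrt(2)).
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value

rng = np.random.default_rng(0)
scores_ideal = rng.gamma(2.0, 1.0, size=500)                      # synthetic scores
scores_other = scores_ideal + rng.normal(0.2, 0.5, size=500)      # slightly worse on average
print(diebold_mariano(scores_other, scores_ideal))
```

A positive statistic with a p-value below 0.05 indicates that the first forecast has a significantly larger (worse) expected score than the second at the 95% level.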
The code used for the different numerical experiments is publicly available at https://github.com/pic-romain/aggregation-transformation.
5.1 Marginals
This first numerical experiment focuses on the prediction of the 1-dimensional marginal distributions and the aggregation of univariate scoring rules. For simplicity, we consider only stationary random fields so that the 1-marginal distribution is the same at all grid points. Although similar conclusions could be drawn from a univariate framework (i.e., with independent 1-dimensional rather than spatial observations), this example aims to clarify the notion of interpretability and presents notions that will be reused in the following examples. The verification of marginals, along with other simple quantities, is usually one of the first steps of any multivariate forecast verification process.
Observations follow the model of (20) and multiple competing forecasts are considered:
- the ideal forecast is the Gaussian distribution generating observations and is used as a reference;
- the biased forecast is a Gaussian predictive distribution with the same covariance structure as the observation but a different mean ;
- the overdispersed forecast and the underdispersed forecast are Gaussian predictive distributions from the same model as the observations except for an overestimation () and an underestimation () of the variance respectively;
- the location-scale Student forecast is used where the marginals follow location-scale Student- distributions with parameters , , and is such that the standard deviation is and the covariance structure the same as in (20).
In order to compare the predictive performance of forecasts, we use scoring rules constructed by aggregating univariate scoring rules. Here, the aggregation is done with uniform weights since there is no prior knowledge on the locations. The univariate scoring rules considered are the continuous ranked probability score (CRPS), the Brier score (BS), the quantile score (QS), the squared error (SE) and the Dawid-Sebastiani score (DSS). Figure 1(a) compares five different forecasts based on their expected CRPS. It can be seen that all forecasts except for the ideal one have similar expected values and no sub-efficient forecast is significantly better than the others. In order to gain more insight into the predictive performance of the forecast, it is necessary to use other scoring rules. In practice, the distribution is unknown; thus, it is impossible to know if a forecast is efficient; it is only possible to provide a ranking linked to the closeness of the forecast with respect to the observations. The definition of closeness depends on the scoring rule used: for example, the CRPS defines closeness in terms of the integrated quadratic distance between the two cumulative distribution functions (see, e.g., Thorarinsdottir and Schuhen 2018).
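The aggregation with uniform weights described above can be sketched for Gaussian marginals using the well-known closed form of the CRPS of a normal distribution. The forecast parameters below (a bias of 0.25, a standard deviation of 1.2) are hypothetical values chosen for illustration, and the synthetic observations are drawn independently rather than from the spatial model, which is sufficient when only marginals are assessed.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2) evaluated at observation(s) y."""
    z = (np.asarray(y, dtype=float) - mu) / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def aggregated_crps(mu, sigma, field):
    """Uniform-weight aggregation of the marginal CRPS over all grid points."""
    return float(np.mean(crps_gaussian(mu, sigma, field)))

rng = np.random.default_rng(1)
obs = rng.normal(size=(20, 20))  # stand-in observations with N(0, 1) marginals
forecasts = {"ideal": (0.0, 1.0), "biased": (0.25, 1.0), "overdispersed": (0.0, 1.2)}
for name, (mu, sigma) in forecasts.items():
    print(name, aggregated_crps(mu, sigma, obs))
```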
If the quantity of interest is the value of a quantile of a certain level , the aggregated QS is an appropriate scoring rule. Figure 1(b) shows the expected aggregated QS for three different levels : , and . is associated with the prediction of the median and, since all the forecasts are symmetric and only the biased forecast is not centered on zero, the other forecasts are equally the best and efficient forecasts. If the third quartile is of interest (), the location-scale Student forecast appears as significantly the best (among the non-ideal). For the higher level of , the biased forecast is significantly the best since its bias error seems to be compensated by its correct prediction of the variance. Depending on the level of interest, the best forecast varies; the only forecast that would appear to be the best regardless of the level is the ideal forecast, as implied by (8).
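A minimal sketch of the aggregated quantile score for Gaussian marginals is given below. It uses the pinball-loss form `(1{y <= q} - alpha) * (q - y)`; the scaling convention of Eq. (8) is not visible in this section, so this form is an assumption, as are the forecast parameters and the levels used in the example.

```python
from statistics import NormalDist
import numpy as np

def quantile_score(q, y, alpha):
    """Pinball loss of the predicted quantile q at level alpha for observation(s) y."""
    y = np.asarray(y, dtype=float)
    return ((y <= q).astype(float) - alpha) * (q - y)

def aggregated_qs_gaussian(mu, sigma, field, alpha):
    """Uniform-weight aggregation of the QS when all marginals are N(mu, sigma^2)."""
    q = NormalDist(mu, sigma).inv_cdf(alpha)
    return float(np.mean(quantile_score(q, field, alpha)))

rng = np.random.default_rng(2)
obs = rng.normal(size=(20, 20))  # stand-in observations with N(0, 1) marginals
for alpha in (0.5, 0.75, 0.95):  # hypothetical levels
    print(alpha, aggregated_qs_gaussian(0.0, 1.0, obs, alpha),
          aggregated_qs_gaussian(0.25, 1.0, obs, alpha))
```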
If a quantity of interest is the exceedance of a threshold at each location, then the aggregated BS is an interesting scoring rule. Figure 1(c) shows the expectation of aggregated BS for the different forecasts and for two different thresholds ( and ). Among the non-ideal forecasts, there seems to be a clearer ranking than for the CRPS. The overdispersed forecast is significantly the best regarding the prediction of the exceedance of the threshold and the biased forecast is significantly the best regarding the exceedance of . As for the aggregated quantile score, the best forecast depends on the threshold considered and the only forecast that is the best regardless of the threshold is the ideal one (see Eq. (7)).
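The aggregated Brier score for threshold exceedance can be sketched in the same way. The example assumes Gaussian marginals and hypothetical forecast parameters and thresholds; the function name `aggregated_bs_gaussian` is introduced for illustration only.

```python
from statistics import NormalDist
import numpy as np

def aggregated_bs_gaussian(mu, sigma, field, threshold):
    """Uniform-weight aggregation of the Brier score for the event {value > threshold}."""
    # Predicted exceedance probability, identical at all grid points (stationarity).
    p_exceed = 1.0 - NormalDist(mu, sigma).cdf(threshold)
    exceed = (np.asarray(field, dtype=float) > threshold).astype(float)
    return float(np.mean((p_exceed - exceed) ** 2))

rng = np.random.default_rng(3)
obs = rng.normal(size=(20, 20))  # stand-in observations with N(0, 1) marginals
for t in (0.5, 1.0):             # hypothetical thresholds
    print(t, aggregated_bs_gaussian(0.0, 1.0, obs, t),
          aggregated_bs_gaussian(0.0, 1.2, obs, t))
```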
If the moments are of interest, the aggregated SE discriminates the first moment (i.e., the mean) and the aggregated DSS discriminates the first two moments (i.e., the mean and the variance). Figure 1(d) presents the expected values of these scoring rules for the different forecasts considered in this example. The aggregated SEs of all forecasts, except the biased forecast, are equal since they have the same (correct) marginal means. The aggregated DSS presents the biased forecast as significantly the best one (among non-ideal). This is caused by the combined discrimination of the first two moments of the Dawid-Sebastiani score (see Eq. (9) and Appendix A).
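The difference between the two moment-based scores can be made concrete with the short sketch below. It assumes the common form of the Dawid-Sebastiani score, `2 log(sigma) + ((y - mu) / sigma)^2`; Eq. (9) is not visible in this section, so this form (up to constants) is an assumption, as are the forecast parameters in the example.

```python
import numpy as np

def aggregated_se(mu, field):
    """Uniform-weight aggregation of the squared error of the predicted mean."""
    return float(np.mean((np.asarray(field, dtype=float) - mu) ** 2))

def aggregated_dss(mu, sigma, field):
    """Uniform-weight aggregation of the Dawid-Sebastiani score (assumed form)."""
    z = (np.asarray(field, dtype=float) - mu) / sigma
    return float(np.mean(2.0 * np.log(sigma) + z**2))

rng = np.random.default_rng(4)
obs = rng.normal(size=(20, 20))  # stand-in observations with N(0, 1) marginals
# The SE only depends on the predicted mean, so a wrong variance goes unnoticed,
# whereas the DSS changes with both the mean and the variance.
for name, (mu, sigma) in {"ideal": (0.0, 1.0), "overdispersed": (0.0, 1.2)}.items():
    print(name, aggregated_se(mu, obs), aggregated_dss(mu, sigma, obs))
```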
5.2 Multivariate scores over patches
This second numerical experiment focuses on the prediction of the dependence structure. Observations are sampled from the model of Eq. (20) and we compare forecasts that differ only in their dependence structure through misspecification of the range parameter and the smoothness parameter :
- the ideal forecast is the Gaussian distribution generating the observations;
- the small-range forecast and the large-range forecast are Gaussian predictive distributions from the same model (20) as the observations except for an underestimation () and an overestimation (), respectively, of the range;
- the under-smooth forecast and the over-smooth forecast are Gaussian predictive distributions from the same model as the observations except for an underestimation () and an overestimation (), respectively, of the smoothness.
Since the forecasts differ only in their dependence structure, scoring rules acting on the 1-dimensional marginals would not be able to distinguish the ideal forecast from the others. We use the variogram score (VS) as a reference since it is known to discriminate misspecification of the dependence structure. We introduce the patched energy score, which results from the aggregation of the ES (with over patches, defined as
where is an ensemble of spatial patches, is the weight associated with a patch and is the marginal of over the patch . In order to make the scoring rule more interpretable, only square patches of a given size are considered and the weights are uniform (). Moreover, we consider the aggregated CRPS and the ES since they are limiting cases of the patched ES for patches and a single patch over the whole domain , respectively. Additionally, we propose the p-variation score (pVS), which is based on the p-variation transformation to focus on the discrimination of the regularity of the random fields,
where is the domain restricted to grid points such that is defined (i.e., ). Note that in the literature on fractional random fields, the p-variation is an important quantity used to characterize the roughness of a random field and is commonly used for estimation purposes; see Benassi et al. (2004), Basse-O’Connor et al. (2021) and the references therein.
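The sketch below illustrates one possible reading of the p-variation score. Since the defining equation did not survive extraction, it assumes that the transformation is the p-th power of the absolute rectangular increment over neighbouring grid points and that the score aggregates, with uniform weights, the squared error between the ensemble mean of the transformed forecast and the transformed observation. Both assumptions, and the function names, are hypothetical.

```python
import numpy as np

def p_variation(field, p):
    """Assumed transformation: |X[i+1,j+1] - X[i+1,j] - X[i,j+1] + X[i,j]|^p."""
    inc = (field[..., 1:, 1:] - field[..., 1:, :-1]
           - field[..., :-1, 1:] + field[..., :-1, :-1])
    return np.abs(inc) ** p

def p_variation_score(ens, obs, p):
    """Squared error between expected and observed p-variation, uniform weights."""
    forecast_mean = p_variation(ens, p).mean(axis=0)  # ensemble estimate of E_F[T(X)]
    return float(np.mean((forecast_mean - p_variation(obs, p)) ** 2))

rng = np.random.default_rng(5)
ens = rng.normal(size=(100, 8, 8))  # synthetic ensemble of fields
obs = rng.normal(size=(8, 8))       # synthetic observed field
print(p_variation_score(ens, obs, p=1.0))
```

The transformed field is one grid point smaller in each direction than the original, which corresponds to the restriction of the domain to grid points where the increment is defined.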
In Figure 2, the ES and the patched ES were computed using samples from the forecasts since closed-form expressions could not be derived. However, closed-form expressions for the VS and the p-variation score (pVS) were derived and are available in Appendix E. As already shown in Scheuerer and Hamill (2015), the VS is able to significantly discriminate misspecification of the dependence structure induced by the range parameter (see Fig. 2(a)). Smaller orders of (such as ) appear to be more informative than higher ones. Moreover, it is able to discriminate misspecification induced by the smoothness parameter (significantly for all orders considered), even if this is less marked than for the misspecification of the range .
Figure 2(b) compares the forecasts using the p-variation score (pVS) with . Note that the forecasts are provided in the same order as in the other sub-figures. The pVS is able to (significantly) discriminate all four sub-efficient forecasts from the ideal forecast at all orders . In the cases considered, the pVS has a stronger discriminating ability than the VS, in particular for misspecification of the smoothness parameter . The overall improvement in the discrimination ability of the pVS compared to the VS is due to the fact that it only considers local pair interactions between grid points, which, in the experimental setup considered, greatly improves the signal-to-noise ratio compared to the VS. This locality comes at a cost: for example, the pVS would be incapable of differentiating two forecasts that only differ in their longer-range dependence structure, whereas the VS should discriminate the two forecasts.
Figure 2(c) shows that the patched ESs have a better discrimination ability than the ES. As expected from the clear analogy between the variogram score weights and the selection of valid patches, focusing on smaller patches improves the signal-to-noise ratio. For all patch sizes considered, the patched ES significantly discriminates the ideal forecast from the others, whereas the ES does not significantly discriminate the misspecification of smoothness of the under-smooth and over-smooth forecasts. Nonetheless, the patched ES remains less sensitive than the VS to misspecifications in the dependence structure through the range parameter or the smoothness parameter .
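The patched energy score discussed in this subsection can be sketched as follows, using an ensemble estimate of the ES on every square patch and uniform weights. The choice of overlapping (sliding) patches and the ES exponent equal to 1 are assumptions made for this illustration, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def energy_score(ens, y):
    """Ensemble estimate of the ES: E||X - y|| - 0.5 E||X - X'||; ens is (m, d), y is (d,)."""
    term_obs = np.mean(np.linalg.norm(ens - y, axis=1))
    term_spread = np.mean(np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=-1))
    return term_obs - 0.5 * term_spread

def patched_energy_score(ens, obs, size):
    """Uniform-weight average of the ES over all size x size patches of the grid."""
    ens_p = sliding_window_view(ens, (size, size), axis=(1, 2))  # (m, a, b, size, size)
    obs_p = sliding_window_view(obs, (size, size))               # (a, b, size, size)
    m, a, b = ens_p.shape[:3]
    scores = [energy_score(ens_p[:, i, j].reshape(m, -1), obs_p[i, j].ravel())
              for i in range(a) for j in range(b)]
    return float(np.mean(scores))

rng = np.random.default_rng(6)
ens = rng.normal(size=(50, 8, 8))  # synthetic ensemble of fields
obs = rng.normal(size=(8, 8))      # synthetic observed field
for size in (1, 3, 8):
    print(size, patched_energy_score(ens, obs, size))
```

With `size=1` the function returns the ensemble estimate of the aggregated CRPS, and with `size` equal to the grid side it returns the ES over the whole domain, in line with the two limiting cases mentioned in the text.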
The VS relies on the aggregation and transformation principles and is able to discriminate the dependence structure. Similarly, the p-variation score (pVS) is able to discriminate misspecifications of the dependence structure. Being based on more local transformations (i.e., the p-variation transformation instead of the variogram transformation), it has a greater discrimination ability than the VS in this experimental setup. In addition to this known application of the aggregation and transformation principles, it has been shown that multivariate transformations can be used to obtain patched scores that, in the case of the ES, lead to an improvement in the signal-to-noise ratio with respect to the original scoring rule.
5.3 Anisotropy
In this example, we focus on the anisotropy of the dependence structure. We introduce geometric anisotropy in observations and forecasts via the covariance function in the following way
with . The matrix has the following form:
with the direction of the anisotropy and the ratio between the axes.
The observations follow the anisotropic version of the model in Eq. (20) where the covariance function presents the geometric anisotropy introduced above with (as previously) and and . Multiple forecasts are considered that only differ in their prediction of the anisotropy in the model:
- the ideal forecast has the same distribution as the observations and is used as a reference;
- the small-angle forecast and the large-angle forecast have a correct ratio but an under- and over-estimation of the angle, respectively (i.e., and );
- the isotropic forecast and the over-anisotropic forecast have a ratio and , respectively, but a correct angle .
Since these forecasts differ only in the anisotropy of their dependence structure, scoring rules not suited to discriminate the dependence structure would not be able to differentiate them. We compare two proper scoring rules: the variogram score and the anisotropic scoring rule. The variogram score is studied in two different settings: one where the weights are proportional to the inverse of the distance and one where the weights are proportional to the inverse of the anisotropic distance , which is supposed to be more informed since it is the quantity present in the covariance function. The anisotropic score (AS) is a scoring rule based on the framework introduced in Section 3 and, in its general form, it is defined as
(21) |
where is a transformation summarizing the anisotropy of a field such as the one introduced in (19). Additionally, we use a special case of this scoring rule where we do not aggregate across the scales and where is the squared error :
(22) |
We use a transformation similar to the one of (19) where instead the axes are the first and second bisectors. This leads to the following formula:
The choice of this transformation instead of the transformation based on the anisotropy along the abscissa and ordinate is motivated by the fact that it leads to a clearer differentiation of the forecasts (not shown).
Figure 3(a) presents the variogram score of order in its two variants. Both the standard VS and the informed VS are able to significantly discriminate the ideal forecast from the other forecasts, but they have a weak sensitivity to misspecification of the geometric anisotropy. Even though the informed VS is supposed to increase the signal-to-noise ratio compared to the standard VS, it is not more sensitive to misspecifications in the experimental setup considered. Other orders of variograms were tested and worsened the discrimination ability of both the standard and the informed VS (not shown).
Figure 3(b) shows the AS (22) with scales for the different forecasts and the aggregated AS (21), where the scales are aggregated with weights . The anisotropic scores were computed using samples drawn from the forecasts; this explains why the ideal forecast may appear sub-efficient for some values of (e.g., ). As intended by its construction, the AS is able to significantly distinguish the correct anisotropy behavior in the dependence structure for values of up to included. For , the AS does not significantly discriminate the isotropic forecast and the over-anisotropic forecast from the ideal one. The fact that is the most sensitive to misspecifications is probably because the strength of the dependence structure decays with the distance (i.e., with ). This shows that the hyperparameter plays an important role in having an informative AS (as do the weights and the order for the variogram score). For in particular, it can be seen that the AS is not sensitive to the misspecification of the ratio and the angle in the same manner. This depends on the degree of misspecification but also on the hyperparameters of the AS. The aggregated AS allows us to avoid the selection of a scale while maintaining the discrimination ability of the lower values of .
The anisotropic score is an interpretable scoring rule targeting the anisotropy of the dependence structure. However, it has the limitation of introducing hyperparameters in the form of the scale and the axes along which the anisotropy is measured. Aggregation across scales and axes can circumvent the selection of these hyperparameters; however, a clever choice of weights will be required to maintain the signal-to-noise ratio.
5.4 Double-penalty effect
In this example, we illustrate in a simple setting how scoring rules over patches can be robust to the double-penalty effect (see Section 2.4). The double-penalty effect is introduced in the form of forecasts that deviate from the ideal forecast by an additive or multiplicative noise term (i.e., nugget effect). The noise terms follow centered uniform distributions, so that the forecasts are correct on average but incorrect at each grid point.
Observations follow the model of (20) with the parameters , and . As before, the ideal forecast, which has the same distribution as the observations, is used as a reference. Additive-noised forecasts are the first type of forecast introduced to test the sensitivity of scoring rules to the form of the double-penalty effect presented above. They differ from the ideal forecast through their marginals in the following way:
where is a spatial white noise independent at each location . This has an effect on the mean of the marginals at each grid point. Three different noise range values are tested . Similarly, multiplicative-noised forecasts that differ from the ideal forecast through their marginals are introduced:
where and . This has an effect on the variance of the marginals at each grid point and, thus, on the covariance. The same noise range values are tested .
The aggregated CRPS is a naive scoring rule that is sensitive to the double-penalty effect. We propose the aggregated CRPS of spatial mean which is defined as
where is an ensemble of spatial patches, is the weight associated with a patch and the spatial mean over the patch (17). It is a proper scoring rule, and it has an interpretation similar to the aggregated CRPS, but the forecasts are only evaluated on the performance of their spatial mean. In order to make the scoring more interpretable, only square patches of a given size are considered and the weights are uniform. The size of the patches can be determined by multiple factors such as the physics of the problem, the constraints of the verification in the case of models on different scales, or hypotheses on a different behavior below and above the scale of the patch (e.g., independent and identically distributed; Taillardat and Mestre 2020). Note that the aggregated CRPS of spatial mean is equal to the aggregated CRPS when patches of size are considered.
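A minimal sketch of the aggregated CRPS of spatial mean is given below, based on an ensemble estimate of the CRPS in kernel form and on overlapping square patches with uniform weights. The use of overlapping patches is an assumption, and the function name `crps_of_spatial_mean` is hypothetical. The toy example uses a checkerboard perturbation to mimic a displacement-like error that cancels out over 2 x 2 patches.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def crps_of_spatial_mean(ens, obs, size):
    """Uniform-weight average over size x size patches of the CRPS of the patch mean."""
    ens_m = sliding_window_view(ens, (size, size), axis=(1, 2)).mean(axis=(-2, -1))  # (m, a, b)
    obs_m = sliding_window_view(obs, (size, size)).mean(axis=(-2, -1))               # (a, b)
    # Kernel form of the CRPS, estimated from the ensemble on each patch.
    term_obs = np.abs(ens_m - obs_m).mean(axis=0)
    term_spread = np.abs(ens_m[:, None] - ens_m[None, :]).mean(axis=(0, 1))
    return float(np.mean(term_obs - 0.5 * term_spread))

base = np.zeros((4, 4))
ens = np.stack([base, base])       # degenerate two-member ensemble, for illustration
i, j = np.indices((4, 4))
obs = base + (-1.0) ** (i + j)     # checkerboard error of amplitude 1
print(crps_of_spatial_mean(ens, obs, 1))  # penalized at every grid point
print(crps_of_spatial_mean(ens, obs, 2))  # errors cancel within each 2 x 2 patch
```

With `size=1` the function coincides with the ensemble estimate of the aggregated CRPS, which is why the grid-point error is fully penalized in that case.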
If a quantity of interest is the exceedance of a threshold , the scoring rule associated with that is the Brier score (5). We compare the aggregated BS with its counterpart over patches: the aggregated SE of the FTE. It is defined as
where is an ensemble of spatial patches, is the weight associated with a patch and the fraction of threshold exceedance over the patch and for the threshold (18). This scoring rule is proper and focuses on the prediction of the exceedance of a threshold via the fraction of locations over a patch exceeding said threshold. The resemblance with the Brier score is clear and the aggregated SE of FTE becomes the aggregated BS when patches of size are considered.
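The aggregated SE of the FTE can be sketched along the same lines: the fraction of threshold exceedance is computed on each patch for every ensemble member and for the observation, and the squared error between the ensemble mean FTE and the observed FTE is averaged over patches. Overlapping patches with uniform weights are assumed, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def fte(field, threshold, size):
    """Fraction of grid points exceeding the threshold in each size x size patch."""
    exceed = (np.asarray(field, dtype=float) > threshold).astype(float)
    axes = (exceed.ndim - 2, exceed.ndim - 1)  # the two spatial axes
    return sliding_window_view(exceed, (size, size), axis=axes).mean(axis=(-2, -1))

def aggregated_se_fte(ens, obs, threshold, size):
    """Uniform-weight average of the squared error between expected and observed FTE."""
    forecast_fte = fte(ens, threshold, size).mean(axis=0)  # ensemble estimate of E_F[FTE]
    return float(np.mean((forecast_fte - fte(obs, threshold, size)) ** 2))

ens = np.stack([np.ones((2, 2)), -np.ones((2, 2))])  # exceedance probability 0.5 everywhere
obs = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(aggregated_se_fte(ens, obs, threshold=0.0, size=1))  # Brier score with p = 0.5
print(aggregated_se_fte(ens, obs, threshold=0.0, size=2))  # observed FTE matches the forecast
```

For `size=1`, the ensemble mean of the exceedance indicator is the predicted exceedance probability, so the score reduces to the ensemble estimate of the aggregated BS.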
In Figure 4, the values of the aggregated SE of FTE have been obtained by sampling the forecasts’ distribution. Figure 4(a) compares the aggregated CRPS and the aggregated CRPS of spatial mean for different patch sizes . For all the scoring rules, we observe an increase in the expected value as the range of the noise increases. As expected, the aggregated CRPS is very sensitive to noise in the mean or the variance and, thus, is prone to the double-penalty effect. The aggregated CRPS of spatial mean is less sensitive to noise on the mean or the variance. Moreover, different patch sizes allow us to select the spatial scale below which we want to avoid a double penalty. Given that the distribution of the noise is fixed in this simulation (i.e., uniform), patch size is related to the level of random fluctuations (i.e., the range ) tolerated by the scoring rule before significant discrimination with respect to the ideal forecast. It is worth noting that the range of the noise leads to a stronger increase in the values of these CRPS-related scoring rules when the noise is on the mean rather than on the variance.
Figure 4(b) compares the aggregated BS and the aggregated squared error of fraction of threshold exceedances. For simplicity, we fix the threshold . The aggregated BS is, as expected, sensitive to noise in the mean or the variance, and an increase in the range of the noise leads to an increase in the expected value of the score. The aggregated SE of FTE acts as a natural extension of the aggregated BS to patches and provides scoring rules that are less sensitive to noise on the mean or the variance. The sensitivity evolves differently with the increase of the patch size compared to the aggregated CRPS of spatial mean since the aggregated SE of FTE measures the effect on the average exceedance over a patch. The range of the noise apparently leads to a comparable increase in the values of the aggregated SE of FTE when the noise is additive or multiplicative.
The use of transformations over patches is similar to neighborhood-based methods in the spatial verification tools framework. Even though avoiding the double-penalty effect is not restricted to tools that do not penalize forecasts below a certain scale, this simulation setup presents a type of test relevant to probabilistic forecasts. The patch-based scoring rules proposed here are not by themselves a sufficient verification tool since they are insensitive to some unrealistic forecasts (e.g., if the mean value over the patch is correct, deviations within the patch may be arbitrarily large and still leave the value of the scoring rule unchanged). As for any other scoring rule, they should be used together with other scoring rules.
6 Conclusion
Verification of probabilistic forecasts is an essential but complex step of all forecasting procedures. Scoring rules may appear as the perfect tool to compare forecast performance since, when proper, they can simultaneously assess calibration and sharpness. However, propriety, even if strict, does not ensure that a scoring rule is relevant to the problem at hand. With that in mind, we agree with the recommendation of Scheuerer and Hamill (2015) that "several different scores be always considered before drawing conclusions". This is even more important in a multivariate setting where forecasts are characterized by more complex objects.
We proposed a framework to construct proper scoring rules in a multivariate setting using aggregation and transformation principles. Aggregation-and-transformation-based scoring rules can improve the conclusions drawn since they enable the verification of specific aspects of the forecast (e.g., anisotropy of the dependence structure). This has been illustrated both using examples from the literature and numerical experiments showcasing different settings. Moreover, we showed that the aggregation and transformation principles can be used to construct scoring rules that are proper, interpretable, and not affected by the double-penalty effect. This could be a starting point to help bridge the gap between the proper scoring rule community and the spatial verification tools community.
As interest in machine learning-based weather forecasting increases (see, e.g., Ben Bouallègue et al. 2024a), multiple approaches have reached a performance comparable to ECMWF deterministic high-resolution forecasts (Keisler, 2022; Pathak et al., 2022; Bi et al., 2023; Lam et al., 2022; Chen et al., 2023). The natural extension to probabilistic forecasts is already developing and is enabled by publicly available benchmark datasets such as WeatherBench 2 (Rasp et al., 2024). Aggregation-and-transformation-based methods can help ensure that parameter inference does not hedge certain important aspects of the multivariate probabilistic forecasts.
There seems to be a trade-off between discrimination ability and strict propriety. Discrimination ability comes from the ability of scoring rules to differentiate misspecification of certain characteristics. By definition, the expectation of strictly proper scoring rules is minimized when the probabilistic forecast is the true distribution. Nonetheless, it does not guarantee that this global minimum is steep in any misspecification direction. However, interpretable scoring rules can discriminate the misspecification of their target characteristic. Should scoring rules discriminating any misspecification be pursued? Or should interpretable scoring rules discriminating a specific type of misspecification be used instead?
Acknowledgments
The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-20-CE40-0025-01 (T-REX project) and the Energy-oriented Centre of Excellence II (EoCoE-II), Grant Agreement 824158, funded within the Horizon2020 framework of the European Union. Part of this work was also supported by the ExtremesLearning grant from 80 PRIME CNRS-INSU and this study has received funding from Agence Nationale de la Recherche - France 2030 as part of the PEPR TRACCS program under grant number ANR-22-EXTR-0005 and the ANR EXSTA.
Sam Allen is thanked for fruitful discussions during the preparation of this manuscript.
References
- Agnolucci et al. (2020) Paolo Agnolucci, Chrysanthi Rapti, Peter Alexander, Vincenzo De Lipsis, Robert A. Holland, Felix Eigenbrod, and Paul Ekins. Impacts of rising temperatures and farm management practices on global yields of 18 crops. Nature Food, 1(9):562–571, September 2020. ISSN 2662-1355. https://doi.org/10.1038/s43016-020-00148-x.
- Al Masry et al. (2023) Zeina Al Masry, Romain Pic, Clément Dombry, and Chrisine Devalland. A new methodology to predict the oncotype scores based on clinico-pathological data with similar tumor profiles. Breast Cancer Research and Treatment, 2023. ISSN 1573-7217. https://doi.org/10.1007/s10549-023-07141-5.
- Alexander et al. (2022) Carol Alexander, Michael Coulon, Y. Han, and Xiaochun Meng. Evaluating the discrimination ability of proper multi-variate scoring rules. Annals of Operations Research, March 2022. ISSN 1572-9338. https://doi.org/10.1007/s10479-022-04611-9.
- Allen et al. (2023a) Sam Allen, Jonas Bhend, Olivia Martius, and Johanna Ziegel. Weighted verification tools to evaluate univariate and multivariate probabilistic forecasts for high-impact weather events. Weather and Forecasting, 38(3):499–516, March 2023a. ISSN 1520-0434. https://doi.org/10.1175/waf-d-22-0161.1.
- Allen et al. (2023b) Sam Allen, David Ginsbourger, and Johanna Ziegel. Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3):906–940, August 2023b. ISSN 2166-2525. https://doi.org/10.1137/22m1532184.
- Allen et al. (2024) Sam Allen, Johanna Ziegel, and David Ginsbourger. Assessing the calibration of multivariate probabilistic forecasts. Quarterly Journal of the Royal Meteorological Society, 150(760):1315–1335, February 2024. ISSN 1477-870X. https://doi.org/10.1002/qj.4647.
- Anderson (1996) Jeffrey L. Anderson. A method for producing and evaluating probabilistic forecasts from ensemble model integrations. Journal of Climate, 9(7):1518–1530, July 1996. ISSN 1520-0442. https://doi.org/10.1175/1520-0442(1996)009<1518:amfpae>2.0.co;2.
- Basse-O’Connor et al. (2021) Andreas Basse-O’Connor, Vytautė Pilipauskaitė, and Mark Podolskij. Power variations for fractional type infinitely divisible random fields. Electronic Journal of Probability, 26:1–35, 2021. https://doi.org/10.1214/21-EJP617.
- Ben Bouallègue et al. (2024a) Zied Ben Bouallègue, Mariana C. A. Clare, Linus Magnusson, Estibaliz Gascón, Michael Maier-Gerber, Martin Janoušek, Mark Rodwell, Florian Pinault, Jesper S. Dramsch, Simon T. K. Lang, Baudouin Raoult, Florence Rabier, Matthieu Chevallier, Irina Sandu, Peter Dueben, Matthew Chantry, and Florian Pappenberger. The rise of data-driven weather forecasting: A first statistical assessment of machine learning–based weather forecasts in an operational-like context. Bulletin of the American Meteorological Society, 105(6):E864–E883, June 2024a. ISSN 1520-0477. https://doi.org/10.1175/bams-d-23-0162.1.
- Ben Bouallègue et al. (2024b) Zied Ben Bouallègue, Jonathan A. Weyn, Mariana C. A. Clare, Jesper Dramsch, Peter Dueben, and Matthew Chantry. Improving medium-range ensemble weather forecasts with hierarchical ensemble transformers. Artificial Intelligence for the Earth Systems, 3(1), January 2024b. ISSN 2769-7525. https://doi.org/10.1175/aies-d-23-0027.1.
- Benassi et al. (2004) Albert Benassi, Serge Cohen, and Jacques Istas. On roughness indices for fractional fields. Bernoulli, 10(2):357–373, 2004. https://doi.org/10.3150/bj/1082380223.
- Berlinet and Thomas-Agnan (2004) Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, Boston, MA, 2004. ISBN 1-4020-7679-7. https://doi.org/10.1007/978-1-4419-9096-9. With a preface by Persi Diaconis.
- Bi et al. (2023) Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619(7970):533–538, July 2023. ISSN 1476-4687. https://doi.org/10.1038/s41586-023-06185-3.
- Bjerregård et al. (2021) Mathias Blicher Bjerregård, Jan Kloppenborg Møller, and Henrik Madsen. An introduction to multivariate probabilistic forecast evaluation. Energy and AI, 4:100058, June 2021. ISSN 2666-5468. https://doi.org/10.1016/j.egyai.2021.100058.
- Bolin and Wallin (2023) David Bolin and Jonas Wallin. Local scale invariance and robustness of proper scoring rules. Statistical Science, 38(1), February 2023. https://doi.org/10.1214/22-sts864.
- Bosse et al. (2023) Nikos I. Bosse, Sam Abbott, Anne Cori, Edwin van Leeuwen, Johannes Bracher, and Sebastian Funk. Scoring epidemiological forecasts on transformed scales. PLOS Computational Biology, 19(8):e1011393, August 2023. ISSN 1553-7358. https://doi.org/10.1371/journal.pcbi.1011393.
- Brehmer (2017) Jonas Brehmer. Elicitability and its application in risk management. July 2017. https://doi.org/10.48550/ARXIV.1707.09604.
- Brehmer and Strokorb (2019) Jonas R. Brehmer and Kirstin Strokorb. Why scoring functions cannot assess tail properties. Electronic Journal of Statistics, 13(2), January 2019. ISSN 1935-7524. https://doi.org/10.1214/19-ejs1622.
- Bremnes (2019) John Bjørnar Bremnes. Ensemble postprocessing using quantile function regression based on neural networks and Bernstein polynomials. Monthly Weather Review, 148(1):403–414, December 2019. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-19-0227.1.
- Brier (1950) Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950. ISSN 1520-0493. https://doi.org/10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2.
- Bröcker (2009) Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, July 2009. ISSN 1477-870X. https://doi.org/10.1002/qj.456.
- Bröcker and Ben Bouallègue (2020) Jochen Bröcker and Zied Ben Bouallègue. Stratified rank histograms for ensemble forecast verification under serial dependence. Quarterly Journal of the Royal Meteorological Society, 146(729):1976–1990, April 2020. ISSN 1477-870X. https://doi.org/10.1002/qj.3778.
- Bröcker and Smith (2007) Jochen Bröcker and Leonard A. Smith. Scoring probabilistic forecasts: The importance of being proper. Weather and Forecasting, 22(2):382–388, April 2007. ISSN 0882-8156. https://doi.org/10.1175/waf966.1.
- Buschow (2022) Sebastian Buschow. Measuring displacement errors with complex wavelets. Weather and Forecasting, 37(6):953–970, June 2022. ISSN 1520-0434. https://doi.org/10.1175/waf-d-21-0180.1.
- Buschow and Friederichs (2020) Sebastian Buschow and Petra Friederichs. Using wavelets to verify the scale structure of precipitation forecasts. Advances in Statistical Climatology, Meteorology and Oceanography, 6(1):13–30, March 2020. ISSN 2364-3587. https://doi.org/10.5194/ascmo-6-13-2020.
- Buschow and Friederichs (2021) Sebastian Buschow and Petra Friederichs. SAD: Verifying the scale, anisotropy and direction of precipitation forecasts. Quarterly Journal of the Royal Meteorological Society, 147(735):1150–1169, January 2021. ISSN 1477-870X. https://doi.org/10.1002/qj.3964.
- Buschow et al. (2019) Sebastian Buschow, Jakiw Pidstrigach, and Petra Friederichs. Assessment of wavelet-based spatial verification by means of a stochastic precipitation model (wv_verif v0.1.0). Geoscientific Model Development, 12(8):3401–3418, August 2019. ISSN 1991-9603. https://doi.org/10.5194/gmd-12-3401-2019.
- Casati et al. (2022) Barbara Casati, Manfred Dorninger, Caio A. S. Coelho, Elizabeth E. Ebert, Chiara Marsigli, Marion P. Mittermaier, and Eric Gilleland. The 2020 international verification methods workshop online: Major outcomes and way forward. Bulletin of the American Meteorological Society, 103(3):E899–E910, March 2022. ISSN 1520-0477. https://doi.org/10.1175/bams-d-21-0126.1.
- Chapman et al. (2022) William E. Chapman, Luca Delle Monache, Stefano Alessandrini, Aneesh C. Subramanian, F. Martin Ralph, Shang-Ping Xie, Sebastian Lerch, and Negin Hayatbini. Probabilistic predictions from deterministic atmospheric river forecasts with deep learning. Monthly Weather Review, 150(1):215–234, January 2022. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-21-0106.1.
- Chen et al. (2023) Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, Yuanzheng Ci, Bin Li, Xiaokang Yang, and Wanli Ouyang. FengWu: Pushing the skillful global medium-range weather forecast beyond 10 days lead. April 2023. https://doi.org/10.48550/ARXIV.2304.02948.
- Christensen et al. (2014) H. M. Christensen, I. M. Moroz, and T. N. Palmer. Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. Quarterly Journal of the Royal Meteorological Society, 141(687):538–549, May 2014. ISSN 1477-870X. https://doi.org/10.1002/qj.2375.
- Dawid (1984) A. P. Dawid. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 147(2):278, 1984. ISSN 0035-9238. https://doi.org/10.2307/2981683.
- Dawid and Sebastiani (1999) A. Philip Dawid and Paola Sebastiani. Coherent dispersion criteria for optimal experimental design. The Annals of Statistics, 27(1), March 1999. ISSN 0090-5364. https://doi.org/10.1214/aos/1018031101.
- Dawid et al. (2015) A. Philip Dawid, Monica Musio, and Laura Ventura. Minimum scoring rule inference. Scandinavian Journal of Statistics, 43(1):123–138, August 2015. ISSN 1467-9469. https://doi.org/10.1111/sjos.12168.
- Dawid and Musio (2014) Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules. METRON, 72(2):169–183, April 2014. ISSN 2281-695X. https://doi.org/10.1007/s40300-014-0039-y.
- Delle Monache et al. (2013) Luca Delle Monache, F. Anthony Eckel, Daran L. Rife, Badrinath Nagarajan, and Keith Searight. Probabilistic weather prediction with an analog ensemble. Monthly Weather Review, 141(10):3498–3516, September 2013. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-12-00281.1.
- Diebold and Mariano (1995) Francis X. Diebold and Roberto S. Mariano. Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3):253–263, July 1995. ISSN 1537-2707. https://doi.org/10.1080/07350015.1995.10524599.
- Dorninger et al. (2018) Manfred Dorninger, Eric Gilleland, Barbara Casati, Marion P. Mittermaier, Elizabeth E. Ebert, Barbara G. Brown, and Laurence J. Wilson. The setup of the MesoVICT project. Bulletin of the American Meteorological Society, 99(9):1887–1906, September 2018. ISSN 1520-0477. https://doi.org/10.1175/bams-d-17-0164.1.
- Ebert (2008) Elizabeth E. Ebert. Fuzzy verification of high-resolution gridded forecasts: a review and proposed framework. Meteorological Applications, 15(1):51–64, 2008. https://doi.org/10.1002/met.25.
- Ehm and Gneiting (2012) Werner Ehm and Tilmann Gneiting. Local proper scoring rules of order two. The Annals of Statistics, 40(1), February 2012. ISSN 0090-5364. https://doi.org/10.1214/12-aos973.
- Ferro et al. (2008) Christopher A. T. Ferro, David S. Richardson, and Andreas P. Weigel. On the effect of ensemble size on the discrete and continuous ranked probability scores. Meteorological Applications, 15(1):19–24, March 2008. ISSN 1469-8080. https://doi.org/10.1002/met.45.
- Friederichs and Hense (2008) Petra Friederichs and Andreas Hense. A probabilistic forecast approach for daily precipitation totals. Weather and Forecasting, 23(4):659–673, August 2008. ISSN 0882-8156. https://doi.org/10.1175/2007waf2007051.1.
- Gilleland (2011) Eric Gilleland. Spatial forecast verification: Baddeley’s delta metric applied to the ICP test cases. Weather and Forecasting, 26(3):409–415, June 2011. ISSN 1520-0434. https://doi.org/10.1175/waf-d-10-05061.1.
- Gilleland et al. (2009) Eric Gilleland, David Ahijevych, Barbara G. Brown, Barbara Casati, and Elizabeth E. Ebert. Intercomparison of spatial forecast verification methods. Weather and Forecasting, 24(5):1416–1430, October 2009. ISSN 0882-8156. https://doi.org/10.1175/2009waf2222269.1.
- Gneiting (2011) Tilmann Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762, June 2011. ISSN 1537-274X. https://doi.org/10.1198/jasa.2011.r10138.
- Gneiting and Katzfuss (2014) Tilmann Gneiting and Matthias Katzfuss. Probabilistic forecasting. Annual Review of Statistics and Its Application, 1(1):125–151, January 2014. ISSN 2326-831X. https://doi.org/10.1146/annurev-statistics-062713-085831.
- Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, March 2007. ISSN 1537-274X. https://doi.org/10.1198/016214506000001437.
- Gneiting et al. (2005) Tilmann Gneiting, Adrian E. Raftery, Anton H. Westveld, and Tom Goldman. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5):1098–1118, May 2005. ISSN 0027-0644. https://doi.org/10.1175/mwr2904.1.
- Gneiting et al. (2007) Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(2):243–268, March 2007. ISSN 1467-9868. https://doi.org/10.1111/j.1467-9868.2007.00587.x.
- Gneiting et al. (2008) Tilmann Gneiting, Larissa I. Stanberry, Eric P. Grimit, Leonhard Held, and Nicholas A. Johnson. Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17(2):211–235, July 2008. ISSN 1863-8260. https://doi.org/10.1007/s11749-008-0114-x.
- Gneiting et al. (2023) Tilmann Gneiting, Sebastian Lerch, and Benedikt Schulz. Probabilistic solar forecasting: Benchmarks, post-processing, verification. Solar Energy, 252:72–80, March 2023. ISSN 0038-092X. https://doi.org/10.1016/j.solener.2022.12.054.
- Good (1952) I. J. Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1):107–114, January 1952. ISSN 2517-6161. https://doi.org/10.1111/j.2517-6161.1952.tb00104.x.
- Han and Szunyogh (2018) Fan Han and Istvan Szunyogh. A technique for the verification of precipitation forecasts and its application to a problem of predictability. Monthly Weather Review, 146(5):1303–1318, April 2018. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-17-0040.1.
- Heinrich-Mertsching et al. (2021) Claudio Heinrich-Mertsching, Thordis L. Thorarinsdottir, Peter Guttorp, and Max Schneider. Validation of point process predictions with proper scoring rules. October 2021.
- Hersbach (2000) Hans Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559–570, October 2000. ISSN 1520-0434. https://doi.org/10.1175/1520-0434(2000)015<0559:dotcrp>2.0.co;2.
- Holzmann and Eulert (2014) Hajo Holzmann and Matthias Eulert. The role of the information set for forecasting—with applications to risk management. The Annals of Applied Statistics, 8(1), March 2014. ISSN 1932-6157. https://doi.org/10.1214/13-aoas709.
- Hu et al. (2023) Weiming Hu, Mohammadvaghef Ghazvinian, William E. Chapman, Agniv Sengupta, Fred Martin Ralph, and Luca Delle Monache. Deep learning forecast uncertainty for precipitation over the Western United States. Monthly Weather Review, 151(6):1367–1385, June 2023. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-22-0268.1.
- Hyvärinen (2005) Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005. URL http://jmlr.org/papers/v6/hyvarinen05a.html.
- Jolliffe and Primo (2008) Ian T. Jolliffe and Cristina Primo. Evaluating rank histograms using decompositions of the chi-square test statistic. Monthly Weather Review, 136(6):2133–2139, June 2008. ISSN 0027-0644. https://doi.org/10.1175/2007mwr2219.1.
- Jordan et al. (2019) Alexander Jordan, Fabian Krüger, and Sebastian Lerch. Evaluating probabilistic forecasts with scoringRules. Journal of Statistical Software, 90(12), 2019. ISSN 1548-7660. https://doi.org/10.18637/jss.v090.i12.
- Jordan et al. (2011) Thomas H. Jordan, Yun-Tai Chen, Paolo Gasparini, Raul Madariaga, Ian Main, Warner Marzocchi, Gerassimos Papadopoulos, Gennady Sobolev, Koshun Yamaoka, and Jochen Zschau. Operational earthquake forecasting. State of knowledge and guidelines for utilization. Annals of Geophysics, 54(4), August 2011. ISSN 2037-416X. https://doi.org/10.4401/ag-5350.
- Jose (2007) Victor Richmond Jose. A characterization for the spherical scoring rule. Theory and Decision, 66(3):263–281, July 2007. ISSN 1573-7187. https://doi.org/10.1007/s11238-007-9067-x.
- Keisler (2022) Ryan Keisler. Forecasting global weather with graph neural networks. February 2022. https://doi.org/10.48550/ARXIV.2202.07575.
- Kullback and Leibler (1951) S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, March 1951. ISSN 0003-4851. https://doi.org/10.1214/aoms/1177729694.
- Lam et al. (2022) Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. GraphCast: Learning skillful medium-range global weather forecasting. December 2022. https://doi.org/10.48550/ARXIV.2212.12794.
- Lerch and Polsterer (2022) Sebastian Lerch and Kai L. Polsterer. Convolutional autoencoders for spatially-informed ensemble post-processing. In ICLR 2022 - AI for Earth and Space Science Workshop, 2022.
- Lerch and Thorarinsdottir (2013) Sebastian Lerch and Thordis L. Thorarinsdottir. Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus A: Dynamic Meteorology and Oceanography, 65(1):21206, December 2013. ISSN 1600-0870. https://doi.org/10.3402/tellusa.v65i0.21206.
- Lerch et al. (2017) Sebastian Lerch, Thordis L. Thorarinsdottir, Francesco Ravazzolo, and Tilmann Gneiting. Forecaster’s dilemma: Extreme events and forecast evaluation. Statistical Science, 32(1), February 2017. ISSN 0883-4237. https://doi.org/10.1214/16-sts588.
- Matheron (1963) Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, December 1963. ISSN 0361-0128. https://doi.org/10.2113/gsecongeo.58.8.1246.
- Matheson and Winkler (1976) James E. Matheson and Robert L. Winkler. Scoring rules for continuous probability distributions. Management Science, 22, 1976. https://doi.org/10.2307/2629907.
- Meng et al. (2023) Xiaochun Meng, James W. Taylor, Souhaib Ben Taieb, and Siran Li. Scores for multivariate distributions and level sets. Operations Research, July 2023. ISSN 1526-5463. https://doi.org/10.1287/opre.2020.0365.
- Murphy and Winkler (1987) Allan H. Murphy and Robert L. Winkler. A general framework for forecast verification. Monthly Weather Review, 115(7):1330–1338, July 1987. ISSN 1520-0493. https://doi.org/10.1175/1520-0493(1987)115<1330:agfffv>2.0.co;2.
- Nowotarski and Weron (2018) Jakub Nowotarski and Rafał Weron. Recent advances in electricity price forecasting: A review of probabilistic forecasting. Renewable and Sustainable Energy Reviews, 81:1548–1568, January 2018. ISSN 1364-0321. https://doi.org/10.1016/j.rser.2017.05.234.
- Pacchiardi et al. (2024) Lorenzo Pacchiardi, Rilwan Adewoyin, Peter Dueben, and Ritabrata Dutta. Probabilistic forecasting with generative networks via scoring rule minimization. Journal of Machine Learning Research, 25(45):1–64, 2024. URL https://jmlr.org/papers/v25/23-0038.html.
- Palmer (2012) T. N. Palmer. Towards the probabilistic earth-system simulator: a vision for the future of climate and weather prediction. Quarterly Journal of the Royal Meteorological Society, 138(665):841–861, April 2012. ISSN 1477-870X. https://doi.org/10.1002/qj.1923.
- Parry et al. (2012) Matthew Parry, A. Philip Dawid, and Steffen Lauritzen. Proper local scoring rules. The Annals of Statistics, 40(1), February 2012. ISSN 0090-5364. https://doi.org/10.1214/12-aos971.
- Pathak et al. (2022) Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. February 2022.
- Pinson and Girard (2012) P. Pinson and R. Girard. Evaluating the quality of scenarios of short-term wind power generation. Applied Energy, 96:12–20, August 2012. https://doi.org/10.1016/j.apenergy.2011.11.004.
- Pinson (2013) Pierre Pinson. Wind energy: Forecasting challenges for its operational management. Statistical Science, 28(4), November 2013. ISSN 0883-4237. https://doi.org/10.1214/13-sts445.
- Pinson and Tastu (2013) Pierre Pinson and Julija Tastu. Discrimination ability of the energy score. DTU Compute - Technical Report, 2013.
- Radanovics et al. (2018) Sabine Radanovics, Jean-Philippe Vidal, and Eric Sauquet. Spatial verification of ensemble precipitation: An ensemble version of SAL. Weather and Forecasting, 33(4):1001–1020, July 2018. ISSN 1520-0434. https://doi.org/10.1175/waf-d-17-0162.1.
- Rasp and Lerch (2018) Stephan Rasp and Sebastian Lerch. Neural networks for postprocessing ensemble weather forecasts. Monthly Weather Review, 146(11):3885–3900, October 2018. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-18-0187.1.
- Rasp et al. (2024) Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russel, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallègue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, and Fei Sha. WeatherBench 2: A benchmark for the next generation of data-driven global weather models. 2024. https://doi.org/10.48550/ARXIV.2308.15560.
- Rivoire et al. (2023) Pauline Rivoire, Olivia Martius, Philippe Naveau, and Alexandre Tuel. Assessment of subseasonal-to-seasonal (S2S) ensemble extreme precipitation forecast skill over Europe. Natural Hazards and Earth System Sciences, 23(8):2857–2871, August 2023. ISSN 1684-9981. https://doi.org/10.5194/nhess-23-2857-2023.
- Roberts and Lean (2008) Nigel M. Roberts and Humphrey W. Lean. Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Monthly Weather Review, 136(1):78–97, January 2008. ISSN 0027-0644. https://doi.org/10.1175/2007mwr2123.1.
- Roulston and Smith (2002) Mark S. Roulston and Leonard A. Smith. Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130(6):1653–1660, June 2002. ISSN 1520-0493. https://doi.org/10.1175/1520-0493(2002)130<1653:epfuit>2.0.co;2.
- Scheuerer and Hamill (2015) Michael Scheuerer and Thomas M. Hamill. Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities. Monthly Weather Review, 143(4):1321–1334, 2015. https://doi.org/10.1175/mwr-d-14-00269.1.
- Schorlemmer et al. (2018) Danijel Schorlemmer, Maximilian J. Werner, Warner Marzocchi, Thomas H. Jordan, Yosihiko Ogata, David D. Jackson, Sum Mak, David A. Rhoades, Matthew C. Gerstenberger, Naoshi Hirata, Maria Liukis, Philip J. Maechling, Anne Strader, Matteo Taroni, Stefan Wiemer, Jeremy D. Zechar, and Jiancang Zhuang. The collaboratory for the study of earthquake predictability: Achievements and priorities. Seismological Research Letters, 89(4):1305–1313, June 2018. ISSN 1938-2057. https://doi.org/10.1785/0220180053.
- Schulz and Lerch (2022) Benedikt Schulz and Sebastian Lerch. Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Monthly Weather Review, 150(1):235–257, January 2022. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-21-0150.1.
- Shannon (1948) C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(4):623–656, October 1948. ISSN 0005-8580. https://doi.org/10.1002/j.1538-7305.1948.tb00917.x.
- Smola et al. (2007) Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-75225-7.
- Stein and Stoop (2022) Joël Stein and Fabien Stoop. Neighborhood-based ensemble evaluation using the CRPS. Monthly Weather Review, 150(8):1901–1914, August 2022. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-21-0224.1.
- Steinwart and Christmann (2008) Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, 2008. ISBN 978-0-387-77241-7.
- Steinwart and Ziegel (2021) Ingo Steinwart and Johanna F. Ziegel. Strictly proper kernel scores and characteristic kernels on compact spaces. Applied and Computational Harmonic Analysis, 51:510–542, 2021. ISSN 1063-5203. https://doi.org/10.1016/j.acha.2019.11.005. URL https://www.sciencedirect.com/science/article/pii/S1063520317301483.
- Székely (2003) Gábor Székely. E-statistics: The energy of statistical samples. Technical report, Bowling Green State University, 2003.
- Taillardat (2021) Maxime Taillardat. Skewed and mixture of Gaussian distributions for ensemble postprocessing. Atmosphere, 12(8):966, July 2021. ISSN 2073-4433. https://doi.org/10.3390/atmos12080966.
- Taillardat and Mestre (2020) Maxime Taillardat and Olivier Mestre. From research to applications – examples of operational ensemble post-processing in France using machine learning. Nonlinear Processes in Geophysics, 27(2):329–347, May 2020. ISSN 1607-7946. https://doi.org/10.5194/npg-27-329-2020.
- Taillardat et al. (2016) Maxime Taillardat, Olivier Mestre, Michaël Zamo, and Philippe Naveau. Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Monthly Weather Review, 144(6):2375–2393, June 2016. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-15-0260.1.
- Talagrand et al. (1997) O. Talagrand, R. Vautard, and B Strauss. Evaluation of probabilistic prediction systems. In Workshop on Predictability, 20-22 October 1997, pages 1–26, Shinfield Park, Reading, 1997. ECMWF.
- Thorarinsdottir and Schuhen (2018) Thordis L. Thorarinsdottir and Nina Schuhen. Verification: Assessment of Calibration and Accuracy, pages 155–186. Elsevier, 2018. https://doi.org/10.1016/b978-0-12-812372-0.00006-6.
- Thorarinsdottir et al. (2013) Thordis L. Thorarinsdottir, Tilmann Gneiting, and Nadine Gissibl. Using proper divergence functions to evaluate climate models. SIAM/ASA Journal on Uncertainty Quantification, 1(1):522–534, January 2013. ISSN 2166-2525. https://doi.org/10.1137/130907550.
- Tsyplakov (2011) Alexander Tsyplakov. Evaluating density forecasts: A comment. SSRN Electronic Journal, 2011. ISSN 1556-5068. https://doi.org/10.2139/ssrn.1907799.
- Tsyplakov (2013) Alexander Tsyplakov. Evaluation of probabilistic forecasts: Proper scoring rules and moments. SSRN Electronic Journal, 2013. ISSN 1556-5068. https://doi.org/10.2139/ssrn.2236605.
- Tsyplakov (2020) Alexander Tsyplakov. Evaluation of probabilistic forecasts: Conditional auto-calibration, 2020. URL https://www.sas.upenn.edu/~fdiebold/papers2/Tsyplakov_Auto_calibration_sent_eswc2020.pdf.
- Vannitsem et al. (2021) Stéphane Vannitsem, John Bjørnar Bremnes, Jonathan Demaeyer, Gavin R. Evans, Jonathan Flowerdew, Stephan Hemri, Sebastian Lerch, Nigel Roberts, Susanne Theis, Aitor Atencia, Zied Ben Bouallègue, Jonas Bhend, Markus Dabernig, Lesley De Cruz, Leila Hieta, Olivier Mestre, Lionel Moret, Iris Odak Plenković, Maurice Schmeits, Maxime Taillardat, Joris Van den Bergh, Bert Van Schaeybroeck, Kirien Whan, and Jussi Ylhaisi. Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bulletin of the American Meteorological Society, 102(3):E681–E699, March 2021. ISSN 1520-0477. https://doi.org/10.1175/bams-d-19-0308.1.
- Wernli et al. (2008) Heini Wernli, Marcus Paulat, Martin Hagen, and Christoph Frei. SAL—a novel quality measure for the verification of quantitative precipitation forecasts. Monthly Weather Review, 136(11):4470–4487, November 2008. ISSN 0027-0644. https://doi.org/10.1175/2008mwr2415.1.
- Winkelbauer (2014) Andreas Winkelbauer. Moments and absolute moments of the normal distribution. September 2014. https://doi.org/10.48550/ARXIV.1209.4340.
- Winkler et al. (1996) R. L. Winkler, Javier Muñoz, José L. Cervera, José M. Bernardo, Gail Blattenberger, Joseph B. Kadane, Dennis V. Lindley, Allan H. Murphy, Robert M Oliver, and David Ríos-Insua. Scoring rules and the evaluation of probabilities. Test, 5(1):1–60, June 1996. ISSN 1863-8260. https://doi.org/10.1007/bf02562681.
- Winkler (1977) Robert L. Winkler. Rewarding Expertise in Probability Assessment, pages 127–140. Springer Netherlands, 1977. ISBN 9789401012768. https://doi.org/10.1007/978-94-010-1276-8_10.
- Zamo and Naveau (2017) Michaël Zamo and Philippe Naveau. Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Mathematical Geosciences, 50(2):209–234, November 2017. ISSN 1874-8953. https://doi.org/10.1007/s11004-017-9709-7.
- Ziel and Berk (2019) Florian Ziel and Kevin Berk. Multivariate forecasting evaluation: On sensitive and strictly proper scoring rules. 2019.
Appendix A Expected univariate scoring rules
A.1 Squared Error
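For any distributions $F$ and $G$ on $\mathbb{R}$ with finite second moments, the expectation of the squared error (2) is:
$$\mathbb{E}_G[\mathrm{SE}(F,Y)] = (\mu_F-\mu_G)^2 + \sigma_G^2,$$
where $\mu_F$ is the mean of the distribution $F$ and $\mu_G$ and $\sigma_G^2$ are the mean and the variance of the distribution $G$.
Proof.
$$\mathbb{E}_G[\mathrm{SE}(F,Y)] = \mathbb{E}_G[(\mu_F-Y)^2] = \mu_F^2 - 2\mu_F\mu_G + \mathbb{E}_G[Y^2].$$
Using the fact that $\mathbb{E}_G[Y^2] = \mu_G^2 + \sigma_G^2$,
$$\mathbb{E}_G[\mathrm{SE}(F,Y)] = \mu_F^2 - 2\mu_F\mu_G + \mu_G^2 + \sigma_G^2 = (\mu_F-\mu_G)^2 + \sigma_G^2.$$
∎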
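The decomposition of the expected squared error into a squared bias term and the variance of the observation can be checked numerically. The following is a minimal sketch, assuming the squared error takes the usual form SE(F, y) = (mu_F - y)^2 and using small discrete distributions so that expectations are exact finite sums. The helper names and the toy distributions are illustrative assumptions and do not come from the article.

```python
# Sketch: exact check of E_G[SE(F, Y)] = (mu_F - mu_G)^2 + sigma_G^2
# for finite discrete distributions given as (values, probabilities).
# Helper names and toy distributions are hypothetical.

def mean(values, probs):
    return sum(v * p for v, p in zip(values, probs))

def variance(values, probs):
    m = mean(values, probs)
    return sum(p * (v - m) ** 2 for v, p in zip(values, probs))

def expected_squared_error(mu_f, g_values, g_probs):
    # E_G[(mu_F - Y)^2], computed directly from the definition.
    return sum(p * (mu_f - y) ** 2 for y, p in zip(g_values, g_probs))

f_values, f_probs = [0.0, 1.0, 4.0], [0.2, 0.5, 0.3]    # forecast F
g_values, g_probs = [-1.0, 2.0, 3.0], [0.3, 0.3, 0.4]   # law G of the observation

mu_f = mean(f_values, f_probs)
lhs = expected_squared_error(mu_f, g_values, g_probs)
rhs = (mu_f - mean(g_values, g_probs)) ** 2 + variance(g_values, g_probs)
print(lhs, rhs)
```

Only the mean of F enters the score, which is why the squared error cannot distinguish forecasts that share the same mean.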
A.2 Quantile Score
For any , the expectation of the quantile score of level (4) is :
Proof.
Inspired by the proof of the propriety of the quantile score in Friederichs and Hense (2008).
Then, using ,
∎
A.3 Absolute Error
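First of all, for any distribution $F$ and any observation $y \in \mathbb{R}$, the absolute error (3) is equal to twice the quantile score of level $\alpha = 0.5$:
$$\mathrm{AE}(F,y) = |\mathrm{med}(F) - y| = 2\,\mathrm{QS}_{0.5}(F,y),$$
where $\mathrm{med}(F)$ is the median of the distribution $F$.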
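The relation between the absolute error and the quantile score at level 0.5 can be illustrated with a short sketch. It assumes that the quantile score is the pinball loss (1{y <= q} - alpha)(q - y), where q is the predictive quantile of level alpha; this form is an assumption about Equation (4), and the function names are hypothetical.

```python
# Sketch: the absolute error equals twice the pinball loss at level 0.5.
# The pinball-loss form below is an assumed version of the quantile score.

def quantile_score(q, y, alpha):
    indicator = 1.0 if y <= q else 0.0
    return (indicator - alpha) * (q - y)

def absolute_error(median, y):
    return abs(median - y)

median = 1.5                          # predictive median of a hypothetical forecast
observations = [-2.0, 0.0, 1.5, 3.25]
pairs = [
    (absolute_error(median, y), 2.0 * quantile_score(median, y, 0.5))
    for y in observations
]
print(pairs)
```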
It can be deduced that, for any , the expectation of the absolute error is :
A.4 Brier score
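For any distributions $F$ and $G$ on $\mathbb{R}$ and any threshold $t \in \mathbb{R}$, the expectation of the Brier score (5) is:
$$\mathbb{E}_G[\mathrm{BS}_t(F,Y)] = (F(t)-G(t))^2 + G(t)(1-G(t)).$$
Proof.
$$\mathbb{E}_G[\mathrm{BS}_t(F,Y)] = \mathbb{E}_G\big[(F(t)-\mathbb{1}_{Y\leq t})^2\big] = F(t)^2 - 2F(t)G(t) + G(t) = (F(t)-G(t))^2 + G(t)(1-G(t)).$$
∎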
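As a sanity check, the expected Brier score can be evaluated exactly on a discrete example. The sketch below assumes the threshold form BS_t(F, y) = (F(t) - 1{y <= t})^2 and the decomposition (F(t) - G(t))^2 + G(t)(1 - G(t)); the function names and the toy distributions are illustrative assumptions.

```python
# Sketch: exact check of E_G[BS_t(F, Y)] = (F(t) - G(t))^2 + G(t) (1 - G(t))
# for discrete distributions. The Brier score form is an assumption.

def cdf(values, probs, t):
    return sum(p for v, p in zip(values, probs) if v <= t)

def expected_brier(f_t, g_values, g_probs, t):
    # Average of (F(t) - 1{Y <= t})^2 over the law G of the observation.
    return sum(
        p * (f_t - (1.0 if y <= t else 0.0)) ** 2
        for y, p in zip(g_values, g_probs)
    )

f_values, f_probs = [0.0, 1.0, 4.0], [0.2, 0.5, 0.3]
g_values, g_probs = [-1.0, 2.0, 3.0], [0.3, 0.3, 0.4]
t = 2.0

f_t = cdf(f_values, f_probs, t)
g_t = cdf(g_values, g_probs, t)
lhs = expected_brier(f_t, g_values, g_probs, t)
rhs = (f_t - g_t) ** 2 + g_t * (1.0 - g_t)
print(lhs, rhs)
```

The second term depends only on G, so it plays the role of an entropy term that no forecast can reduce.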
A.5 Continuous Ranked Probability Score
For any , the expectation of the CRPS (7) is :
where the second term of the last line is the entropy of the CRPS.
Proof.
∎
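The expected CRPS can be illustrated on discrete distributions, where every quantity is an exact finite sum. The sketch below assumes two standard representations: the kernel form E|X - Y| - (1/2) E|X - X'| and the decomposition into the integral of (F - G)^2 plus the entropy term (1/2) E|Y - Y'|, which equals the integral of G(1 - G). All names and the toy distributions are hypothetical.

```python
# Sketch: exact check that, for discrete F and G,
#   E|X - Y| - 0.5 E|X - X'|  =  int (F - G)^2 dz + 0.5 E|Y - Y'|.

def cdf(values, probs, z):
    return sum(p for v, p in zip(values, probs) if v <= z)

def mean_abs_diff(a, b):
    # E|A - B| for independent A ~ a and B ~ b, each given as (values, probs).
    return sum(
        pa * pb * abs(va - vb)
        for va, pa in zip(*a)
        for vb, pb in zip(*b)
    )

def integrated_squared_cdf_gap(f, g):
    # Both CDFs are step functions, constant between consecutive support points.
    knots = sorted(set(f[0]) | set(g[0]))
    total = 0.0
    for left, right in zip(knots, knots[1:]):
        gap = cdf(f[0], f[1], left) - cdf(g[0], g[1], left)
        total += gap ** 2 * (right - left)
    return total

f = ([0.0, 1.0, 4.0], [0.2, 0.5, 0.3])     # forecast F
g = ([-1.0, 2.0, 3.0], [0.3, 0.3, 0.4])    # law G of the observation

expected_crps = mean_abs_diff(f, g) - 0.5 * mean_abs_diff(f, f)
entropy_term = 0.5 * mean_abs_diff(g, g)
decomposition = integrated_squared_cdf_gap(f, g) + entropy_term
print(expected_crps, decomposition)
```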
A.6 Dawid-Sebastiani score
For any , the expectation of the Dawid-Sebastiani score (9) is :
Proof.
Noticing that ,
∎
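A numerical illustration may help here. The sketch below assumes the Dawid-Sebastiani score is written as (y - mu_F)^2 / sigma_F^2 + log(sigma_F^2), so that its expectation under G is ((mu_F - mu_G)^2 + sigma_G^2) / sigma_F^2 + log(sigma_F^2); this closed form and the helper names are assumptions made for the example.

```python
import math

# Sketch: exact check of the expected Dawid-Sebastiani score on discrete laws.
# The score form (squared standardized error plus log-variance) is assumed.

def moments(values, probs):
    m = sum(v * p for v, p in zip(values, probs))
    var = sum(p * (v - m) ** 2 for v, p in zip(values, probs))
    return m, var

def dawid_sebastiani(mu_f, var_f, y):
    return (y - mu_f) ** 2 / var_f + math.log(var_f)

f_values, f_probs = [0.0, 1.0, 4.0], [0.2, 0.5, 0.3]
g_values, g_probs = [-1.0, 2.0, 3.0], [0.3, 0.3, 0.4]

mu_f, var_f = moments(f_values, f_probs)
mu_g, var_g = moments(g_values, g_probs)

lhs = sum(p * dawid_sebastiani(mu_f, var_f, y) for y, p in zip(g_values, g_probs))
rhs = ((mu_f - mu_g) ** 2 + var_g) / var_f + math.log(var_f)
print(lhs, rhs)
```

The numerator of the first term is the expected squared error of A.1, so the score depends on F only through its first two moments.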
A.7 Error-spread score
A.8 Logarithmic score
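For any $F, G$ such that $F$ and $G$ have probability density functions $f$ and $g$ in the class $\mathcal{L}_1(\mathbb{R})$, the expectation of the logarithmic score (11) is:
$$\mathbb{E}_G[\mathrm{LogS}(F,Y)] = D_{\mathrm{KL}}(G\|F) + H(G),$$
where $D_{\mathrm{KL}}(G\|F)$ is the Kullback-Leibler divergence from $F$ to $G$ and $H(G)$ is the Shannon entropy of $G$. The proof is straightforward given that the Kullback-Leibler divergence and Shannon entropy are defined as
$$D_{\mathrm{KL}}(G\|F) = \int_{\mathbb{R}} g(y)\log\frac{g(y)}{f(y)}\,\mathrm{d}y \quad\text{and}\quad H(G) = -\int_{\mathbb{R}} g(y)\log g(y)\,\mathrm{d}y.$$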
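The same decomposition holds for probability mass functions, which gives a convenient exact check. The sketch below is a discrete analogue (an assumption made for illustration, since the appendix deals with densities): the expected logarithmic score -E_G[log f(Y)] is compared with the sum of the Kullback-Leibler divergence and the Shannon entropy.

```python
import math

# Sketch: discrete analogue of E_G[LogS(F, Y)] = KL(G || F) + H(G).
# f and g are probability mass functions on a common three-point support.

f = [0.2, 0.5, 0.3]   # forecast probabilities (hypothetical)
g = [0.3, 0.3, 0.4]   # true probabilities (hypothetical)

expected_log_score = -sum(gi * math.log(fi) for fi, gi in zip(f, g))
kl_divergence = sum(gi * math.log(gi / fi) for fi, gi in zip(f, g))
entropy = -sum(gi * math.log(gi) for gi in g)
print(expected_log_score, kl_divergence + entropy)
```

Because the divergence is nonnegative and vanishes only when the two laws coincide, the expected score is minimized by forecasting G itself.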
A.9 Hyvärinen score
For such that their densities exist, are twice continuously differentiable and satisfy as and as , the expectation of the Hyvärinen score is :
where the last formula shows the entropy of the Hyvärinen score (second term on the right-hand side).
Proof.
Integrating by parts the integral of the first term on the right-hand side leads to:
The boundary term is zero since as and is a probability density function.
Thus,
∎
A.10 Quadratic score
For any , the expectation of the quadratic score is :
where .
A.11 Pseudospherical score
For any , the expectation of the pseudospherical score is:
where .
Appendix B Expected multivariate scoring rules
B.1 Squared error
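For any distributions $F$ and $G$ on $\mathbb{R}^d$ with finite second moments, the expectation of the squared error (12) is:
$$\mathbb{E}_G[\mathrm{SE}(F,Y)] = \|\mu_F-\mu_G\|_2^2 + \mathrm{tr}(\Sigma_G),$$
where $\mu_F$ is the mean vector of the distribution $F$ and $\mu_G$ and $\Sigma_G$ are the mean vector and the covariance matrix of the distribution $G$.
Proof.
Let $\pi_i$ denote the projection on the $i$-th margin. Applying the univariate result of Appendix A.1 to each margin,
$$\mathbb{E}_G[\mathrm{SE}(F,Y)] = \sum_{i=1}^d \mathbb{E}_G\big[(\pi_i(\mu_F)-\pi_i(Y))^2\big] = \sum_{i=1}^d \big(\pi_i(\mu_F)-\pi_i(\mu_G)\big)^2 + \sum_{i=1}^d (\Sigma_G)_{ii} = \|\mu_F-\mu_G\|_2^2 + \mathrm{tr}(\Sigma_G).$$
∎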
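The multivariate decomposition can be verified on a small example. The sketch below assumes the multivariate squared error is the squared Euclidean distance between the forecast mean vector and the observation, and that its expectation is the squared distance between mean vectors plus the trace of the covariance matrix of G. The names and the toy distribution are hypothetical.

```python
# Sketch: exact check of E_G ||mu_F - Y||^2 = ||mu_F - mu_G||^2 + tr(Sigma_G)
# for a discrete bivariate law G.

g_points = [(0.0, 1.0), (2.0, -1.0), (3.0, 3.0)]   # support of G
g_probs = [0.5, 0.25, 0.25]
mu_f = (1.0, 0.5)                                  # mean vector of the forecast

def mean_vector(points, probs):
    dim = len(points[0])
    return [sum(p * x[i] for x, p in zip(points, probs)) for i in range(dim)]

def covariance_trace(points, probs):
    # The trace of the covariance matrix is the sum of the marginal variances.
    mu = mean_vector(points, probs)
    return sum(
        p * sum((x[i] - mu[i]) ** 2 for i in range(len(mu)))
        for x, p in zip(points, probs)
    )

lhs = sum(
    p * sum((m - yi) ** 2 for m, yi in zip(mu_f, y))
    for y, p in zip(g_points, g_probs)
)
mu_g = mean_vector(g_points, g_probs)
rhs = sum((a - b) ** 2 for a, b in zip(mu_f, mu_g)) + covariance_trace(g_points, g_probs)
print(lhs, rhs)
```

Only the diagonal of the covariance matrix appears, so the expected squared error carries no information about the dependence between components.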
B.2 Dawid-Sebastiani score
For any , the expectation of the Dawid-Sebastiani score is :
The proof is available in the original article (Dawid and Sebastiani, 1999).
B.3 Energy score
In a general setting, the expected energy score does not simplify. For any , the expected energy score (13) is :
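Although the expected energy score has no simpler closed form in general, the score itself is easy to evaluate for ensemble forecasts. The following sketch assumes the standard form ES(F, y) = E||X - y||^alpha - (1/2) E||X - X'||^alpha and uses the plug-in estimator that treats the ensemble as an empirical distribution. The function names and the default alpha = 1 are assumptions made for the example.

```python
import math

# Sketch: plug-in energy score of an ensemble forecast, treating the
# members as an equally weighted empirical distribution.

def powered_distance(a, b, alpha):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b))) ** alpha

def energy_score(ensemble, y, alpha=1.0):
    m = len(ensemble)
    to_observation = sum(powered_distance(x, y, alpha) for x in ensemble) / m
    spread = sum(
        powered_distance(a, b, alpha) for a in ensemble for b in ensemble
    ) / (2 * m * m)
    return to_observation - spread

# A degenerate ensemble at the origin scores the distance to the observation.
point_forecast_score = energy_score([(0.0, 0.0), (0.0, 0.0)], (3.0, 4.0))
# In dimension one with alpha = 1 the energy score reduces to the CRPS.
univariate_score = energy_score([(0.0,), (2.0,)], (1.0,))
print(point_forecast_score, univariate_score)
```

Averaging `energy_score(ensemble, y)` over draws of y from G gives a Monte Carlo estimate of the expected energy score.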
B.4 Variogram score
For any such that the -th moments of all their univariate margins are finite, the expected variogram score of order (14) is :
Proof.
∎
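One way to read the expected variogram score is as the expected squared error of A.1 applied to each pairwise quantity |Y_i - Y_j|^p. The sketch below assumes the variogram score is the weighted sum over pairs (i, j) of (|y_i - y_j|^p - E_F|X_i - X_j|^p)^2 with unit weights, and checks that its expectation under G equals the sum over pairs of a squared gap between mean pairwise powers plus the variance of |Y_i - Y_j|^p under G. This closed form, the unit weights, and all names are assumptions for illustration.

```python
# Sketch: expected variogram score under G, computed directly and via
# the "squared gap plus variance" form, for equally weighted ensembles.

def pair_power(x, i, j, p):
    return abs(x[i] - x[j]) ** p

def variogram_score(ensemble, y, p=0.5):
    d = len(y)
    score = 0.0
    for i in range(d):
        for j in range(d):
            forecast_mean = sum(pair_power(x, i, j, p) for x in ensemble) / len(ensemble)
            score += (pair_power(y, i, j, p) - forecast_mean) ** 2
    return score

def expected_variogram_score(ensemble, truth, p=0.5):
    d, n, m = len(truth[0]), len(truth), len(ensemble)
    total = 0.0
    for i in range(d):
        for j in range(d):
            obs = [pair_power(y, i, j, p) for y in truth]
            obs_mean = sum(obs) / n
            obs_var = sum((v - obs_mean) ** 2 for v in obs) / n
            forecast_mean = sum(pair_power(x, i, j, p) for x in ensemble) / m
            total += (forecast_mean - obs_mean) ** 2 + obs_var
    return total

ensemble = [(0.0, 1.0, 3.0), (1.0, 1.5, 2.0), (2.0, 0.0, 4.0)]   # forecast F
truth = [(0.5, 1.0, 2.5), (1.0, 3.0, 2.0)]                       # support of G

direct = sum(variogram_score(ensemble, y) for y in truth) / len(truth)
closed_form = expected_variogram_score(ensemble, truth)
print(direct, closed_form)
```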
B.5 Logarithmic score
For any such that and have probability density functions that belong to , the expectation of the logarithmic score is analogous to its univariate version :
where is the Kullback-Leibler divergence from to and is the Shannon entropy of .
B.6 Hyvärinen score
For such that their probability density functions and such that they are twice continuously differentiable and satisfying and as , the expectation of the Hyvärinen score is :
where is the gradient operator and is the scalar product. The proof is similar to the proof for the univariate case using integration by parts and Stokes' theorem (Parry et al., 2012).
B.7 Quadratic score
For any , the expectation of the quadratic score is analogous to its univariate version :
where .
B.8 Pseudospherical score
For any , the expectation of the pseudospherical score is analogous to its univariate version:
where .
Appendix C Proofs
C.1 Proposition 1
Proof of Proposition 1.
Let be a class of Borel probability measure on and let be a forecast and an observation. Let be a transformation and let be a scoring rule on that is proper relative to .
Given that and is proper relative to ,
(23)
∎
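The argument can be summarized as a single chain. As a sketch, assume the transformed scoring rule is $S_T(F,\mathbf{y}) = S(T(F),T(\mathbf{y}))$, where $T(F)$ denotes the distribution of $T(\mathbf{X})$ for $\mathbf{X}\sim F$, that scoring rules are negatively oriented, and that $\mathbf{Z} = T(\mathbf{Y})\sim T(G)$ (all assumed notation):

```latex
\begin{align*}
\mathbb{E}_G\left[S_T(F,\mathbf{Y})\right]
 &= \mathbb{E}_G\left[S(T(F),T(\mathbf{Y}))\right]
  = \mathbb{E}_{T(G)}\left[S(T(F),\mathbf{Z})\right] \\
 &\ge \mathbb{E}_{T(G)}\left[S(T(G),\mathbf{Z})\right]
  = \mathbb{E}_G\left[S_T(G,\mathbf{Y})\right].
\end{align*}
```

The inequality is the propriety of $S$ applied to the pair $T(F)$, $T(G)$. The two outer equalities only use the change of variables $\mathbf{Z} = T(\mathbf{Y})$.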
Proof of the strict propriety case in Proposition 1.
The notation is the same as in the proof above, except for the following. Let be an injective transformation and let be a scoring rule on that is strictly proper relative to .
The equality in Equation (23) leads to :
The fact that is strictly proper relative to leads to , and finally since is injective, we have . ∎
C.2 Proposition 3
Proof of Proposition 3.
The proof relies on the reproducing kernel Hilbert space (RKHS) representation of the kernel scoring rule . For background on kernel scoring rules, maximum mean discrepancies and RKHSs, we refer to Smola et al. (2007) or Steinwart and Christmann (2008, Section 4).
Let denote the RKHS associated with . We recall that contains all the functions and that the inner product on satisfies the property
The kernel mean embedding is a linear map sending an admissible distribution to a function in the RKHS, such that the image of the point measure is . Equation (16) giving the kernel scoring rule for an ensemble prediction can be written as
The properties of the kernel mean embedding ensure that this relation still holds for all . As a consequence, if is a Hilbertian basis of , we have
Finally, the properties of the kernel mean embedding ensure that, for all ,
whence the result follows. ∎
C.3 Proof of examples illustrating Proposition 3
Next, we illustrate Proposition 3 and provide some computations in two cases: the Gaussian kernel scoring rule and the continuous ranked probability score (CRPS).
Gaussian Kernel Scoring Rule. This is the scoring rule related to the Gaussian kernel
Using a series expansion of the exponential function, we have
with the transformation defined, for , by
As a consequence, the Gaussian kernel scoring rule can be written, for all and , as
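The series representation can be illustrated numerically. The sketch below rests on several assumptions: a univariate Gaussian kernel with unit bandwidth, k(x, y) = exp(-(x - y)^2 / 2); the feature functions T_l(x) = x^l exp(-x^2 / 2) / sqrt(l!) obtained from the series expansion of exp(xy); and the kernel score written as 0.5 E k(X, X') - E k(X, y) + 0.5 k(y, y), which may differ from Equation (16) by sign or scaling conventions. Under these assumptions, the kernel score of an ensemble equals half the sum over l of the squared errors between the ensemble mean of T_l and T_l(y).

```python
import math
import numpy as np

def gaussian_kernel(x, y):
    """Gaussian kernel exp(-(x - y)^2 / 2) with unit bandwidth (assumed)."""
    return np.exp(-0.5 * (x - y) ** 2)

def features(x, n_terms=80):
    """Truncated feature map T_l(x) = x^l exp(-x^2/2) / sqrt(l!), l = 0..n_terms-1."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    ls = np.arange(n_terms)
    log_norm = np.array([0.5 * math.lgamma(l + 1) for l in ls])  # log sqrt(l!)
    powers = np.power(x[:, None], ls[None, :])
    return powers * np.exp(-0.5 * x[:, None] ** 2 - log_norm[None, :])

def kernel_score(ens, y):
    """0.5 * mean k(x_i, x_j) - mean k(x_i, y) + 0.5 * k(y, y)."""
    ens = np.asarray(ens, dtype=float)
    kxx = gaussian_kernel(ens[:, None], ens[None, :]).mean()
    kxy = gaussian_kernel(ens, y).mean()
    return float(0.5 * kxx - kxy + 0.5 * gaussian_kernel(y, y))

def series_score(ens, y, n_terms=80):
    """0.5 * sum_l (mean_i T_l(x_i) - T_l(y))^2, a sum of squared errors."""
    diff = features(ens, n_terms).mean(axis=0) - features(y, n_terms)[0]
    return float(0.5 * np.sum(diff ** 2))

rng = np.random.default_rng(2)
ens = rng.normal(size=20)  # illustrative ensemble
y = 0.3                    # illustrative observation
print(kernel_score(ens, y), series_score(ens, y))
```

The truncation at 80 terms is an arbitrary choice that is ample for values of moderate magnitude. Larger values of |x| would need more terms.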
Continuous Ranked Probability Score. The CRPS is the scoring rule with kernel . This kernel is the covariance of the Brownian motion on and its RKHS is known to be the Sobolev space , see Berlinet and Thomas-Agnan (2004). We recall the definition of the Sobolev space
where denotes the derivative of assumed to be defined almost everywhere and square-integrable. The inner product on is defined by
and one can easily check the fundamental relation
Here the derivative is taken with respect to the second variable . Then, we consider the Haar system defined as the collection of functions
with and . Since the Haar system is an orthonormal basis of the space and the map is an isomorphism between Hilbert spaces, we obtain an orthonormal basis of by considering the primitives vanishing at of the Haar basis functions. Setting and the primitive functions of and respectively, we obtain the system
The series representation of the CRPS is then deduced from Proposition 3 and its proof since the collection , is an orthonormal basis of the RKHS associated with the kernel of the CRPS.
Appendix D General form of Corollary 1
Corollary 2.
Let be a set of transformations from to . Let be a set of proper scoring rules such that is proper relative to , for all . Let be nonnegative weights. Then the scoring rule
is proper relative to .
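For ensemble forecasts, this construction can be sketched as follows. The example assumes scalar-valued transformations applied member by member and uses the ensemble CRPS as the scoring rule of each transformed quantity. The function names and the choice of transformations (mean and maximum) are illustrative assumptions.

```python
import numpy as np

def crps_ensemble(ens, y):
    """CRPS of a univariate ensemble: mean|x_i - y| - 0.5 * mean|x_i - x_j|."""
    ens = np.asarray(ens, dtype=float)
    spread = np.abs(ens[:, None] - ens[None, :]).mean()
    return float(np.abs(ens - y).mean() - 0.5 * spread)

def aggregated_transformed_score(ens, y, transforms, scores, weights):
    """Weighted sum of S_i(T_i(F), T_i(y)) for an ensemble forecast.

    ens: array (M, d) of ensemble members; y: array (d,) observation.
    transforms: callables mapping a d-vector to a scalar.
    scores: callables (transformed ensemble, transformed observation) -> float.
    weights: nonnegative weights, one per (transformation, score) pair.
    """
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    total = 0.0
    for transform, score, weight in zip(transforms, scores, weights):
        # Apply T_i to every member to obtain the transformed ensemble.
        t_ens = np.array([transform(member) for member in ens])
        total += weight * score(t_ens, transform(y))
    return total

# Illustrative two-member ensemble in dimension 2.
ens = np.array([[0.0, 2.0], [2.0, 4.0]])
y = np.array([1.0, 3.0])
value = aggregated_transformed_score(
    ens, y, [np.mean, np.max], [crps_ensemble, crps_ensemble], [1.0, 2.0]
)
print(value)
```

Each pair of transformation and scoring rule targets one feature of the forecast, and the weights set the relative importance of these features in the aggregate.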
Appendix E Scoring rules of the simulation study
The following formulas are deduced for a probabilistic forecast taking the form of the Gaussian random field model of Equation (20). The formulas of the aggregated univariate scoring rules can be obtained from the formulas in Gneiting and Raftery (2007) and Jordan et al. (2019) and, thus, are not presented here. We focus on the expression of the variogram score and the CRPS of spatial mean.
Variogram Score
For , the absolute moment is (Winkelbauer, 2014) :
(24)
where is the confluent hypergeometric function of the first kind. For ,
This leads to
Finally,
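The absolute moment in (24) is easy to evaluate numerically. The sketch below assumes that (24) is the formula of Winkelbauer (2014) for X ~ N(mu, sigma^2) and p > -1, namely E|X|^p = sigma^p 2^(p/2) Gamma((p+1)/2) / sqrt(pi) times 1F1(-p/2; 1/2; -mu^2 / (2 sigma^2)). The confluent hypergeometric function is computed here with a plain Kummer series, which is an illustrative choice suited to moderate values of |mu| / sigma. The function names are hypothetical.

```python
import math

def hyp1f1_series(a, b, z, tol=1e-15, max_terms=1000):
    """Kummer series sum_n (a)_n / (b)_n * z^n / n!, intended for z >= 0."""
    term, total = 1.0, 1.0
    for n in range(max_terms):
        term *= (a + n) / (b + n) * z / (n + 1)
        total += term
        if abs(term) <= tol * abs(total):
            break
    return total

def gaussian_abs_moment(mu, sigma, p):
    """E|X|^p for X ~ N(mu, sigma^2) and p > -1.

    Uses Kummer's transformation 1F1(a; b; -z) = exp(-z) * 1F1(b - a; b; z)
    so that the series only contains positive terms.
    """
    z = mu ** 2 / (2.0 * sigma ** 2)
    hyp = math.exp(-z) * hyp1f1_series(0.5 + p / 2.0, 0.5, z)
    scale = sigma ** p * 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0)
    return scale / math.sqrt(math.pi) * hyp

# Sanity check: the second absolute moment is mu^2 + sigma^2.
print(gaussian_abs_moment(1.5, 0.7, 2), 1.5 ** 2 + 0.7 ** 2)
```

Since the difference of two jointly Gaussian components is itself Gaussian, the same function gives the moments of pairwise differences once their mean and standard deviation are known.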
p-Variation Score
Denote . For , we have with
and
Using (24), this leads to
Finally,
CRPS of spatial mean
The CRPS of spatial mean is defined as
where is a set of spatial patches and is the weight associated with a patch . The mean of Gaussian marginals follows a Gaussian distribution :
where is the cardinality of the patch (i.e., the number of grid points belonging to ).
Finally,
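A minimal sketch of this computation is given below. It assumes that the score is the weighted sum over patches of the CRPS between the distribution of the patch mean and the observed patch mean, and it relies on two standard facts: the closed-form CRPS of a Gaussian distribution (Gneiting and Raftery, 2007) and the fact that the average of jointly Gaussian variables is Gaussian. The function names and the generic mean vector and covariance matrix inputs are illustrative assumptions and are not tied to the model of Equation (20).

```python
import math
import numpy as np

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def patch_mean_distribution(mean_vec, cov):
    """Mean and standard deviation of the average of jointly Gaussian variables."""
    mean_vec = np.asarray(mean_vec, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = mean_vec.size
    # Var(mean) = (sum of all covariance entries) / n^2.
    return float(mean_vec.mean()), float(np.sqrt(cov.sum()) / n)

def crps_spatial_mean(patches, weights, mean_vec, cov, obs):
    """Weighted sum over patches of the CRPS of the patch mean.

    patches: list of index lists (grid points belonging to each patch).
    """
    mean_vec, cov, obs = (np.asarray(a, dtype=float) for a in (mean_vec, cov, obs))
    total = 0.0
    for idx, weight in zip(patches, weights):
        idx = np.asarray(idx)
        mu, sd = patch_mean_distribution(mean_vec[idx], cov[np.ix_(idx, idx)])
        total += weight * crps_gaussian(mu, sd, float(obs[idx].mean()))
    return total

# Illustrative example: four independent standard Gaussian grid points, two patches.
score = crps_spatial_mean([[0, 1], [2, 3]], [0.5, 0.5],
                          np.zeros(4), np.eye(4), np.zeros(4))
print(score)
```

With independent grid points the standard deviation of a patch mean shrinks as one over the square root of the patch size. Spatial correlation enters only through the sum of the covariance entries within the patch.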