Proper Scoring Rules for Multivariate Probabilistic Forecasts based on Aggregation and Transformation

Romain Pic, Université de Franche Comté, CNRS, LmB (UMR 6623), F-25000 Besançon, France
Clément Dombry, Université de Franche Comté, CNRS, LmB (UMR 6623), F-25000 Besançon, France
Philippe Naveau, Laboratoire des Sciences du Climat et de l’Environnement, UMR 8212, CEA-CNRS-UVSQ, EstimR, IPSL & U Paris-Saclay, Gif-sur-Yvette, France
Maxime Taillardat, CNRM, Université de Toulouse, Météo-France, CNRS, Toulouse, France
Abstract

Proper scoring rules are an essential tool to assess the predictive performance of probabilistic forecasts. However, propriety alone does not ensure an informative characterization of predictive performance and it is recommended to compare forecasts using multiple scoring rules. With that in mind, interpretable scoring rules providing complementary information are necessary. We formalize a framework based on aggregation and transformation to build interpretable multivariate proper scoring rules. Aggregation-and-transformation-based scoring rules are able to target specific features of the probabilistic forecasts, which improves the characterization of the predictive performance. This framework is illustrated through examples taken from the literature and studied using numerical experiments showcasing its benefits. In particular, it is shown that it can help bridge the gap between proper scoring rules and spatial verification tools.

1 Introduction

Probabilistic forecasting makes it possible to issue forecasts that carry information about the prediction uncertainty. It has become an essential tool in numerous applied fields such as weather and climate prediction (Vannitsem et al., 2021; Palmer, 2012), earthquake forecasting (Jordan et al., 2011; Schorlemmer et al., 2018), electricity price forecasting (Nowotarski and Weron, 2018) or renewable energies (Pinson, 2013; Gneiting et al., 2023), among others. Moreover, it is slowly reaching fields further from "usual" forecasting, such as epidemiology predictions (Bosse et al., 2023) or breast cancer recurrence prediction (Al Masry et al., 2023). In weather forecasting, probabilistic forecasts often take the form of ensemble forecasts in which the dispersion among members captures forecast uncertainty.

The development of probabilistic forecasts has induced the need for appropriate verification methods. Forecast verification fulfills two main purposes: quantifying how good a forecast is given the available observations and allowing one to rank different forecasts according to their predictive performance. Scoring rules provide a single value to compare forecasts with observations. Propriety is a property of scoring rules that encourages forecasters to follow their true beliefs and that prevents hedging. Proper scoring rules allow the simultaneous assessment of calibration and sharpness (Winkler, 1977; Winkler et al., 1996). Calibration is the statistical compatibility between forecasts and observations. Sharpness is the uncertainty of the forecast itself. Propriety is a necessary property of good scoring rules, but it does not guarantee that a scoring rule provides an informative characterization of predictive performance. In univariate and multivariate settings, numerous studies have proven that no scoring rule has it all, and thus, different scoring rules should be used to get a better understanding of the predictive performance of forecasts (see, e.g., Scheuerer and Hamill 2015; Taillardat 2021; Bjerregård et al. 2021). With that in mind, Scheuerer and Hamill (2015) "strongly recommend that several different scores be always considered before drawing conclusions." This amplifies the need for numerous complementary proper scoring rules that are well-understood to facilitate forecast verification. In that direction, Dorninger et al. (2018) state that "gaining an in-depth understanding of forecast performance depends on grasping the full meaning of the verification results."
Interpretability of proper scoring rules can arise from being induced by a consistent scoring function for a functional (e.g., the squared error is induced by a scoring function consistent for the mean; Gneiting 2011), knowing what aspects of the forecast the scoring rule discriminates (e.g., the Dawid-Sebastiani score only discriminates forecasts through their mean and variance; Dawid and Sebastiani 1999) or knowing the limitations of a certain proper scoring rule (e.g., the variogram score is incapable of discriminating two forecasts that only differ by a constant bias; Scheuerer and Hamill 2015). In this context, interpretable proper scoring rules become verification methods of choice as the ranking of forecasts they produce can be more informative than the ranking of a more complex but less interpretable scoring rule. Section 2 provides an in-depth explanation of this in the case of univariate scoring rules. It is worth noting that interpretability of a scoring rule can also arise from its decomposition into meaningful terms (see, e.g., Bröcker 2009). This type of interpretability can be used complementarily to the framework proposed in this article.

Scheuerer and Hamill (2015) proposed the variogram score to target the verification of the dependence structure. The variogram score of order $p$ ($p>0$) is defined as

\[
\mathrm{VS}_p(F,\bm{y})=\sum_{i,j=1}^{d}w_{ij}\left(\mathbb{E}_F\left[|X_i-X_j|^p\right]-|y_i-y_j|^p\right)^2,
\]

where $X_i$ is the $i$-th component of the random vector $\bm{X}\in\mathbb{R}^d$ following $F$, the $w_{ij}$ are nonnegative weights and $\bm{y}\in\mathbb{R}^d$ is an observation. The construction of the variogram score relies on two main principles. First, the variogram score is the weighted sum of scoring rules acting on the distribution of $\bm{X}_{i,j}=(X_i,X_j)$ and on paired components of the observations $y_{i,j}$. This aggregation principle allows the combination of proper scoring rules and summarizes them into a proper scoring rule acting on the whole distribution $F$ and observations $\bm{y}$. Second, the scoring rules composing the weighted sum can be seen as a standard proper scoring rule applied to transformations of both forecasts and observations. Let us denote $\gamma_{i,j}:\bm{x}\mapsto|x_i-x_j|^p$ the transformation related to the variogram of order $p$, then the variogram score can be rewritten as

\[
\mathrm{VS}_p(F,\bm{y})=\sum_{i,j=1}^{d}w_{ij}\,\mathrm{SE}(\gamma_{i,j}(F),\gamma_{i,j}(\bm{y})),
\]

where $\mathrm{SE}(F,y)=(\mathbb{E}_F[X]-y)^2$ is the univariate squared error and $\gamma_{i,j}(F)$ is the distribution of $\gamma_{i,j}(\bm{X})$ for $\bm{X}$ following $F$. This second principle is the transformation principle, which makes it possible to build transformation-based proper scoring rules that benefit from the interpretability arising from a transformation (here, the variogram transformation $\gamma_{i,j}$) and from the simplicity and interpretability of the proper scoring rule they rely on (here, the squared error).
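As an illustration of these two principles, the following sketch estimates the variogram score from an ensemble. It assumes that the forecast $F$ is represented by $m$ equally weighted members, so that the expectation is replaced by an ensemble average; the function name `variogram_score` and its arguments are illustrative choices and do not come from the article.

```python
from itertools import product


def variogram_score(ensemble, y, p=0.5, weights=None):
    """Ensemble estimate of the variogram score of order p.

    ensemble: list of m members, each a sequence of length d
    y:        observation, a sequence of length d
    weights:  optional d x d nested list of nonnegative weights (default: all 1)
    """
    d = len(y)
    m = len(ensemble)
    score = 0.0
    # Aggregation principle: weighted sum over all pairs of components (i, j).
    for i, j in product(range(d), repeat=2):
        w = 1.0 if weights is None else weights[i][j]
        # Transformation principle: apply gamma_ij(x) = |x_i - x_j|^p to every
        # member and to the observation, then use the squared error on the result.
        forecast_mean = sum(abs(x[i] - x[j]) ** p for x in ensemble) / m
        observed = abs(y[i] - y[j]) ** p
        score += w * (forecast_mean - observed) ** 2
    return score


if __name__ == "__main__":
    ens = [[0.0, 1.0], [0.0, 3.0]]
    biased = [[a + 5.0 for a in member] for member in ens]
    # A constant bias leaves every difference x_i - x_j unchanged,
    # so both calls return the same value.
    print(variogram_score(ens, [0.0, 0.0], p=1.0))
    print(variogram_score(biased, [0.0, 0.0], p=1.0))
```

The usage lines show numerically the limitation mentioned above: two forecasts that differ only by a constant bias receive the same variogram score.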

We review the univariate and multivariate proper scoring rules through the lens of interpretability and by mentioning their known benefits and limitations. We formalize these two principles of aggregation and transformation to construct interpretable proper scoring rules for multivariate forecasts. To illustrate the use of these principles, we provide examples of transformation-and-aggregation-based scoring rules from both the literature on probabilistic forecast verification and quantities of interest. We conduct a simulation study to empirically demonstrate how transformation-and-aggregation-based scoring rules can be used. Additionally, we show how the aggregation and transformation principles can help bridge the gap between the proper scoring rules framework and the spatial verification tools (Gilleland et al., 2009; Dorninger et al., 2018).

The remainder of this article is organized as follows. Section 2 gives a general review of verification methods for univariate and multivariate forecasts. Section 3 introduces the framework of proper scoring rules based on transformation and aggregation for multivariate forecasts. Section 4 provides examples of transformation-and-aggregation-based scoring rules, including examples from the literature. Then, Section 5 showcases through different simulation setups how the framework proposed in this article can help build interpretable proper scoring rules. Finally, Section 6 provides a summary as well as a discussion on the verification of multivariate forecasts. Throughout the article, we focus on spatial forecasts for simplicity. However, the points made remain valid for any multivariate forecasts, including temporal forecasts or spatio-temporal forecasts.

2 Overview of verification tools for univariate and multivariate forecasts

This section presents the zoology of available verification tools and briefly summarizes their benefits and limitations. First, we define scoring rules and their key properties. Then, we recall univariate scoring rules, starting with ones derived from scoring functions used in point forecasting. Finally, we provide an overview of verification tools for multivariate forecasts.

2.1 Calibration, sharpness, and propriety

Gneiting et al. (2007) proposed a paradigm for the evaluation of probabilistic forecasts: "maximizing the sharpness of the predictive distributions subject to calibration". Calibration is the statistical compatibility between the forecast and the observations. Sharpness is the concentration of the forecast and is a property of the forecast itself. In other words, the paradigm aims at minimizing the uncertainty of the forecast given that the forecast is statistically consistent with the observations. Tsyplakov (2011) states that the notion of calibration in the paradigm is too vague but it holds if the definition of calibration is refined. This principle for the evaluation of probabilistic forecasts has reached a consensus in the field of probabilistic forecasting (see, e.g., Gneiting and Katzfuss 2014; Thorarinsdottir and Schuhen 2018). The paradigm proposed in Gneiting et al. (2007) is not the first mention of the link between sharpness and calibration: for example, Murphy and Winkler (1987) mentioned the relation between refinement (i.e., sharpness) and calibration.

For univariate forecasts, multiple definitions of calibration are available depending on the setting. The most widely used definition is probabilistic calibration which, broadly speaking, consists of computing the rank of each observation among samples of the forecast and checking the uniformity of these ranks across observations. If the forecast is calibrated, observations should not be distinguishable from forecast samples, and thus, the distribution of their ranks should be uniform. Probabilistic calibration can be assessed by probability integral transform (PIT) histograms (Dawid, 1984) or rank histograms (Anderson, 1996; Talagrand et al., 1997) for ensemble forecasts when observations are stationary (i.e., their distribution is the same across time). The shape of the PIT or rank histogram gives information about the type of (potential) miscalibration: a triangular-shaped histogram suggests that the probabilistic forecast has a systematic bias, a $\cup$-shaped histogram suggests that the probabilistic forecast is under-dispersed and a $\cap$-shaped histogram suggests that the probabilistic forecast is over-dispersed. Moreover, probabilistic calibration implies that rank histograms should be uniform but uniformity is not sufficient. For example, rank histograms should also be uniform conditionally on different forecast scenarios (e.g., conditionally on the value of the observations available when the forecast is issued). Additionally, under certain hypotheses, calibration tools have been developed to consider real-world limitations such as serial dependence (Bröcker and Ben Bouallègue, 2020). Statistical tests have been developed to check the uniformity of rank histograms (Jolliffe and Primo, 2008). Readers interested in a more in-depth understanding of univariate forecast calibration are encouraged to consult Tsyplakov (2013, 2020).
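The rank histogram described above can be sketched in a few lines. The example below is a minimal illustration, assuming Gaussian observations and ensembles drawn from a centered Gaussian forecast; the function names, the random tie-breaking and the simulation settings are assumptions made for this sketch only.

```python
import random


def rank_of_observation(ensemble, y, rng=random):
    """Rank of y among the m ensemble members (from 1 to m + 1), ties broken at random."""
    below = sum(1 for x in ensemble if x < y)
    ties = sum(1 for x in ensemble if x == y)
    return 1 + below + rng.randint(0, ties)


def rank_histogram(ensembles, observations, rng=random):
    """Counts of observation ranks over all forecast cases (m + 1 bins)."""
    counts = [0] * (len(ensembles[0]) + 1)
    for ensemble, y in zip(ensembles, observations):
        counts[rank_of_observation(ensemble, y, rng) - 1] += 1
    return counts


def simulate_rank_histogram(forecast_sd, n_cases=2000, n_members=4, seed=0):
    """Observations follow N(0, 1); members follow N(0, forecast_sd^2)."""
    rng = random.Random(seed)
    ensembles = [[rng.gauss(0.0, forecast_sd) for _ in range(n_members)]
                 for _ in range(n_cases)]
    observations = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
    return rank_histogram(ensembles, observations, rng)


if __name__ == "__main__":
    print("calibrated     :", simulate_rank_histogram(forecast_sd=1.0))
    print("under-dispersed:", simulate_rank_histogram(forecast_sd=0.5))
    print("over-dispersed :", simulate_rank_histogram(forecast_sd=2.0))
```

With these settings, the under-dispersed forecast should put most of the counts in the two extreme bins, and the over-dispersed forecast should put most of them in the central bins.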

For multivariate forecasts, a popular approach relies on a similar principle: first, multivariate forecast samples are transformed into univariate quantities using so-called pre-rank functions and then the calibration is assessed by techniques used in the univariate case (see, e.g., Gneiting et al. 2008). Pre-rank functions may be interpretable and allow targeting the calibration of specific aspects of the forecast such as the dependence structure. Readers interested in the calibration of multivariate forecasts can refer to Allen et al. (2024) for a comprehensive review of multivariate calibration.

A scoring rule $\mathrm{S}$ assigns a real-valued quantity $\mathrm{S}(F,\bm{y})$ to a forecast-observation pair $(F,\bm{y})$, where $F\in\mathcal{F}$ is a probabilistic forecast and $\bm{y}\in\mathbb{R}^d$ is an observation. In the negative-oriented convention, a scoring rule $\mathrm{S}$ is proper relative to the class $\mathcal{F}$ if

\[
\mathbb{E}_G[\mathrm{S}(G,\bm{Y})]\leq\mathbb{E}_G[\mathrm{S}(F,\bm{Y})] \tag{1}
\]

for all $F,G\in\mathcal{F}$, where $\mathbb{E}_G[\cdots]$ is the expectation with respect to $\bm{Y}\sim G$. In simple terms, a scoring rule is proper relative to a class of distributions if its expected value is minimal when the true distribution is predicted, for any distribution within the class. Forecasts minimizing the expected scoring rule are said to be efficient and the other forecasts are said to be sub-efficient. Moreover, the scoring rule $\mathrm{S}$ is strictly proper relative to the class $\mathcal{F}$ if the equality in (1) holds if and only if $F=G$. This ensures the characterization of the ideal forecast (i.e., there is a unique efficient forecast and it is the true distribution). Moreover, proper scoring rules are powerful tools as they allow the assessment of calibration and sharpness simultaneously (Winkler, 1977; Winkler et al., 1996). Sharpness can be assessed individually using the entropy associated with proper scoring rules, defined by $e_{\mathrm{S}}(F)=\mathbb{E}_F[\mathrm{S}(F,\bm{Y})]$. The sharper the forecast, the smaller its entropy. Strictly proper scoring rules can also be used to infer the parameters of a parametric probabilistic forecast (see, e.g., Gneiting et al. 2005; Pacchiardi et al. 2024).
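Inequality (1) and the notion of efficient forecast can be checked numerically on a toy example. The sketch below uses the squared error from the introduction and assumes univariate distributions with finite support, stored as dictionaries mapping values to probabilities; this representation and the names `expected_score`, `G`, `F_same_mean` and `F_biased` are choices made for the illustration.

```python
def mean(dist):
    """Mean of a finitely supported distribution {value: probability}."""
    return sum(prob * value for value, prob in dist.items())


def squared_error(forecast, y):
    """SE(F, y) = (E_F[X] - y)^2."""
    return (mean(forecast) - y) ** 2


def expected_score(score, forecast, truth):
    """E_G[S(F, Y)] with Y distributed according to `truth` (the distribution G)."""
    return sum(prob * score(forecast, y) for y, prob in truth.items())


G = {0.0: 0.5, 2.0: 0.5}            # true distribution, mean 1
F_same_mean = {1.0: 1.0}            # different distribution with the same mean
F_biased = {0.0: 0.25, 2.0: 0.75}   # mean 1.5

entropy_G = expected_score(squared_error, G, G)  # e_S(G) = E_G[S(G, Y)]

if __name__ == "__main__":
    print(entropy_G)                                        # 1.0
    print(expected_score(squared_error, F_same_mean, G))    # 1.0
    print(expected_score(squared_error, F_biased, G))       # 1.25
```

The true distribution attains the minimal expected score, as required by (1), but so does `F_same_mean`: both are efficient forecasts, which illustrates a scoring rule that is proper without being strictly proper.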

Regardless of all the interesting properties of proper scoring rules, it is worth noting that they have some limitations. Proper scoring rules may have multiple efficient forecasts (i.e., associated with their minimal expected value) and, in the general setting, no guarantee is given on their relevance. Moreover, strict propriety ensures that the efficient forecast is unique and that it is the ideal forecast (i.e., the true distribution); however, no guarantee is available for forecasts within the vicinity of the minimum in the general case. This is particularly problematic since, in practice, the unavailability of the ideal distribution makes it impossible to know if the minimum expected score is achieved. In the case of calibrated forecasts, the expected scoring rule is the entropy of the forecast and the ranking of forecasts is thus linked to the information carried by the forecast (see Corollary 4, Holzmann and Eulert 2014 for the complete result). These limitations may explain the plurality of scoring rules depending on application fields.

2.2 Univariate scoring rules

We recall classical univariate scoring rules to explain key concepts. Some univariate scoring rules will be useful for the multivariate scoring rules construction framework proposed in Section 3. Let $\mathcal{P}(E)$ denote the class of Borel probability measures on $E$. We consider $F\in\mathcal{F}\subseteq\mathcal{P}(\mathbb{R})$ a probabilistic forecast in the form of its cumulative distribution function (cdf) and $y\in\mathbb{R}$ an observation. When the probabilistic forecast $F$ has a probability density function (pdf), it will be denoted $f$.

The simplest scoring rules can be derived from scoring functions used to assess point forecasts. The squared error (SE) is the most popular and is known through its averaged value (the mean squared error; MSE) or the square root of its average (the root mean squared error; RMSE) which has the advantage of being expressed in the same units as the observations. As a scoring rule, the SE is expressed as

\[
\mathrm{SE}(F,y)=(\mu_F-y)^2, \tag{2}
\]

where $\mu_F$ denotes the mean of the predicted distribution $F$. The SE solely discriminates the mean of the forecast (see Appendix A); efficient forecasts for SE are the ones matching the mean of the true distribution. The SE is proper relative to $\mathcal{P}_2(\mathbb{R})$, the class of Borel probability measures on $\mathbb{R}$ with a finite second moment (i.e., finite variance). Note that the SE cannot be strictly proper as the equality of mean does not imply the equality of distributions.
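These statements follow from a short bias-variance computation, sketched here for $Y\sim G$ with $G\in\mathcal{P}_2(\mathbb{R})$, where $\mu_G$ and $\sigma_G^2$ denote the mean and variance of $G$ (notation introduced for this sketch):

\[
\mathbb{E}_G[\mathrm{SE}(F,Y)]=\mathbb{E}_G\left[(\mu_F-\mu_G+\mu_G-Y)^2\right]=(\mu_F-\mu_G)^2+2(\mu_F-\mu_G)\,\mathbb{E}_G[\mu_G-Y]+\sigma_G^2=(\mu_F-\mu_G)^2+\sigma_G^2.
\]

The cross term vanishes because $\mathbb{E}_G[Y]=\mu_G$. The expected score therefore depends on $F$ only through $\mu_F$ and reaches its minimum $\sigma_G^2$ exactly when $\mu_F=\mu_G$, whatever the other features of $F$.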

Another well-known scoring rule is the absolute error (AE) defined by

\[
\mathrm{AE}(F,y)=|\mathrm{med}(F)-y|, \tag{3}
\]

where $\mathrm{med}(F)$ is the median of the predicted distribution $F$. The mean absolute error (MAE), the average of the absolute error, is the most commonly encountered form of the AE and it is also expressed in the same units as the observations. Efficient forecasts are forecasts that have a median equal to the median of the true distribution. The AE is proper relative to $\mathcal{P}_1(\mathbb{R})$ but not strictly proper. Similarly, the quantile score (QS), also known as the pinball loss, is a scoring rule focusing on quantiles of level $\alpha$ defined by

\[
\mathrm{QS}_\alpha(F,y)=(\mathds{1}_{y\leq F^{-1}(\alpha)}-\alpha)(F^{-1}(\alpha)-y), \tag{4}
\]

where $0<\alpha<1$ is a probability level and $F^{-1}(\alpha)$ is the predicted quantile of level $\alpha$. The case $\alpha=0.5$ corresponds to the AE up to a factor $2$. The QS of level $\alpha$ is proper relative to $\mathcal{P}_1(\mathbb{R})$ but not strictly proper since efficient forecasts are ones correctly predicting the quantile of level $\alpha$ (see, e.g., Friederichs and Hense 2008).
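The quantile score (4) and its link with the absolute error (3) can be sketched for an ensemble forecast. The example assumes that $F$ is the empirical distribution of the members and that $F^{-1}(\alpha)$ is the generalized inverse $\inf\{x : F(x)\geq\alpha\}$, so that the median is the lower empirical median; other quantile conventions are possible, and the function names are illustrative.

```python
import math


def empirical_quantile(sample, alpha):
    """Generalized inverse of the empirical cdf: inf{x : F(x) >= alpha}."""
    xs = sorted(sample)
    k = max(math.ceil(alpha * len(xs)), 1)
    return xs[k - 1]


def quantile_score(sample, y, alpha):
    """Pinball loss (1{y <= q} - alpha) * (q - y) with q the predicted quantile."""
    q = empirical_quantile(sample, alpha)
    indicator = 1.0 if y <= q else 0.0
    return (indicator - alpha) * (q - y)


def absolute_error(sample, y):
    """|med(F) - y| with the median taken as the quantile of level 0.5."""
    return abs(empirical_quantile(sample, 0.5) - y)


if __name__ == "__main__":
    members = [4.0, 1.0, 3.0, 2.0]
    print(quantile_score(members, 5.0, 0.5))   # 1.5
    print(absolute_error(members, 5.0))        # 3.0, i.e. twice the QS of level 0.5
```

An observation above the predicted quantile is penalized with weight $\alpha$ and an observation below it with weight $1-\alpha$, which is why high levels tolerate over-prediction more than under-prediction.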

Another summary statistic of interest is the exceedance of a threshold $t\in\mathbb{R}$. The Brier score (BS; Brier 1950) was initially introduced for binary predictions but it can also be used to discriminate forecasts based on the exceedance of a threshold $t$. For probabilistic forecasts, the BS is defined as

\[
\mathrm{BS}_t(F,y)=((1-F(t))-\mathds{1}_{y>t})^2=(F(t)-\mathds{1}_{y\leq t})^2, \tag{5}
\]

where $1-F(t)$ is the predicted probability that the threshold $t$ is exceeded. The BS is proper relative to $\mathcal{P}(\mathbb{R})$ but not strictly proper. Binary events (e.g., exceedance of thresholds) are relevant in weather forecasting as they are used, for example, in operational settings for decision-making.
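For an ensemble forecast, the Brier score (5) only requires the fraction of members exceeding the threshold. The sketch below assumes equally weighted members and estimates $1-F(t)$ by that fraction; the function names are illustrative.

```python
def brier_score(sample, y, t):
    """BS_t(F, y) written with the exceedance probability 1 - F(t)."""
    exceedance_prob = sum(1 for x in sample if x > t) / len(sample)
    exceeded = 1.0 if y > t else 0.0
    return (exceedance_prob - exceeded) ** 2


def brier_score_cdf_form(sample, y, t):
    """Equivalent form (F(t) - 1{y <= t})^2 based on the empirical cdf."""
    cdf_at_t = sum(1 for x in sample if x <= t) / len(sample)
    not_exceeded = 1.0 if y <= t else 0.0
    return (cdf_at_t - not_exceeded) ** 2


if __name__ == "__main__":
    members = [0.0, 1.0, 2.0, 3.0]
    # Three members out of four stay below t = 2.5, and the observation exceeds it.
    print(brier_score(members, y=3.0, t=2.5))            # 0.5625
    print(brier_score_cdf_form(members, y=3.0, t=2.5))   # 0.5625
```

Both forms agree because the two terms inside the square only change sign when exceedance is replaced by non-exceedance.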

All the scoring rules presented above are proper but not strictly proper since they discriminate forecasts only through specific summary statistics instead of the whole distribution. Nonetheless, they are still used as they allow forecasters to verify specific characteristics of the forecast: the mean, the median, the quantile of level $\alpha$ or the exceedance of a threshold $t$. The simplicity of these scoring rules makes them interpretable, thus making them essential verification tools.

Some univariate scoring rules contain a summary statistic: for example, the formulas of the QS (4) and the BS (5) contain the quantile of level $\alpha$ and the exceedance of a threshold $t$, respectively. They can be seen as a scoring function applied to a summary statistic. This duality can be understood through the link between scoring functions and scoring rules through consistent functionals as presented in Gneiting (2011) or Section 2.2 in Lerch et al. (2017).

Other summary statistics can be of interest depending on applications. Nonetheless, it is worth noting that misspecifications of numerous summary statistics cannot be discriminated because of their non-elicitability. Non-elicitability of a transformation implies that no proper scoring rule can be constructed such that efficient forecasts are forecasts where the transformation is equal to the one of the true distribution. For example, the variance is known to be non-elicitable; however, it is jointly elicitable with the mean (see, e.g., Brehmer 2017). Readers interested in details regarding elicitable, non-elicitable and jointly elicitable transformations may refer to Gneiting (2011), Brehmer and Strokorb (2019) and references therein.

A strictly proper scoring rule should discriminate the whole distribution and not only specific summary statistics. The continuous ranked probability score (CRPS; Matheson and Winkler 1976) is the most popular univariate scoring rule in weather forecasting applications and can be expressed by the following expressions

\begin{align}
\mathrm{CRPS}(F,y) &= \mathbb{E}_F|X-y|-\frac{1}{2}\mathbb{E}_F|X-X'|, \tag{6}\\
&= \int_{\mathbb{R}}\mathrm{BS}_z(F,y)\,\mathrm{d}z, \tag{7}\\
&= 2\int_0^1\mathrm{QS}_\alpha(F,y)\,\mathrm{d}\alpha, \tag{8}
\end{align}

where $y\in\mathbb{R}$ and $X$ and $X'$ are independent random variables following $F$, with a finite first moment. Equations (7) and (8) show that the CRPS is linked with the BS and the QS. Broadly speaking, as the QS discriminates a quantile associated with a specific level, integrating the QS across all levels discriminates the quantile function that fully characterizes univariate distributions. Similarly, integrating the BS across all thresholds discriminates the cumulative distribution function that also fully characterizes univariate distributions. The CRPS is a strictly proper scoring rule relative to $\mathcal{P}_1(\mathbb{R})$, the class of Borel probability measures on $\mathbb{R}$ with a finite first moment. In addition, Equation (6) indicates that the CRPS values have the same units as observations. In the case of deterministic forecasts, the CRPS reduces to the absolute error, in its scoring function form (Hersbach, 2000). The use of the CRPS for ensemble forecasts is straightforward using expectations as in (6). Ferro et al. (2008) and Zamo and Naveau (2017) studied estimators of the CRPS for ensemble forecasts.
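The equivalence between (6) and (7) can be verified numerically for an ensemble. The sketch below assumes that $F$ is the empirical distribution of equally weighted members (a plug-in estimator, only one of the estimators studied in the references above) and integrates the Brier score exactly, since the integrand is piecewise constant; the function names are illustrative.

```python
def crps_ensemble(sample, y):
    """Form (6): E|X - y| - 0.5 * E|X - X'| under the empirical distribution."""
    m = len(sample)
    term_obs = sum(abs(x - y) for x in sample) / m
    term_spread = sum(abs(x - xp) for x in sample for xp in sample) / (m * m)
    return term_obs - 0.5 * term_spread


def crps_threshold_integral(sample, y):
    """Form (7): integral over z of (F(z) - 1{y <= z})^2 for the empirical cdf F."""
    m = len(sample)
    points = sorted(set(sample) | {y})
    total = 0.0
    for left, right in zip(points, points[1:]):
        # On [left, right) both the empirical cdf and the indicator are constant.
        cdf = sum(1 for x in sample if x <= left) / m
        step = 1.0 if y <= left else 0.0
        total += (cdf - step) ** 2 * (right - left)
    # Outside [points[0], points[-1]] the integrand is zero.
    return total


if __name__ == "__main__":
    members = [1.0, 2.0, 3.0]
    print(crps_ensemble(members, 2.0))             # approximately 2/9
    print(crps_threshold_integral(members, 2.0))   # approximately 2/9
    print(crps_ensemble([3.0], 1.0))               # 2.0, the absolute error
```

The last line illustrates the deterministic case: with a single member the spread term vanishes and the CRPS reduces to the absolute error.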

In addition to scoring rules based on scoring functions, some scoring rules use the moments of the probabilistic forecast $F$. The SE (2) depends on the forecast only through its mean $\mu_F$. The Dawid-Sebastiani score (DSS; Dawid and Sebastiani 1999) is a scoring rule depending on the forecast $F$ only through its first two central moments. The DSS is expressed as

\[
\mathrm{DSS}(F,y)=2\log(\sigma_F)+\frac{(\mu_F-y)^2}{{\sigma_F}^2}, \tag{9}
\]

where $\mu_F$ and ${\sigma_F}^2$ are the mean and the variance of the distribution $F$. The DSS is proper relative to $\mathcal{P}_2(\mathbb{R})$ but not strictly proper, since efficient forecasts only need to correctly predict the first two central moments (see Appendix A). Dawid and Sebastiani (1999) proposed a more general class of proper scoring rules but the DSS, as defined in (9), can be seen as a special case of the logarithmic score, introduced further down: for a Gaussian predictive distribution, the two scores coincide up to a positive scaling factor and an additive constant.
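For an ensemble forecast, the DSS (9) can be computed from the ensemble moments. The sketch below assumes that $\mu_F$ and ${\sigma_F}^2$ are estimated by the plug-in mean and variance of the members and that the ensemble is not degenerate (nonzero variance); the function name is illustrative.

```python
import math
from statistics import fmean, pvariance


def dawid_sebastiani_score(sample, y):
    """DSS(F, y) = 2 log(sigma_F) + (mu_F - y)^2 / sigma_F^2 with plug-in moments."""
    mu = fmean(sample)
    var = pvariance(sample, mu)   # plug-in (population) variance, assumed > 0
    return math.log(var) + (mu - y) ** 2 / var   # log(var) equals 2 * log(sigma)


if __name__ == "__main__":
    two_point = [-1.0, 1.0]
    four_point = [-math.sqrt(2.0), 0.0, 0.0, math.sqrt(2.0)]
    # Both ensembles have mean 0 and variance 1, hence (numerically) the same DSS.
    print(dawid_sebastiani_score(two_point, 2.0))
    print(dawid_sebastiani_score(four_point, 2.0))
```

The two ensembles have different shapes but identical first two moments, so the DSS cannot tell them apart, in line with the score being proper but not strictly proper.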

Another scoring rule relying on the central moments of the probabilistic forecast $F$ up to order three is the error-spread score (ESS; Christensen et al. 2014). The ESS is defined as

\[
\mathrm{ESS}(F,y)=({\sigma_F}^2-(\mu_F-y)^2-(\mu_F-y)\sigma_F\gamma_F)^2, \tag{10}
\]

where $\mu_{F}$, $\sigma_{F}^{2}$ and $\gamma_{F}$ are the mean, the variance and the skewness of the probabilistic forecast $F$. The ESS is proper relative to $\mathcal{P}_{4}(\mathbb{R})$. As for the other scoring rules based only on moments of the forecast presented above, the expected ESS compares the probabilistic forecast $F$ with the true distribution only via their first four moments (see Appendix A). Scoring rules based on central moments of higher order could be built following the process described in Christensen et al. (2014). Such scoring rules would benefit from the interpretability induced by their construction and from the ease with which they can be applied to ensemble forecasts. However, they would also inherit the limitation of being proper but not strictly proper.
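The moment-based scores (9) and (10) can be sketched for an ensemble forecast as follows. This is a minimal illustration, not a reference implementation: the helper names (`ensemble_moments`, `dss`, `ess`) are hypothetical, and the use of the population (`ddof=0`) moment estimators is an assumption made for simplicity.

```python
import numpy as np

def ensemble_moments(members):
    """Empirical mean, standard deviation and skewness of a 1-D ensemble."""
    x = np.asarray(members, dtype=float)
    mu = x.mean()
    sigma = x.std()  # population estimator (ddof=0), an illustrative choice
    gamma = np.mean((x - mu) ** 3) / sigma ** 3
    return mu, sigma, gamma

def dss(mu, sigma, y):
    """Dawid-Sebastiani score (9): depends on F only through mean and variance."""
    return 2.0 * np.log(sigma) + (mu - y) ** 2 / sigma ** 2

def ess(mu, sigma, gamma, y):
    """Error-spread score (10): depends on F through mean, variance and skewness."""
    return (sigma ** 2 - (mu - y) ** 2 - (mu - y) * sigma * gamma) ** 2

# Toy symmetric ensemble and an observation at its center.
members = [-1.0, 0.0, 1.0]
mu, sigma, gamma = ensemble_moments(members)
print("DSS:", dss(mu, sigma, 0.0))
print("ESS:", ess(mu, sigma, gamma, 0.0))
```

Because both functions take only moments as input, any two forecasts sharing these moments receive the same score, which is the reason these rules are proper without being strictly proper.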

When the probabilistic forecast $F$ has a pdf $f$, scoring rules of a different type can be defined. Let $\mathcal{L}_{\alpha}(\mathbb{R})$ denote the class of probability measures on $\mathbb{R}$ that are absolutely continuous with respect to $\mu$ (usually taken as the Lebesgue measure) and have $\mu$-density $f$ such that

$$\lVert f\rVert_{\alpha}=\left(\int_{\mathbb{R}}f(x)^{\alpha}\,\mu(\mathrm{d}x)\right)^{1/\alpha}<\infty.$$

The most popular scoring rule based on the pdf is the logarithmic score (also known as ignorance score; Good 1952; Roulston and Smith 2002). The logarithmic score is defined as

$$\mathrm{LogS}(F,y)=-\log(f(y)), \tag{11}$$

for $y$ such that $f(y)>0$. In its formulation, the logarithmic score differs from the scoring rules seen previously. Good (1952) proposed the logarithmic score with its link to information theory in mind: its entropy is the Shannon entropy (Shannon, 1948) and its expectation is related to the Kullback-Leibler divergence (Kullback and Leibler, 1951) (see Appendix A). The logarithmic score is strictly proper relative to the class $\mathcal{L}_{1}(\mathbb{R})$. Moreover, inference via minimization of the expected logarithmic score is equivalent to maximum likelihood estimation (see, e.g., Dawid et al. 2015). The logarithmic score belongs to the family of local scoring rules, which are scoring rules depending only on $y$, $f(y)$ and the derivatives of $f$ at $y$ up to a finite order. Another local scoring rule is the Hyvärinen score (also known as the gradient scoring rule; Hyvärinen 2005), defined as

$$\mathrm{HS}(F,y)=2\frac{f''(y)}{f(y)}-\frac{f'(y)^{2}}{f(y)^{2}},$$

for $y$ such that $f(y)>0$. The Hyvärinen score is proper relative to the subclass of $\mathcal{P}(\mathbb{R})$ such that the density $f$ exists, is twice continuously differentiable and satisfies $f'(x)/f(x)\to 0$ as $|x|\to\infty$. It is worth noting that the Hyvärinen score can be computed even if $f$ is only known up to a scale factor (e.g., up to a normalizing constant). This property allows circumventing the use of Monte Carlo methods or approximations of the normalizing constant when it is unavailable or hard to compute. This is a property of local proper scoring rules except for the logarithmic score (Parry et al., 2012). Readers eager to learn more about local proper scoring rules may refer to Parry et al. (2012) and Ehm and Gneiting (2012).
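The invariance of the Hyvärinen score to the normalizing constant can be checked numerically. The sketch below is only illustrative: it assumes a Gaussian forecast, uses hypothetical helper names, and approximates derivatives by central finite differences. It relies on the identity $\mathrm{HS}(F,y)=2(\log f)''(y)+\left((\log f)'(y)\right)^{2}$, which is algebraically equivalent to the expression above.

```python
import math

def log_score(log_f, y):
    """Logarithmic score (11), given the log-density as a callable."""
    return -log_f(y)

def hyvarinen_score(log_f, y, h=1e-3):
    """Hyvarinen score via central finite differences of the log-density."""
    d1 = (log_f(y + h) - log_f(y - h)) / (2.0 * h)
    d2 = (log_f(y + h) - 2.0 * log_f(y) + log_f(y - h)) / h ** 2
    return 2.0 * d2 + d1 ** 2

mu, sigma = 1.0, 2.0  # illustrative Gaussian forecast

def log_f_normalized(x):
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def log_f_unnormalized(x):
    # Same density with the normalizing constant dropped.
    return -(x - mu) ** 2 / (2.0 * sigma ** 2)

y = 0.5
hs_norm = hyvarinen_score(log_f_normalized, y)
hs_unnorm = hyvarinen_score(log_f_unnormalized, y)
ls_gap = log_score(log_f_normalized, y) - log_score(log_f_unnormalized, y)
print("HS (normalized):  ", hs_norm)
print("HS (unnormalized):", hs_unnorm)
print("LogS gap:", ls_gap)
```

The two Hyvärinen scores agree up to numerical error, whereas the logarithmic score changes by the logarithm of the dropped constant.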

The logarithmic score and the Hyvärinen score do not allow $f$ to be zero. To overcome this limitation, scoring rules expressed in terms of the $L_{\alpha}$-norm have been proposed. The quadratic score is defined as

$$\mathrm{QuadS}(F,y)=\lVert f\rVert_{2}^{2}-2f(y),$$

where $\lVert f\rVert_{2}^{2}=\int_{\mathbb{R}}f(y)^{2}\,\mathrm{d}y$. The quadratic score is strictly proper relative to the class $\mathcal{L}_{2}(\mathbb{R})$.

The pseudospherical score is defined as

$$\mathrm{PseudoS}(F,y)=-f(y)^{\alpha-1}/\lVert f\rVert_{\alpha}^{\alpha-1},$$

with $\alpha>1$. For $\alpha=2$, it reduces to the spherical score (see, e.g., Jose 2007). The pseudospherical score is strictly proper relative to the class $\mathcal{L}_{\alpha}(\mathbb{R})$. The four scoring rules presented above have been criticized as they do not encourage a high probability in the vicinity of the observation $y$ (Gneiting and Raftery, 2007). In particular, as the logarithmic score is more sensitive to outliers, probabilistic forecasts inferred by its minimization may be overdispersive (Gneiting et al., 2005). Moreover, the pdf is not always available, for example in the case of ensemble forecasts.
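A minimal numerical sketch of the quadratic and pseudospherical scores is given below. It assumes the density is available as a vectorized callable and approximates the $L_{\alpha}$-norm by a Riemann sum on a uniform grid. The function names and the grid are illustrative choices, not part of the original definitions.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def lalpha_norm(pdf, alpha, grid):
    """Riemann-sum approximation of the L_alpha norm of pdf on a uniform grid."""
    dx = grid[1] - grid[0]
    return (np.sum(pdf(grid) ** alpha) * dx) ** (1.0 / alpha)

def quadratic_score(pdf, y, grid):
    return float(lalpha_norm(pdf, 2.0, grid) ** 2 - 2.0 * pdf(y))

def pseudospherical_score(pdf, y, grid, alpha=2.0):
    return float(-pdf(y) ** (alpha - 1.0) / lalpha_norm(pdf, alpha, grid) ** (alpha - 1.0))

grid = np.linspace(-10.0, 10.0, 20001)

def std_normal(x):
    return gaussian_pdf(x, 0.0, 1.0)

def uniform_pdf(x):
    # Density of the uniform distribution on [0, 1]; it vanishes outside.
    return np.where((x >= 0.0) & (x <= 1.0), 1.0, 0.0)

quad_normal = quadratic_score(std_normal, 0.0, grid)
spher_normal = pseudospherical_score(std_normal, 0.0, grid, alpha=2.0)
# The observation lies outside the support: f(y) = 0, yet the score is finite.
quad_uniform = quadratic_score(uniform_pdf, 2.0, grid)
print(quad_normal, spher_normal, quad_uniform)
```

The last call illustrates the motivation stated above: when $f(y)=0$, the quadratic score remains finite, whereas the logarithmic score is undefined.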

Readers may refer to the various reviews of scoring rules available (see, e.g., Bröcker and Smith 2007; Gneiting and Raftery 2007; Gneiting and Katzfuss 2014; Thorarinsdottir and Schuhen 2018; Alexander et al. 2022). Formulas of the expected scoring rules presented are available in Appendix A.

Strictly proper scoring rules can be seen as more powerful than proper scoring rules. This is theoretically true when the interest is in identifying the ideal forecast (i.e., the true distribution). In practice, however, scoring rules are also used to rank probabilistic forecasts, and with that in mind, a given ranking of forecasts in terms of the expectation of a strictly proper scoring rule (such as the CRPS) is harder to interpret than a ranking in terms of the expectation of a proper but more interpretable scoring rule (such as the SE). The SE is known to discriminate the mean, and thus a better rank in terms of expected SE implies a better prediction of the mean of the true distribution. Conversely, a better rank in terms of CRPS implies a better prediction of the whole distribution, but this may not be useful on its own, and other verification tools are needed to understand what caused this ranking. When forecasts are not calibrated, there seems to be a trade-off between interpretability and discriminatory power, and this becomes more prominent in a multivariate setting. However, simpler interpretable tools and tools with high discriminatory power can be used complementarily. The framework proposed in Section 3 aims to help the construction of interpretable proper scoring rules.

2.3 Multivariate scoring rules

In a multivariate setting, forecasters cannot rely solely on univariate scoring rules, as these cannot discriminate forecasts beyond their $1$-dimensional marginals; in particular, they cannot discriminate the dependence structure between the univariate margins. Multivariate forecasts arise in different setups: spatial forecasts, temporal forecasts, multivariable forecasts or any combination of these categories (e.g., spatio-temporal forecasts of multiple variables). Considering weather forecasting, a spatial forecast could aim at predicting temperatures across multiple locations. A temporal forecast could be focused on predicting rainfall at multiple lead times at a given location. A multivariable forecast could predict both eastward and northward components of the wind. In the following, we consider a multivariate probabilistic forecast $F\in\mathcal{F}\subseteq\mathcal{P}(\mathbb{R}^{d})$ and an observation $\bm{y}\in\mathbb{R}^{d}$.

Even if there is no natural ordering in the multivariate case, the notions of median and quantile can be adapted using level sets, and then scoring rules using these quantities can be constructed (see, e.g., Meng et al. 2023). Nonetheless, as the mean is well-defined, the squared error (SE) can be defined in the multivariate setting:

$$\mathrm{SE}(F,\bm{y})=\lVert\bm{\mu}_{F}-\bm{y}\rVert_{2}^{2}, \tag{12}$$

where $\bm{\mu}_{F}$ is the mean vector of the distribution $F$. Similar to the univariate case, the SE is proper relative to $\mathcal{P}_{2}(\mathbb{R}^{d})$. Moments are well-defined in the multivariate case, allowing the multivariate version of the Dawid-Sebastiani score to be defined. The Dawid-Sebastiani score (DSS) was proposed in Dawid and Sebastiani (1999) as

$$\mathrm{DSS}(F,\bm{y})=\log(\det\Sigma_{F})+(\bm{\mu}_{F}-\bm{y})^{T}\Sigma_{F}^{-1}(\bm{\mu}_{F}-\bm{y}),$$

where $\bm{\mu}_{F}$ and $\Sigma_{F}$ are the mean vector and the covariance matrix of the distribution $F$. The DSS is proper relative to $\mathcal{P}_{2}(\mathbb{R}^{d})$ and it becomes strictly proper relative to any convex class of probability measures characterized by their first two moments (Gneiting and Raftery, 2007). The second term in the DSS is the squared Mahalanobis distance between $\bm{y}$ and $\bm{\mu}_{F}$.
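The multivariate SE and DSS can be sketched from the forecast mean vector and covariance matrix. The helper names below are hypothetical. Estimating the moments from an ensemble with `numpy.cov` is an assumption, and it requires more members than dimensions for the covariance matrix to be invertible.

```python
import numpy as np

def squared_error(mu, y):
    """Multivariate squared error (12)."""
    diff = np.asarray(mu, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ diff)

def dawid_sebastiani(mu, cov, y):
    """Multivariate DSS: log-determinant plus squared Mahalanobis distance."""
    diff = np.asarray(mu, dtype=float) - np.asarray(y, dtype=float)
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")
    # Solve a linear system instead of forming the explicit inverse.
    return float(logdet + diff @ np.linalg.solve(cov, diff))

def ensemble_mean_cov(members):
    """Empirical mean vector and covariance matrix of an (M, d) ensemble."""
    x = np.asarray(members, dtype=float)
    return x.mean(axis=0), np.cov(x, rowvar=False)

mu = np.zeros(2)
y = np.array([1.0, 2.0])
print(squared_error(mu, y))
print(dawid_sebastiani(mu, np.eye(2), y))
print(dawid_sebastiani(mu, np.diag([4.0, 1.0]), y))
```

With an identity covariance matrix, the DSS reduces to the SE, since the log-determinant vanishes and the Mahalanobis distance becomes the Euclidean distance.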

To define a strictly proper scoring rule for multivariate forecasts, Gneiting and Raftery (2007) proposed the energy score (ES) as a generalization of the CRPS to the multivariate case. The ES is defined by

$$\mathrm{ES}_{\alpha}(F,\bm{y})=\mathbb{E}_{F}\lVert\bm{X}-\bm{y}\rVert_{2}^{\alpha}-\frac{1}{2}\mathbb{E}_{F}\lVert\bm{X}-\bm{X}'\rVert_{2}^{\alpha}, \tag{13}$$

where $\bm{X}$ and $\bm{X}'$ are independent random vectors following $F$, $\alpha\in(0,2)$ and $F\in\mathcal{P}_{\alpha}(\mathbb{R}^{d})$, the class of Borel probability measures on $\mathbb{R}^{d}$ such that the moment of order $\alpha$ is finite. The definition of the ES is related to the kernel form of the CRPS (6), to which the ES reduces for $d=1$ and $\alpha=1$. As pointed out in Gneiting and Raftery (2007), in the limiting case $\alpha=2$, the ES becomes the SE (12). The ES is strictly proper relative to $\mathcal{P}_{\alpha}(\mathbb{R}^{d})$ (Székely, 2003; Gneiting and Raftery, 2007) and is suited for ensemble forecasts (Gneiting et al., 2008). Moreover, the parameter $\alpha$ gives some flexibility: a small value of $\alpha$ can be chosen and still lead to a strictly proper scoring rule, for example, when higher-order moments are ill-defined. The discrimination ability of the ES has been examined in numerous studies (see, e.g., Pinson and Girard 2012; Pinson and Tastu 2013; Scheuerer and Hamill 2015). Pinson and Girard (2012) studied the ability of the ES to discriminate among rival sets of scenarios (i.e., forecasts) of wind power generation. In the case of bivariate Gaussian processes, Pinson and Tastu (2013) illustrated that the ES appears to be more sensitive to misspecifications of the mean than to misspecifications of the variance or dependence structure.
The lack of sensitivity to misspecifications of the dependence structure has been confirmed in Scheuerer and Hamill (2015) using multivariate Gaussian random vectors of higher dimension. Moreover, the discriminatory power of the ES deteriorates in higher dimensions (Pinson and Tastu, 2013).
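For an ensemble forecast, the expectations in (13) can be replaced by empirical averages over the members. The following sketch assumes an ensemble stored as an array of shape `(M, d)` and uses the plug-in estimator that averages over all pairs of members, including identical ones. The function name is illustrative.

```python
import numpy as np

def energy_score(members, y, alpha=1.0):
    """Plug-in ensemble estimator of the energy score (13)."""
    x = np.asarray(members, dtype=float)   # shape (M, d)
    y = np.asarray(y, dtype=float)         # shape (d,)
    # First term: mean distance between members and the observation.
    term_obs = np.mean(np.linalg.norm(x - y, axis=1) ** alpha)
    # Second term: half the mean distance over all pairs of members.
    pairwise = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    term_spread = 0.5 * np.mean(pairwise ** alpha)
    return float(term_obs - term_spread)

# Two members in dimension 2.
es_2d = energy_score([[0.0, 0.0], [3.0, 4.0]], [0.0, 0.0])
# For d = 1 and alpha = 1 this is the kernel form of the CRPS.
es_1d = energy_score([[0.0], [2.0]], [1.0])
print(es_2d, es_1d)
```

The pairwise term costs $O(M^{2}d)$ operations and memory, which is usually acceptable for typical ensemble sizes.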

To overcome the discriminatory limitation of the ES, Scheuerer and Hamill (2015) proposed the variogram score (VS), a score targeting the verification of the dependence structure. The VS of order $p$ is defined as

$$\mathrm{VS}_{p}(F,\bm{y})=\sum_{i,j=1}^{d}w_{ij}\left(\mathbb{E}_{F}\left[|X_{i}-X_{j}|^{p}\right]-|y_{i}-y_{j}|^{p}\right)^{2}, \tag{14}$$

where $X_{i}$ is the $i$-th component of the random vector $\bm{X}$ following $F$, $w_{ij}$ are nonnegative weights and $p>0$. The variogram score capitalizes on the variogram, used in spatial statistics to assess the dependence structure. The VS cannot detect a bias that is equal across all components. The VS of order $p$ is proper relative to the class of Borel probability measures on $\mathbb{R}^{d}$ such that the $2p$-th moments of all univariate margins are finite. The weights $w_{ij}$ can be selected to emphasize or downweight certain pair interactions. For example, in a spatial context, the dependence between pairs can be expected to decay with distance: choosing weights proportional to the inverse of the distance between locations can increase the signal-to-noise ratio and improve the discriminatory power of the VS (Scheuerer and Hamill, 2015).
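The variogram score (14) can be estimated from an ensemble by replacing the expectation with an average over members. The sketch below assumes an `(M, d)` ensemble array and, when no weights are passed, unit weights. The function name and the default `p=0.5` are illustrative choices.

```python
import numpy as np

def variogram_score(members, y, p=0.5, weights=None):
    """Plug-in ensemble estimator of the variogram score (14)."""
    x = np.asarray(members, dtype=float)   # shape (M, d)
    y = np.asarray(y, dtype=float)         # shape (d,)
    d = y.size
    if weights is None:
        weights = np.ones((d, d))          # unit weights, an illustrative default
    # Ensemble estimate of E|X_i - X_j|^p for every pair (i, j).
    ens_vario = np.mean(np.abs(x[:, :, None] - x[:, None, :]) ** p, axis=0)
    obs_vario = np.abs(y[:, None] - y[None, :]) ** p
    return float(np.sum(weights * (ens_vario - obs_vario) ** 2))

members = np.array([[0.0, 1.0], [0.0, 3.0]])
y = np.array([0.0, 0.0])
vs = variogram_score(members, y, p=1.0)
# Adding the same bias to every component leaves the score unchanged.
vs_biased = variogram_score(members + 5.0, y, p=1.0)
print(vs, vs_biased)
```

The second call illustrates the limitation mentioned above: since the score only involves differences between components, a bias shared by all components goes undetected.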

When the pdf $f$ of the probabilistic forecast $F$ is available, multivariate versions of the univariate scoring rules based on the pdf are available. The multivariate versions of the scoring rules have the same properties and limitations as their univariate counterparts. The logarithmic score (11) has a natural multivariate version:

$$\mathrm{LogS}(F,\bm{y})=-\log(f(\bm{y})),$$

for $\bm{y}$ such that $f(\bm{y})>0$. The logarithmic score is strictly proper relative to the class $\mathcal{L}_{1}(\mathbb{R}^{d})$.

The Hyvärinen score (HS; Hyvärinen 2005) was initially proposed in its multivariate form

$$\mathrm{HS}(F,\bm{y})=2\Delta\log(f(\bm{y}))+\lvert\nabla\log(f(\bm{y}))\rvert^{2},$$

for $\bm{y}$ such that $f(\bm{y})>0$, where $\Delta$ is the Laplace operator (i.e., the sum of the second-order partial derivatives) and $\nabla$ is the gradient operator (i.e., the vector of the first-order partial derivatives). In the multivariate setting, the HS can also be computed if the predicted pdf is known up to a normalizing constant. The HS is proper relative to the subclass of $\mathcal{P}(\mathbb{R}^{d})$ such that the density $f$ exists, is twice continuously differentiable and satisfies $\lVert\nabla\log(f(\bm{x}))\rVert\to 0$ as $\lVert\bm{x}\rVert\to\infty$.

The quadratic score and pseudospherical score are directly suited to the multivariate setting:

$$\begin{aligned}
\mathrm{QuadS}(F,\bm{y}) &= \lVert f\rVert_{2}^{2}-2f(\bm{y});\\
\mathrm{PseudoS}(F,\bm{y}) &= -f(\bm{y})^{\alpha-1}/\lVert f\rVert_{\alpha}^{\alpha-1},
\end{aligned}$$

where $\lVert f\rVert_{\alpha}=\left(\int_{\mathbb{R}^{d}}f(\bm{y})^{\alpha}\,\mathrm{d}\bm{y}\right)^{1/\alpha}$. The quadratic score is strictly proper relative to the class $\mathcal{L}_{2}(\mathbb{R}^{d})$. The pseudospherical score is strictly proper relative to the class $\mathcal{L}_{\alpha}(\mathbb{R}^{d})$.

Additionally, other multivariate scoring rules have been proposed, including the marginal-copula score (Ziel and Berk, 2019) and wavelet-based scoring rules (see, e.g., Buschow et al. 2019). These scoring rules will be briefly mentioned in Section 4 in light of the proper scoring rule construction framework proposed in this article. Appendix B provides formulas for the expected multivariate scoring rules presented above.

2.4 Spatial verification tools

Spatial forecasts are a very important group of multivariate forecasts as they are involved in various applications (e.g., weather or renewable energy forecasting). Spatial fields are often characterized by high dimensionality and potentially strong correlations between neighboring locations. These characteristics make the verification of spatial forecasts very demanding, for example in terms of discriminating misspecified dependence structures. In the case of spatial forecasts, it is known that traditional verification methods (e.g., gridpoint-by-gridpoint verification) may result in a double penalty. The double-penalty effect was pinpointed in Ebert (2008) and refers to the fact that if a forecast presents a spatial (or temporal) shift with respect to observations, the error is penalized twice: once where the event was observed and again where the forecast predicted it. In particular, high-resolution forecasts are penalized more than less realistic blurry forecasts. The double-penalty effect may also affect spatio-temporal forecasts in general.
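The double-penalty effect can be reproduced on a toy one-dimensional field. The example below is a deliberately simple sketch with deterministic binary fields and a gridpoint-by-gridpoint squared error. The field size and event locations are arbitrary assumptions.

```python
import numpy as np

def gridpoint_squared_error(forecast, observation):
    """Sum of squared gridpoint-by-gridpoint differences."""
    return float(np.sum((forecast - observation) ** 2))

# Observed event at gridpoint 3 on a field of 8 gridpoints.
observation = np.zeros(8)
observation[3] = 1.0

# Forecast with the right event, displaced by one gridpoint.
shifted_forecast = np.zeros(8)
shifted_forecast[4] = 1.0

# Forecast that predicts no event at all.
null_forecast = np.zeros(8)

err_shifted = gridpoint_squared_error(shifted_forecast, observation)
err_null = gridpoint_squared_error(null_forecast, observation)
print("shifted:", err_shifted, "null:", err_null)
```

The shifted forecast is penalized once for the missed event and once for the false alarm, so it scores worse than the forecast predicting no event, even though it is arguably more informative.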

In parallel with the development of scoring rules, various application-focused spatial verification methods have been developed to evaluate weather forecasts. The efforts toward improving spatial verification methods have been guided by two projects: the intercomparison project (ICP; Gilleland et al. 2009) and its second phase, called Mesoscale Verification Intercomparison over Complex Terrain (MesoVICT; Dorninger et al. 2018). These projects resulted in the comparison of spatial verification methods with a particular focus on understanding their limitations and clarifying their interpretability. Only a few links exist between the approaches studied in these projects (and the work they induced) and the proper scoring rules framework. In particular, Casati et al. (2022) noted "a lack of representation of novel spatial verification methods for ensemble prediction systems". In general, there is a clear lack of methods focusing on the spatial verification of probabilistic forecasts. To help bridge the gap between the two communities, we recall the approaches of spatial verification tools in the light of the scoring rule framework introduced above.

One of the goals of the ICP was to provide insights on how to develop methods robust to the double-penalty effect. In particular, Gilleland et al. (2009) proposed a classification of spatial verification tools, updated later in Dorninger et al. (2018), resulting in a five-category classification. The classes differ in the computing principle they rely on. Not all spatial verification tools mentioned in these studies can be applied to probabilistic forecasts; some of them can solely be applied to deterministic forecasts. In the following description of the classes, we try to focus on methods suited to probabilistic forecasts or at least the special case of ensemble forecasts.

Neighborhood-based methods consist of applying a smoothing filter to the forecast and observation fields to prevent the double-penalty effect. The smoothing filter can take various forms (e.g., a minimum, a maximum, a mean, or a Gaussian filter) and be applied over a given neighborhood. For example, Stein and Stoop (2022) proposed a neighborhood-based CRPS for ensemble forecasts gathering forecasts and observations made within the neighborhood of the location considered. The use of a neighborhood prevents the double-penalty effect from taking place at scales smaller than that of the neighborhood. In this general definition, neighborhood-based methods can lead to proper scoring rules; see, in particular, the notion of patches in Section 4.
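The effect of a smoothing filter can be illustrated on a toy one-dimensional field. The sketch below applies a mean filter of width 3 to deterministic forecast and observation fields before computing a squared error. It is only an illustration of the neighborhood principle under these assumptions, not the neighborhood-based CRPS of Stein and Stoop (2022), and all names are hypothetical.

```python
import numpy as np

def mean_filter(field, width=3):
    """Moving-average filter over a neighborhood of the given width."""
    kernel = np.ones(width) / width
    return np.convolve(field, kernel, mode="same")

def neighborhood_squared_error(forecast, observation, width=3):
    """Squared error between the smoothed forecast and observation fields."""
    diff = mean_filter(forecast, width) - mean_filter(observation, width)
    return float(np.sum(diff ** 2))

observation = np.zeros(8)
observation[3] = 1.0          # observed event
shifted_forecast = np.zeros(8)
shifted_forecast[4] = 1.0     # same event, displaced by one gridpoint
null_forecast = np.zeros(8)   # no event predicted

nse_shifted = neighborhood_squared_error(shifted_forecast, observation)
nse_null = neighborhood_squared_error(null_forecast, observation)
print("shifted:", nse_shifted, "null:", nse_null)
```

Without smoothing, the displaced forecast would receive a larger gridpoint squared error than the forecast predicting no event. After smoothing, it is ranked better because the displacement is smaller than the neighborhood.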

Scale-separation techniques denote methods for which the verification is obtained after comparing forecast and observation fields across different scales. The scale-separation process can be seen as several single-bandpass spatial filters (e.g., projection onto a wavelet basis, as in wavelet-based scoring rules; Buschow et al. 2019). However, in order to obtain proper scoring rules, the comparison of the scale-specific characteristics needs to be performed using a proper scoring rule. Section 4 provides a discussion on wavelet-based scoring rules and their propriety.

Object-based methods rely on the identification of objects of interest and the comparison of the objects obtained in the forecast and observation fields. Object identification is application-dependent and can take the form of objects that forecasters are familiar with (e.g., storm cells for precipitation forecasts). A well-known verification tool within this class is the structure-amplitude-location (SAL; Wernli et al. 2008) method which has been generalized to ensemble forecasts in Radanovics et al. (2018). The three components of the ensemble SAL do not lead to proper scoring rules. They rely on the mean of the forecast within scoring functions inconsistent with the mean. Thus, the ideal forecast does not minimize the expected value. Nonetheless, the three components of the SAL method could be adapted to use proper scoring rules sensitive to the misspecification of the same features.

Field-deformation techniques consist of deforming the forecast field into the observation field (the similarity between the fields can be ensured by a metric of interest). The field of distortion associated with the morphing of the forecast field into the observation field becomes a measure of the predictive performance of the forecast (see, e.g., Han and Szunyogh 2018).

Distance measures compare binary images of the forecast and observation fields, such as exceedances of a threshold of interest. These methods are inspired by developments in image processing (e.g., Baddeley's delta measure; Gilleland 2011).

These five categories partially overlap, as it can be argued that some methods belong to multiple categories (e.g., some distance measure techniques can be seen as a mix of field-deformation and object-based methods). They define different principles that can be used to build verification tools that are not subject to the double-penalty effect. The reader may refer to Dorninger et al. (2018) and references therein for details on the classification and the spatial verification methods not used thereafter. The frontier between the aforementioned spatial verification methods and the proper scoring rules framework is porous with, for example, wavelet-based scoring rules belonging to both. It appears that numerous spatial verification methods seek interpretability and we believe that this is not incompatible with the use of proper scoring rules. We propose the following framework to facilitate the construction of interpretable proper scoring rules.

3 A framework for interpretable proper scoring rules

We define a framework to design proper scoring rules for multivariate forecasts. Its definition is motivated by remarks on the multivariate forecasts literature and operational use. There seems to be a growing consensus around the fact that no single verification method has it all (see, e.g., Bjerregård et al. 2021). Most of the studies comparing forecast verification methods highlight that verification procedures should not be reduced to the use of a single method and that each procedure needs to be well suited to the context (see, e.g., Scheuerer and Hamill 2015; Thorarinsdottir and Schuhen 2018). Moreover, from a more theoretical point of view, (strict) propriety does not ensure discrimination ability and different (strictly) proper scoring rules can lead to different rankings of sub-efficient forecasts.

Standard verification procedures gradually increase the complexity of the quantities verified. Procedures often start by verifying simple quantities such as quantiles, mean, or binary events (e.g., prediction of dry/wet events for precipitation). If multiple forecasts have a satisfying performance for these quantities, marginal distributions of the multivariate forecast can be verified using univariate scoring rules. Finally, multivariate-related quantities, such as the dependence structure, can be verified through multivariate scoring rules. Forecasters rely on multiple verification methods to evaluate a forecast and ideally, the verification method should be interpretable by targeting specific aspects of the distribution or thanks to the forecaster’s experience. This type of verification procedure allows the forecaster to understand what characterizes the predictive performance of a forecast instead of directly looking at a strictly proper scoring rule giving an encapsulated summary of the predictive performance.

Various multivariate forecast calibration methods rely on the calibration of univariate quantities obtained by dimension reduction techniques. As the general principle of multivariate calibration relies on studying the calibration of quantities obtained by pre-rank functions, Allen et al. (2024) argue that calibration procedures should not rely on a single pre-rank function and should instead use multiple simple pre-rank functions and leverage the interpretability of the associated PIT/rank histograms. A similar principle can be applied to increase the interpretability of verification methods based on scoring rules.

As general multivariate strictly proper scoring rules fail to discriminate forecasts with respect to arbitrary misspecifications and may lead to different rankings of sub-efficient forecasts, multivariate verification could benefit from using multiple proper scoring rules targeting specific aspects of the forecasts. Forecasters then know which aspects of the observations are well predicted by the forecast and can update their forecast, or select the best forecast among several, in light of this better understanding. To facilitate the construction of interpretable proper scoring rules, we define a framework based on two principles: transformation and aggregation.

The transformation principle consists of transforming both forecast and observation before applying a scoring rule. Heinrich-Mertsching et al. (2021) introduced this general principle in the context of point processes. In particular, they present scoring rules based on summary statistics targeting the clustering behavior or the intensity of the processes. In a more general context, the use of transformations has been scattered throughout the literature for several years (see Section 4). Proposition 1 shows how transformations can be used to construct proper scoring rules.

Proposition 1.

Let $\mathcal{F}\subset\mathcal{P}(\mathbb{R}^{d})$ be a class of Borel probability measures on $\mathbb{R}^{d}$, and let $F\in\mathcal{F}$ be a forecast and $\bm{y}\in\mathbb{R}^{d}$ an observation. Let $T:\mathbb{R}^{d}\to\mathbb{R}^{k}$ be a transformation and let $\mathrm{S}$ be a scoring rule on $\mathbb{R}^{k}$ that is proper relative to $T(\mathcal{F})=\{\mathcal{L}(T(\bm{X})),\ \bm{X}\sim F\in\mathcal{F}\}$. Then, the scoring rule

$$\mathrm{S}_{T}(F,\bm{y})=\mathrm{S}(T(F),T(\bm{y}))$$

is proper relative to $\mathcal{F}$. If $\mathrm{S}$ is strictly proper relative to $T(\mathcal{F})$ and $T$ is injective, then the resulting scoring rule $\mathrm{S}_{T}$ is strictly proper relative to $\mathcal{F}$.

To gain interpretability, it is natural to have dimension-reducing transformations (i.e., $k<d$), which generally leads to $T$ not being injective and $\mathrm{S}_{T}$ not being strictly proper. Nonetheless, as expressed previously, interpretability is important and it can mostly be leveraged if the transformation simplifies the multivariate quantities. In particular, it is generally preferred to choose $k=1$ to make the quantity easier to interpret and to focus on specific information contained in the forecast or the observation. Straightforward transformations can be projections on a $k$-dimensional margin or a summary statistic relevant to the forecast type, such as the total over a domain in the case of precipitation. Simple transformations may be preferred for their interpretability, and their potential lack of discriminatory power can be compensated for by using several simple transformations. Numerous examples of transformations are presented, discussed, and linked to the literature in Section 4. The proof of Proposition 1 is provided in Appendix C.1.
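As an illustration of the transformation principle, the following minimal Python sketch builds $\mathrm{S}_{T}$ from a univariate score and a dimension-reducing transformation for ensemble forecasts. The helper names (`crps_ensemble`, `transformed_score`) and the toy data are hypothetical, and the CRPS is assumed to be computed for the empirical distribution of the ensemble through its kernel form $\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X'|$.

```python
def crps_ensemble(members, y):
    """CRPS of the empirical distribution of `members` at observation `y`.

    Uses the kernel form E|X - y| - 0.5 * E|X - X'| with X, X' drawn
    independently from the ensemble.
    """
    m = len(members)
    term_obs = sum(abs(x - y) for x in members) / m
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term_obs - term_spread


def transformed_score(score, transform):
    """Build S_T(F, y) = S(T(F), T(y)) for ensemble forecasts.

    T(F) is represented by applying `transform` to every ensemble member.
    """
    def score_t(members, y):
        return score([transform(x) for x in members], transform(y))
    return score_t


# Hypothetical example: d = 3 grid points, T = total over the domain (k = 1).
crps_total = transformed_score(crps_ensemble, sum)

ensemble = [(0.0, 1.0, 2.0), (1.0, 1.0, 1.0), (2.0, 0.0, 4.0)]
observation = (1.0, 2.0, 1.0)
print(crps_total(ensemble, observation))
```

Because the total is not injective, two ensembles with identical totals receive the same score whatever their spatial structure, which is the loss of strict propriety mentioned above.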

The second principle is the aggregation of scoring rules. Aggregation combines several scoring rules into a single scoring rule summarizing the evaluation. It can operate on scoring rules acting on different spaces, times or locations. Note that Dawid and Musio (2014) introduced the notion of composite score, which is related to the aggregation principle but is closer to the combined application of both principles. Proposition 2 presents a general aggregation principle to build proper scoring rules. This principle has been known since proper scoring rules were introduced.

Proposition 2.

Let $\mathcal{S}=\{\mathrm{S}_{i}\}_{1\leq i\leq m}$ be a set of proper scoring rules relative to $\mathcal{F}\subset\mathcal{P}(\mathbb{R}^{d})$. Let $\bm{w}=\{w_{i}\}_{1\leq i\leq m}$ be nonnegative weights. Then, the scoring rule

$$\mathrm{S}_{\mathcal{S},\bm{w}}(F,\bm{y})=\sum_{i=1}^{m}w_{i}\mathrm{S}_{i}(F,\bm{y})$$

is proper relative to $\mathcal{F}$. If at least one scoring rule $\mathrm{S}_{i}$ is strictly proper relative to $\mathcal{F}$ and $w_{i}>0$, the aggregated scoring rule $\mathrm{S}_{\mathcal{S},\bm{w}}$ is strictly proper relative to $\mathcal{F}$.

It is worth noting that Proposition 2 does not impose any strict condition on the scoring rules used. For example, the aggregated scoring rules need not be of the same type, nor expressed in the same units. Aggregated scoring rules can be used to summarize the evaluation of univariate probabilistic forecasts (e.g., aggregation of CRPS at different locations) or to summarize complementary scoring rules (e.g., aggregation of Brier score and a threshold-weighted CRPS). Unless stated otherwise, for simplicity, we will restrict ourselves to cases where the aggregated scoring rules are of the same type. Bolin and Wallin (2023) showed that the aggregation of scoring rules can lead to unintuitive behaviors. For the aggregation of univariate scoring rules, they showed that scoring rules do not necessarily have the same dependence on the scale of the forecasted phenomenon: this leads to scoring rules putting more (or less) emphasis on the forecasts with larger scales. They propose local scale-invariant scoring rules as a way to obtain scale-agnostic scoring rules. When performing aggregation, it is important to be aware of potential preferences or biases of the scoring rules.
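The weighted sum of Proposition 2 can be sketched in a few lines of Python. This is only an illustration for ensemble forecasts: the function names, the threshold, and the weights are hypothetical, and the Brier score is assumed to be evaluated on the ensemble exceedance frequency.

```python
def crps_ensemble(members, y):
    """CRPS of an ensemble in kernel form: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    term_obs = sum(abs(x - y) for x in members) / m
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term_obs - term_spread


def brier_exceedance(members, y, threshold):
    """Brier score of the binary event {value > threshold}."""
    p = sum(x > threshold for x in members) / len(members)
    o = 1.0 if y > threshold else 0.0
    return (p - o) ** 2


def aggregate(rules, weights):
    """Weighted sum of scoring rules sharing the signature rule(members, y)."""
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")

    def aggregated(members, y):
        return sum(w * rule(members, y) for rule, w in zip(rules, weights))
    return aggregated


# Hypothetical combination: CRPS plus twice the Brier score at threshold 1.0.
combined = aggregate(
    [crps_ensemble, lambda members, y: brier_exceedance(members, y, 1.0)],
    [1.0, 2.0],
)
print(combined([0.0, 1.0, 3.0], 2.0))
```

Here the CRPS carries the units of the forecast variable while the Brier score is unitless, so the weights implicitly set an exchange rate between the two components; this is one concrete form of the scale sensitivity discussed above.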

We only consider aggregation of proper scoring rules through a weighted sum. To conserve (strict) propriety of scoring rules, aggregations can take, more generally, the form of (strictly) isotonic transformations, such as a multiplicative structure when positive scoring rules are considered (Ziel and Berk, 2019).

The two principles of Proposition 1 and Proposition 2 can be used simultaneously to create proper scoring rules based on both transformations and aggregation as presented in Corollary 1.

Corollary 1.

Let $\mathcal{T}=\{T_{i}\}_{1\leq i\leq m}$ be a set of transformations from $\mathbb{R}^{d}$ to $\mathbb{R}^{k}$. Let $\mathcal{S}_{\mathcal{T}}=\{\mathrm{S}_{T_{i}}\}_{1\leq i\leq m}$ be a set of proper scoring rules where $\mathrm{S}$ is proper relative to $T_{i}(\mathcal{F})$, for all $1\leq i\leq m$. Let $\bm{w}=\{w_{i}\}_{1\leq i\leq m}$ be nonnegative weights. Then, the scoring rule

$$\mathrm{S}_{\mathcal{S}_{\mathcal{T}},\bm{w}}(F,\bm{y})=\sum_{i=1}^{m}w_{i}\mathrm{S}_{T_{i}}(F,\bm{y})$$

is proper relative to $\mathcal{F}$.

Strict propriety relative to $\mathcal{F}$ of the resulting scoring rule is obtained as soon as there exists $1\leq i\leq m$ such that $\mathrm{S}$ is strictly proper relative to $T_{i}(\mathcal{F})$, $T_{i}$ is injective and $w_{i}>0$. The result of Corollary 1 can be extended to transformations with images in different dimensions and paired with different scoring rules (see Appendix D).
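The argument behind Corollary 1 can be sketched in one chain of (in)equalities. The sketch assumes negatively oriented scoring rules (lower is better) and that all expectations are finite; for $F,G\in\mathcal{F}$, linearity of the expectation, the change of variable $\bm{Z}=T_{i}(\bm{Y})$, and propriety of $\mathrm{S}$ relative to $T_{i}(\mathcal{F})$ give

```latex
\begin{aligned}
\mathbb{E}_{\bm{Y}\sim G}\big[\mathrm{S}_{\mathcal{S}_{\mathcal{T}},\bm{w}}(F,\bm{Y})\big]
&= \sum_{i=1}^{m} w_{i}\,\mathbb{E}_{\bm{Y}\sim G}\big[\mathrm{S}(T_{i}(F),T_{i}(\bm{Y}))\big] \\
&= \sum_{i=1}^{m} w_{i}\,\mathbb{E}_{\bm{Z}\sim T_{i}(G)}\big[\mathrm{S}(T_{i}(F),\bm{Z})\big] \\
&\geq \sum_{i=1}^{m} w_{i}\,\mathbb{E}_{\bm{Z}\sim T_{i}(G)}\big[\mathrm{S}(T_{i}(G),\bm{Z})\big]
= \mathbb{E}_{\bm{Y}\sim G}\big[\mathrm{S}_{\mathcal{S}_{\mathcal{T}},\bm{w}}(G,\bm{Y})\big].
\end{aligned}
```

The inequality uses $w_{i}\geq 0$ term by term, which is why nonnegative weights are required.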

As we will see in the examples developed in the following section, numerous scoring rules used in the literature are based on these two principles of aggregation and transformation.


Decomposition of kernel scoring rules.

We briefly discuss the link between the transformation and aggregation principles for scoring rules and the specific class of kernel scoring rules. A kernel on $\mathbb{R}^{d}$ is a measurable function $\rho:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$ satisfying the following two properties:

  • i) (symmetry) $\rho(\bm{x}_{1},\bm{x}_{2})=\rho(\bm{x}_{2},\bm{x}_{1})$ for all $\bm{x}_{1},\bm{x}_{2}\in\mathbb{R}^{d}$;

  • ii) (non-negativity) $\sum_{1\leq i,j\leq n}a_{i}a_{j}\rho(\bm{x}_{i},\bm{x}_{j})\geq 0$ for all $\bm{x}_{1},\ldots,\bm{x}_{n}\in\mathbb{R}^{d}$ and $a_{1},\ldots,a_{n}\in\mathbb{R}$, for all $n\in\mathbb{N}$.

The kernel scoring rule $\mathrm{S}_{\rho}$ associated with the kernel $\rho$ is defined on the space of predictive distributions

$$\mathcal{P}_{\rho}=\left\{F\in\mathcal{P}(\mathbb{R}^{d})\colon\int\sqrt{\rho(x,x)}\,F(\mathrm{d}x)<+\infty\right\}$$

by

$$\mathrm{S}_{\rho}(F,\bm{y})=\frac{1}{2}\mathbb{E}_{F}[\rho(\bm{X},\bm{X}^{\prime})]-\mathbb{E}_{F}[\rho(\bm{X},\bm{y})]+\frac{1}{2}\rho(\bm{y},\bm{y}),\tag{15}$$

where $\bm{y}\in\mathbb{R}^{d}$ and $\bm{X},\bm{X}^{\prime}$ are independent random variables following $F$. Importantly, $\mathrm{S}_{\rho}$ is proper on $\mathcal{P}_{\rho}$ and, for an ensemble forecast $F=\frac{1}{M}\sum_{m=1}^{M}\delta_{\bm{x}_{m}}$ with $M$ members $\bm{x}_{1},\ldots,\bm{x}_{M}$, it takes the simple form

$$\mathrm{S}_{\rho}(F,\bm{y})=\frac{1}{2M^{2}}\sum_{1\leq m_{1},m_{2}\leq M}\rho(\bm{x}_{m_{1}},\bm{x}_{m_{2}})-\frac{1}{M}\sum_{m=1}^{M}\rho(\bm{x}_{m},\bm{y})+\frac{1}{2}\rho(\bm{y},\bm{y}),\tag{16}$$

making kernel scoring rules particularly useful for ensemble forecasts.
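The ensemble form of a kernel scoring rule is straightforward to compute. The sketch below is a minimal, unoptimized Python version with hypothetical function names. It is written in the orientation under which the score is nonnegative (half a squared RKHS distance) and under which the kernel $|x_{1}|+|x_{2}|-|x_{1}-x_{2}|$ recovers the usual CRPS; this orientation is an assumption of the sketch.

```python
def kernel_score_ensemble(rho, members, y):
    """Kernel score of an ensemble, oriented so that lower is better.

    Computes 0.5 * mean rho(x_i, x_j) - mean rho(x_i, y) + 0.5 * rho(y, y),
    i.e. half the squared RKHS distance between the kernel mean embedding
    of the ensemble and the embedding of y.
    """
    m = len(members)
    cross = sum(rho(x, y) for x in members) / m
    within = sum(rho(a, b) for a in members for b in members) / (m * m)
    return 0.5 * within - cross + 0.5 * rho(y, y)


def crps_kernel(x1, x2):
    """Kernel |x1| + |x2| - |x1 - x2| associated with the CRPS."""
    return abs(x1) + abs(x2) - abs(x1 - x2)


def crps_ensemble(members, y):
    """Reference CRPS of an ensemble: E|X - y| - 0.5 * E|X - X'|."""
    m = len(members)
    term_obs = sum(abs(x - y) for x in members) / m
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * m * m)
    return term_obs - term_spread


members = [-1.0, 0.5, 2.0]
y = 0.25
print(kernel_score_ensemble(crps_kernel, members, y), crps_ensemble(members, y))
```

The double loop costs $O(M^{2})$ kernel evaluations, which is usually acceptable for operational ensemble sizes.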

The CRPS is surely the most widely used kernel scoring rule. Equation (6) shows that it is associated with the kernel $\rho(x_{1},x_{2})=|x_{1}|+|x_{2}|-|x_{1}-x_{2}|$ (the function $|x_{1}-x_{2}|$ is conditionally negative semi-definite, so that $\rho$ is non-negative). For more details on kernel scoring rules, the reader should refer to Gneiting et al. (2005) or Steinwart and Ziegel (2021).

The following proposition reveals that a kernel scoring rule can always be expressed as an aggregation of squared errors (SEs) between transformations of the forecast-observation pair.

Proposition 3.

Let $\mathrm{S}_{\rho}$ be the kernel scoring rule associated with the kernel $\rho$. Then there exists a sequence of transformations $T_{l}:\mathbb{R}^{d}\to\mathbb{R}$, $l\geq 1$, such that

$$\mathrm{S}_{\rho}(F,\bm{y})=\frac{1}{2}\sum_{l\geq 1}\mathrm{SE}(T_{l}(F),T_{l}(\bm{y})),$$

for every predictive distribution $F\in\mathcal{P}_{\rho}$ and observation $\bm{y}\in\mathbb{R}^{d}$.

In particular, the series on the right-hand side is always finite. The proof is provided in Appendix C.2 and relies on the reproducing kernel Hilbert space (RKHS) representation of kernel scoring rules. We will see that the sequence $(T_{l})_{l\geq 1}$ can be chosen as an orthonormal basis of the RKHS associated with the kernel $\rho$.

This representation of kernel scoring rules can help to understand more deeply how the predictive distribution $F$ and the observation $\bm{y}$ are compared. While the definition (15) is quite abstract, the series representation can be rewritten as

$$\mathrm{S}_{\rho}(F,\bm{y})=\frac{1}{2}\sum_{l\geq 1}\big(\mathbb{E}_{F}[T_{l}(\bm{X})]-T_{l}(\bm{y})\big)^{2}$$

with $\bm{X}$ a random variable following $F$. In other words, for $l\geq 1$, the observed value $T_{l}(\bm{y})$ is compared to the predicted value $T_{l}(\bm{X})$ under the predictive distribution $F$ using the SE; all these contributions are then aggregated in a series forming the kernel scoring rule.

To give more intuition, we study two important cases in dimension $d=1$. The details of the computations are provided in Appendix C.3. For the Gaussian kernel scoring rule associated with the kernel

$$\rho(x_{1},x_{2})=\exp\big(-(x_{1}-x_{2})^{2}/2\big),$$

some computations yield the series representation

$$\mathrm{S}_{\rho}(F,y)=\frac{1}{2}\sum_{l\geq 0}\frac{1}{l!}\Big(\mathbb{E}_{F}[X^{l}e^{-X^{2}/2}]-y^{l}e^{-y^{2}/2}\Big)^{2}$$

so that this score compares the probabilistic forecast $F$ and the observation $y$ through the transforms

$$T_{l}(x)=\frac{1}{\sqrt{l!}}x^{l}e^{-x^{2}/2},\quad l\geq 0.$$
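The Gaussian series representation can be checked numerically on an ensemble. The Python sketch below compares a truncated series with the closed-form kernel score; the function names, the truncation at 40 terms, and the toy ensemble are assumptions made for illustration, and the closed form is taken in the nonnegative orientation $\tfrac{1}{2}\mathbb{E}[\rho(X,X')]-\mathbb{E}[\rho(X,y)]+\tfrac{1}{2}\rho(y,y)$.

```python
import math


def gaussian_kernel(x1, x2):
    return math.exp(-((x1 - x2) ** 2) / 2)


def gaussian_score_closed_form(members, y):
    """0.5 * E rho(X, X') - E rho(X, y) + 0.5 * rho(y, y), with rho(y, y) = 1."""
    m = len(members)
    cross = sum(gaussian_kernel(x, y) for x in members) / m
    within = sum(gaussian_kernel(a, b) for a in members for b in members) / (m * m)
    return 0.5 * within - cross + 0.5


def gaussian_score_series(members, y, n_terms=40):
    """Truncated series 0.5 * sum_l (E[X^l e^{-X^2/2}] - y^l e^{-y^2/2})^2 / l!."""
    m = len(members)
    total = 0.0
    for l in range(n_terms):
        forecast_feature = sum(x ** l * math.exp(-x * x / 2) for x in members) / m
        obs_feature = y ** l * math.exp(-y * y / 2)
        total += (forecast_feature - obs_feature) ** 2 / math.factorial(l)
    return 0.5 * total


members = [-1.5, 0.0, 0.7, 2.0]
y = 0.5
print(gaussian_score_closed_form(members, y), gaussian_score_series(members, y))
```

The factor $1/l!$ makes the terms decay very quickly for moderate values, so a few dozen terms suffice here; for values far from the origin more terms would be needed.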

For the CRPS, a possible series representation is obtained thanks to the following wavelet basis of functions: let $T^{0}(x)=x\mathds{1}_{[0,1)}(x)+\mathds{1}_{[1,+\infty)}(x)$ (plateau function) and $T^{1}(x)=\big(1/2-|x-1/2|\big)\mathds{1}_{[0,1]}(x)$ (triangle function) and consider the collection of functions

$$T^{0}_{l}(x)=T^{0}(x-l),\quad T^{1}_{l,m}(x)=2^{-m/2}T^{1}(2^{m}x-l),\quad l\in\mathbb{Z},\ m\geq 0,$$

where $l\in\mathbb{Z}$ is a position parameter and $m\geq 0$ a scale parameter. Then, the CRPS can be written as

$$\begin{aligned}
\mathrm{CRPS}(F,y)&=\sum_{l\in\mathbb{Z}}\mathrm{SE}(T^{0}_{l}(F),T^{0}_{l}(y))+\sum_{l\in\mathbb{Z}}\sum_{m\geq 0}\mathrm{SE}(T^{1}_{l,m}(F),T^{1}_{l,m}(y))\\
&=\sum_{l\in\mathbb{Z}}\Big(\mathbb{E}_{F}[T^{0}(X-l)]-T^{0}(y-l)\Big)^{2}+\sum_{l\in\mathbb{Z}}\sum_{m\geq 0}2^{-m}\Big(\mathbb{E}_{F}[T^{1}(2^{m}X-l)]-T^{1}(2^{m}y-l)\Big)^{2}.
\end{aligned}$$

We can see that the CRPS compares forecast and observation through the SE after applying the plateau and triangle transformations for multiple positions and scales and then aggregates all the contributions.
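The plateau and triangle decomposition can also be verified numerically for an ensemble. The following Python sketch truncates the scale parameter at `max_scale` (an assumption of the sketch, as are the function names and the toy data) and restricts the positions to the range covered by the ensemble and the observation, since all other positions contribute exactly zero.

```python
import math


def plateau(x):
    """T^0: 0 below 0, identity on [0, 1), 1 from 1 onwards."""
    return min(max(x, 0.0), 1.0)


def triangle(x):
    """T^1: hat function supported on [0, 1] with peak 1/2 at x = 1/2."""
    return max(0.5 - abs(x - 0.5), 0.0)


def crps_ensemble(members, y):
    """Reference CRPS of an ensemble: E|X - y| - 0.5 * E|X - X'|."""
    n = len(members)
    term_obs = sum(abs(x - y) for x in members) / n
    term_spread = sum(abs(a - b) for a in members for b in members) / (2 * n * n)
    return term_obs - term_spread


def crps_wavelet_series(members, y, max_scale=12):
    """Truncated plateau/triangle series for the CRPS of an ensemble."""
    n = len(members)
    lo = math.floor(min(min(members), y))
    hi = math.ceil(max(max(members), y))
    total = 0.0
    # Plateau terms: positions outside [lo, hi) contribute exactly zero.
    for l in range(lo, hi):
        diff = sum(plateau(x - l) for x in members) / n - plateau(y - l)
        total += diff ** 2
    # Triangle terms, truncated at scales m < max_scale.
    for m in range(max_scale):
        scale = 2 ** m
        for l in range(lo * scale, hi * scale):
            forecast = sum(triangle(scale * x - l) for x in members) / n
            diff = forecast - triangle(scale * y - l)
            total += diff ** 2 / scale
    return total


members = [0.3, 1.1, 1.7, 2.4, -0.6]
y = 0.9
print(crps_wavelet_series(members, y), crps_ensemble(members, y))
```

Since every term of the series is nonnegative, the truncated sum approaches the CRPS from below as `max_scale` grows; the finer scales only refine the comparison in the immediate neighborhood of the ensemble members and of the observation.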

4 Applications of the transformation and aggregation principles

4.1 Projections

Certainly, the most direct type of transformation is the projection of forecasts and observations onto their $k$-dimensional marginals. We denote by $T_{i}$ the projection on the $i$-th component, such that $T_{i}(\bm{X})=X_{i}$ for all $\bm{X}\in\mathbb{R}^{d}$. This allows the forecaster to assess the predictive performance of a forecast for a specific univariate marginal independently of the other variables. If $\mathrm{S}$ is a univariate scoring rule proper relative to $\mathcal{P}(\mathbb{R})$, then Proposition 1 leads to $\mathrm{S}_{T_{i}}$ being proper relative to $\mathcal{P}(\mathbb{R}^{d})$. This "new" scoring rule can be useful if a given marginal is of particular interest (e.g., location of high interest in a spatial forecast). However, it can be more interesting to aggregate such scoring rules across all $1$-dimensional marginals. This leads to the following scoring rule

$$\mathrm{S}_{\mathcal{S}_{\mathcal{T}},\bm{w}}(F,\bm{y})=\sum_{i=1}^{d}w_{i}\mathrm{S}_{T_{i}}(F,\bm{y}),$$

where $\mathcal{S}_{\mathcal{T}}$ is $\{\mathrm{S}_{T_{i}}\}_{1\leq i\leq d}$. This setting is popular for assessing the performance of multivariate forecasts and we briefly present examples from the literature falling under this setting. Aggregation of CRPS (6) across locations and/or lead times is common practice for plots or comparison tables with uniform weights (Gneiting et al., 2005; Taillardat et al., 2016; Rasp and Lerch, 2018; Schulz and Lerch, 2022; Lerch and Polsterer, 2022; Hu et al., 2023) or with more complex schemes such as weights proportional to the cosine of the latitude (Ben Bouallègue et al., 2024b). The SE (2) and AE (3) can be aggregated to obtain RMSE and MAE, respectively (Delle Monache et al., 2013; Gneiting et al., 2005; Lerch and Polsterer, 2022; Pathak et al., 2022). Bremnes (2019) aggregated QSs (4) across stations and different quantile levels of interest with uniform weights. Note that the multivariate SE (12) can be rewritten as the sum of univariate SE across $1$-marginals: $\mathrm{SE}(F,\bm{y})=\lVert\bm{\mu}_{F}-\bm{y}\rVert^{2}_{2}=\sum_{i=1}^{d}\mathrm{SE}_{T_{i}}(F,\bm{y})$.
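The aggregation across $1$-dimensional marginals can be sketched as follows for an ensemble forecast. The function names, the two-point grid, and the latitudes are hypothetical; the weights follow the cosine-of-latitude scheme mentioned above, and the last lines check the decomposition of the multivariate SE into marginal SEs on the toy data.

```python
import math


def crps_ensemble(values, y):
    """CRPS of a univariate ensemble: E|X - y| - 0.5 * E|X - X'|."""
    m = len(values)
    term_obs = sum(abs(x - y) for x in values) / m
    term_spread = sum(abs(a - b) for a in values for b in values) / (2 * m * m)
    return term_obs - term_spread


def squared_error(values, y):
    """Univariate SE between the ensemble mean and the observation."""
    return (sum(values) / len(values) - y) ** 2


def marginal(members, i):
    """Projection T_i applied to every member: keeps component i only."""
    return [x[i] for x in members]


def aggregated_marginal_score(score, members, y, weights):
    """Weighted sum over i of score(T_i(F), T_i(y))."""
    return sum(
        w * score(marginal(members, i), y[i]) for i, w in enumerate(weights)
    )


# Hypothetical grid with two locations at latitudes 0 and 60 degrees.
latitudes = [0.0, 60.0]
cos_weights = [math.cos(math.radians(lat)) for lat in latitudes]
members = [(0.0, 0.0), (2.0, 4.0)]
y = (1.0, 1.0)

weighted_crps = aggregated_marginal_score(crps_ensemble, members, y, cos_weights)

# Multivariate SE as the unweighted sum of the marginal SEs.
mean = [sum(col) / len(members) for col in zip(*members)]
multivariate_se = sum((mu - obs) ** 2 for mu, obs in zip(mean, y))
summed_se = aggregated_marginal_score(squared_error, members, y, [1.0, 1.0])
print(weighted_crps, multivariate_se, summed_se)
```

Scores built this way only see the marginals, so they cannot distinguish forecasts that differ solely in their dependence structure.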

The second simplest choice is the $2$-dimensional case, which allows focusing on pairwise dependence. We denote by $T_{(i,j)}$ the projection on the $i$-th and $j$-th components (i.e., the $(i,j)$ pair of components) such that $T_{(i,j)}(\bm{X})=X_{i,j}=(X_{i},X_{j})$. In this setting, $\mathrm{S}$ has to be a bivariate proper scoring rule to construct a proper scoring rule $\mathrm{S}_{T_{(i,j)}}$. The aggregation of such scoring rules becomes

$$\mathrm{S}_{\mathcal{S}_{\mathcal{T}},\bm{w}}(F,\bm{y})=\sum_{\substack{i,j=1\\ i\neq j}}^{d}w_{i,j}\,\mathrm{S}_{T_{(i,j)}}(F,\bm{y}).$$

As suggested in Scheuerer and Hamill (2015) for the VS (14), the weights $w_{i,j}$ can be chosen appropriately to optimize the signal-to-noise ratio. For example, in a spatial setting where the dependence between locations is believed to decrease with the distance separating them, the weights $w_{i,j}$ can be chosen to be proportional to the inverse of the distance. This bivariate setting is less used in the literature; we present two articles using or mentioning scoring rules within this scope. In a general multivariate setting, Ziel and Berk (2019) suggest the use of a marginal-copula scoring rule where the copula score is the bivariate copula energy score (i.e., the aggregation of the energy scores across all the regularized pairs). To focus on the verification of the temporal dependence of spatio-temporal forecasts, Ben Bouallègue et al. (2024b) use the bivariate energy score over consecutive lead times.
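The pairwise aggregation with inverse-distance weights can be sketched as follows. This is only an illustrative sketch: the ensemble estimator of the energy score, the helper names and the toy coordinates are assumptions, and the sum runs over unordered pairs, which is equivalent to the sum over $i\neq j$ up to a factor of two when the weights are symmetric.

```python
import numpy as np
from itertools import combinations

def energy_score(ens, y):
    """Ensemble estimator of the energy score for an (m, k) ensemble."""
    ens = np.asarray(ens, dtype=float)
    y = np.asarray(y, dtype=float)
    term_obs = np.mean(np.linalg.norm(ens - y, axis=1))
    diffs = ens[:, None, :] - ens[None, :, :]
    term_spread = 0.5 * np.mean(np.linalg.norm(diffs, axis=2))
    return float(term_obs - term_spread)

def pairwise_energy_score(ens, y, coords):
    """Aggregate bivariate energy scores with weights 1 / distance.

    ens: (m, d) ensemble, y: (d,) observation, coords: (d, 2) locations.
    Unordered pairs are used (assumption: symmetric weights).
    """
    ens = np.asarray(ens, dtype=float)
    y = np.asarray(y, dtype=float)
    coords = np.asarray(coords, dtype=float)
    total = 0.0
    for i, j in combinations(range(len(y)), 2):
        weight = 1.0 / np.linalg.norm(coords[i] - coords[j])
        total += weight * energy_score(ens[:, [i, j]], y[[i, j]])
    return total

# Toy example (assumption): 3 locations, 20 members
rng = np.random.default_rng(1)
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
ens = rng.normal(size=(20, 3))
score_random = pairwise_energy_score(ens, y, coords)
score_perfect = pairwise_energy_score(np.tile(y, (5, 1)), y, coords)
```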

In a more general setup, we consider projection on $k$-dimensional marginals. In order to reduce the number of transformation-based scores to aggregate, it is standard to focus on localized marginals (e.g., belonging to patches of a given spatial size). Denote by $\mathcal{P}=\{P_{i}\}_{1\leq i\leq m}$ a set of valid patches (for some criterion or of a given size) and by $\mathcal{S}_{\mathcal{P}}$ the set of transformation-based scores associated with the projections on the patches $\mathcal{P}$. Given a multivariate scoring rule $\mathrm{S}$ proper relative to $\mathcal{P}(\mathbb{R}^{k})$, we can construct the following aggregated score:

$$\mathrm{S}_{\mathcal{S}_{\mathcal{P}},\bm{w}}(F,\bm{y})=\sum_{P\in\mathcal{P}}w_{P}\,\mathrm{S}_{P}(F,\bm{y}).$$

This construction can be used to create a scoring rule only considering the dependence of localized components, given that the patches are defined in that sense. The use of patches has similar benefits as the weighting of pairs given a belief on their correlations: obtain a better signal-to-noise ratio and improve the discrimination of the resulting scoring rule. For example, Pacchiardi et al. (2024) introduced patched energy scores as scoring rules to minimize in order to train a generative neural network. The patched energy scores are defined for $\mathrm{S}=\mathrm{ES}$ and square patches spaced by a given stride. Even though spatial patches may be more intuitive, it is possible to use temporal or spatio-temporal patches. Patch-based scoring rules appear as a natural member of the neighborhood-based methods of the spatial verification classification mentioned in Section 2.4. Given that the patches are correctly chosen (e.g., of a size appropriate to the problem at hand), patch-based scoring rules are not subject to the double-penalty effect.
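A patch-based aggregation of the energy score over square patches spaced by a stride can be sketched as below. This is a hedged illustration rather than the implementation of Pacchiardi et al. (2024): the uniform weights $w_{P}=1/|\mathcal{P}|$, the ensemble estimator of the energy score and the function names are assumptions.

```python
import numpy as np

def energy_score(ens, y):
    """Ensemble estimator of the energy score for an (m, k) ensemble."""
    term_obs = np.mean(np.linalg.norm(ens - y, axis=1))
    diffs = ens[:, None, :] - ens[None, :, :]
    term_spread = 0.5 * np.mean(np.linalg.norm(diffs, axis=2))
    return float(term_obs - term_spread)

def patched_energy_score(ens, y, patch_size, stride):
    """Average energy score over square patches (uniform weights assumed).

    ens: (m, n1, n2) ensemble of fields, y: (n1, n2) observed field.
    Returns the aggregated score and the number of patches used.
    """
    m, n1, n2 = ens.shape
    scores = []
    for r in range(0, n1 - patch_size + 1, stride):
        for c in range(0, n2 - patch_size + 1, stride):
            # Flatten each patch into a vector of dimension patch_size**2
            ens_patch = ens[:, r:r + patch_size, c:c + patch_size].reshape(m, -1)
            y_patch = y[r:r + patch_size, c:c + patch_size].ravel()
            scores.append(energy_score(ens_patch, y_patch))
    return float(np.mean(scores)), len(scores)

# Toy example (assumption): 10 members on an 8 x 8 grid
rng = np.random.default_rng(2)
y = rng.normal(size=(8, 8))
ens = rng.normal(size=(10, 8, 8))
score, n_patches = patched_energy_score(ens, y, patch_size=4, stride=2)
```

Temporal or spatio-temporal patches would follow the same pattern with a different slicing of the array.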

As noticeable by the low number of examples available in the literature, aggregation (and plain use) of scoring rules based on projection in dimension $k\geq 2$ is not standard practice, probably because such projections may lack interpretability. Instead, to assess the multivariate aspects of a forecast, scoring rules relying on summary statistics are often favored.

4.2 Summary statistics

Summary statistics are a central tool of statisticians’ toolboxes as they provide interpretable and understandable quantities that can be linked to the behavior of the phenomenon studied. Moreover, their interpretability can be enhanced by the forecaster’s experience and this can be leveraged when constructing scoring rules based on them. Summary statistics are commonly present during the verification procedure and this can be extended by the use of new scoring rules derived from any summary statistic of interest. For example, numerous summary statistics can come in handy when studying precipitation over a region covered by gridded observations and forecasts. Firstly, it is common practice to focus on binary events such as the exceedance of a threshold (e.g., the presence or absence of precipitation). This can be studied by using the BS (5) on all $1$-dimensional marginals, as mentioned in the previous subsection, but also in a multivariate manner through the fraction of threshold exceedances (FTE) over patches, as presented below. Regarding precipitation, it is standard to be interested in the prediction of total precipitation over a region or a time period. This transformation of the field can be leveraged to construct a scoring rule. Finally, it is important to verify that the spatial structure of the forecast matches the spatial structure of observations. The spatial structure can be (partially) summarized by the variogram or by wavelet transformations. The predictive performance for the spatial structure can be assessed by their associated scoring rules: the VS of order $p$ (14) and the wavelet-based score (Buschow et al., 2019). Other summary statistics can be of interest depending on the phenomenon studied; for instance, Heinrich-Mertsching et al. (2021) present summary statistics specific to point processes focusing on clustering and intensity.

The most well-known summary statistic is certainly the mean. In spatial statistics, it can be used to avoid double penalization when we are less interested in the exact location of the forecast but rather in a regional prediction. The transformation associated with the mean is

$$\mathrm{mean}_{P}(\bm{X})=\frac{1}{|P|}\sum_{i\in P}X_{i}, \tag{17}$$

where $P$ denotes a patch and $|P|$ its dimension. Proposition 1 ensures that this transformation can be used to construct proper scoring rules. The scoring rule involved in the construction has to be univariate; however, the choice depends on the general properties preferred. For example, the SE would focus on the mean of the transformed quantity, whereas the AE would target its median. It is worth noting that the total can be derived from the mean transformation by removing the prefactor:

$$\mathrm{total}_{P}(\bm{X})=\sum_{i\in P}X_{i}.$$

In the case of precipitation, the total is used more often than the mean since the total precipitation over a river basin can be decisive in evaluating flood risk. For example, one could construct an adapted version of the amplitude component of the SAL method (Wernli et al., 2008; Radanovics et al., 2018) using the SE if the mean total precipitation is of interest. Gneiting (2011) presents other links between the quantity of interest and the associated scoring rule. Similarly, the transformations associated with the minimum and the maximum over a patch $P$ can be obtained:

$$\begin{aligned}\mathrm{min}_{P}(\bm{X})&=\min_{i\in P}(X_{i});\\ \mathrm{max}_{P}(\bm{X})&=\max_{i\in P}(X_{i}).\end{aligned}$$

The maximum or minimum can be useful when considering extreme events. It can help understand if the severity of an event is well-captured. For example, as minimum and maximum temperatures affect crop yields (see, e.g., Agnolucci et al. 2020), it can be of particular interest that a weather forecast within an agricultural model correctly predicts the minimum and maximum temperatures. After studying the mean, it is natural to think of the moments of higher order. We can define the transformation associated with the variance over a patch $P$ as

$$\mathrm{Var}_{P}(\bm{X})=\frac{1}{|P|}\sum_{i\in P}(X_{i}-\mathrm{mean}_{P}(\bm{X}))^{2}.$$

The variance transformation can provide information on the fluctuations over a patch and be used to assess the quality of the local variability of the forecast. In a more general setup, it can be of interest to use a transformation related to the moment of order $n$, and the associated transformation follows naturally:

$$\mathrm{M}_{n,P}(\bm{X})=\frac{1}{|P|}\sum_{i\in P}X_{i}^{n}.$$

More application-oriented transformations are the central or standardized moments (e.g., skewness or kurtosis). Their transformations can be obtained directly from estimators. As underlined in Heinrich-Mertsching et al. (2021), since Proposition 1 applies to any transformation, there is no condition on having an unbiased estimator to obtain proper scoring rules.
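The patch-level summary statistics above (mean, total, minimum, maximum, variance and moment of order $n$) can be plugged into a univariate scoring rule as sketched below. This is a minimal sketch assuming an ensemble forecast; the helper names, the patch encoded as an index array and the toy data are hypothetical, and the SE and AE are evaluated on the mean and the median of the transformed ensemble, respectively.

```python
import numpy as np

# Transformations over a patch; the last axis indexes the patch components.
def patch_mean(x):
    return np.mean(x, axis=-1)

def patch_total(x):
    return np.sum(x, axis=-1)

def patch_min(x):
    return np.min(x, axis=-1)

def patch_max(x):
    return np.max(x, axis=-1)

def patch_var(x):
    # 1/|P| normalization, as in the variance transformation above
    return np.var(x, axis=-1)

def patch_moment(n):
    return lambda x: np.mean(np.asarray(x) ** n, axis=-1)

def transformed_se(ens, y, patch, transform):
    """SE applied to the transformed forecast and observation."""
    t_ens = transform(ens[:, patch])   # one value per ensemble member
    t_obs = transform(y[patch])
    return float((np.mean(t_ens) - t_obs) ** 2)

def transformed_ae(ens, y, patch, transform):
    """AE applied to the transformed forecast and observation."""
    t_ens = transform(ens[:, patch])
    t_obs = transform(y[patch])
    return float(abs(np.median(t_ens) - t_obs))

# Toy example (assumption): the patch covers the first four components
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
patch = np.array([0, 1, 2, 3])
# Three members equal to the observation shifted by -1, 0 and +1
ens = np.tile(y, (3, 1)) + np.array([[-1.0], [0.0], [1.0]])

se_total = transformed_se(ens, y, patch, patch_total)
se_var = transformed_se(ens, y, patch, patch_var)
ae_max = transformed_ae(ens, y, patch, patch_max)
```

In this toy example the shifts cancel on average, so the three scores vanish even though two of the three members differ from the observation at every grid point.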

Threshold exceedance plays an important role in decision making such as weather alerts. For example, MeteoSwiss’ heat warning levels are based on the exceedance of daily mean temperature over three consecutive days (Allen et al., 2023a). Such events can be defined by the simultaneous exceedance of a certain threshold, and the fraction of threshold exceedances (FTE) is the associated summary statistic:

$$\mathrm{FTE}_{P,t}(\bm{X})=\frac{1}{|P|}\sum_{i\in P}\mathds{1}_{\{X_{i}\geq t\}}. \tag{18}$$

FTEs can be used as an extension of univariate threshold exceedances and they prevent the double-penalty effect. FTEs may be used to target compound events (e.g., the simultaneous exceedances of a threshold at multiple locations of interest). Roberts and Lean (2008) used an FTE-based SE over different sizes of neighborhoods (patches) to verify at which scale forecasts become skillful. To assess extreme precipitation forecasts, Rivoire et al. (2023) introduce scores for extremes with temporal and spatial aggregation separately. Extreme events are defined as values higher than the seasonal $95\%$ quantile. In the subseasonal-to-seasonal range, the temporal patches are 7-day windows centered on the extreme event and the spatial patches are square boxes of 150 km $\times$ 150 km centered on the extreme event. The final scores are transformed BS (5) with a threshold of one event predicted across the patch.
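The FTE transformation (18) combined with the SE can be sketched as follows, together with a toy displacement error illustrating the double-penalty effect. The function names and the toy fields are assumptions made for illustration only.

```python
import numpy as np

def fte(x, t):
    """Fraction of threshold exceedances over the last axis."""
    return np.mean(np.asarray(x) >= t, axis=-1)

def fte_squared_error(ens, y, patch, t):
    """SE applied to the FTE of the forecast and of the observation."""
    return float((np.mean(fte(ens[:, patch], t)) - fte(y[patch], t)) ** 2)

def pointwise_brier_sum(ens, y, t):
    """Sum over grid points of the Brier score of the exceedance event."""
    prob = np.mean(ens >= t, axis=0)
    event = (y >= t).astype(float)
    return float(np.sum((prob - event) ** 2))

# Toy displacement error (assumption): the event is forecast one grid
# point away from where it is observed, within the same patch.
y = np.array([5.0, 0.0, 0.0, 0.0])
ens = np.tile(np.array([0.0, 5.0, 0.0, 0.0]), (10, 1))
patch = np.arange(4)

score_fte = fte_squared_error(ens, y, patch, t=1.0)
score_pointwise = pointwise_brier_sum(ens, y, t=1.0)
```

Here the pointwise aggregation penalizes the forecast twice (a miss and a false alarm), whereas the FTE-based SE over the patch is zero.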

Correctly predicting the dependence structure is crucial in multivariate forecasting. Variograms are summary statistics representing the dependence structure. The variogram of order $p$ of the pair $(i,j)$ corresponds to the following transformation:

$$\gamma_{ij}^{p}(\bm{X})=|X_{i}-X_{j}|^{p}.$$

As mentioned in the Introduction, using both the transformation and aggregation principles, we can recover the VS of order $p$ (14) introduced in Scheuerer and Hamill (2015):

$$\mathrm{VS}_{p}(F,\bm{y})=\sum_{i,j=1}^{d}w_{ij}\,\mathrm{SE}_{\gamma_{ij}^{p}}(F,\bm{y})=\sum_{i,j=1}^{d}w_{ij}\left(\mathbb{E}_{F}[|X_{i}-X_{j}|^{p}]-|y_{i}-y_{j}|^{p}\right)^{2}.$$

Along with the well-known VS of order $p$, Scheuerer and Hamill (2015) introduced alternatives where the scoring rule applied on the transformation is the CRPS (6) or the AE (3) instead of the SE (2). As mentioned previously, under the intrinsic hypothesis of Matheron (1963) (i.e., pairwise differences only depend on the distance between locations), the weights can be selected to obtain an optimal signal-to-noise ratio. Moreover, the weights could be selected to investigate a specific scale by giving a non-zero weight to pairs separated by a given distance.
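The VS of order $p$ can be sketched for an ensemble forecast as below, where the expectation is replaced by the ensemble average. The inverse-distance weights, the default $p=0.5$ and the function names are illustrative assumptions.

```python
import numpy as np

def inverse_distance_weights(coords):
    """Weights 1 / distance for distinct locations, 0 on the diagonal."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    weights = np.zeros_like(dist)
    off_diagonal = dist > 0
    weights[off_diagonal] = 1.0 / dist[off_diagonal]
    return weights

def variogram_score(ens, y, weights, p=0.5):
    """Ensemble estimator of the variogram score of order p.

    ens: (m, d) ensemble, y: (d,) observation, weights: (d, d) matrix.
    """
    ens = np.asarray(ens, dtype=float)
    y = np.asarray(y, dtype=float)
    # |X_i - X_j|^p for every member, then averaged over members
    ens_vario = np.mean(np.abs(ens[:, :, None] - ens[:, None, :]) ** p, axis=0)
    obs_vario = np.abs(y[:, None] - y[None, :]) ** p
    return float(np.sum(weights * (ens_vario - obs_vario) ** 2))

# Toy example (assumption): two locations at unit distance, one member
weights = inverse_distance_weights([[0.0], [1.0]])
y = np.array([0.0, 1.0])
ens = np.array([[0.0, 4.0]])
vs_order_1 = variogram_score(ens, y, weights, p=1.0)
vs_order_half = variogram_score(ens, y, weights, p=0.5)

# Adding a constant bias leaves pairwise differences unchanged
vs_biased = variogram_score(np.tile(y + 3.0, (5, 1)), y, weights)
```

The last line illustrates that the VS does not detect a constant bias, which is one reason to use it alongside scoring rules targeting the marginals.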

In the case of spatial forecasts over a grid of size $d\times d$, a spatial version of the variogram transformation is available:

$$\gamma_{\bm{i},\bm{j}}(\bm{X})=|X_{\bm{i}}-X_{\bm{j}}|^{p},$$

where $\bm{i},\bm{j}\in\mathcal{D}=\{1,\dots,d\}^{2}$ are the coordinates of grid points. Under the intrinsic hypothesis of Matheron (1963), the variogram between grid points separated by the vector $\bm{h}$ can be estimated by:

$$\gamma_{\bm{X}}(\bm{h})=\frac{1}{2|\mathcal{D}(\bm{h})|}\sum_{\bm{i}\in\mathcal{D}(\bm{h})}\gamma_{\bm{i},\bm{i}+\bm{h}}(\bm{X}),$$

where $\mathcal{D}(\bm{h})=\{\bm{i}\in\mathcal{D}:\bm{i}+\bm{h}\in\mathcal{D}\}$. This directed variogram can be used to target the verification of the anisotropy of the dependence structure. The isotropy transformation associated with the distance $h$ can be defined by

$$T_{\mathrm{iso},h}(\bm{X})=-\cfrac{\big(\gamma_{\bm{X}}((h,0))-\gamma_{\bm{X}}((0,h))\big)^{2}}{\cfrac{2\gamma_{\bm{X}}((h,0))^{2}}{|\mathcal{D}((h,0))|}+\cfrac{2\gamma_{\bm{X}}((0,h))^{2}}{|\mathcal{D}((0,h))|}}. \tag{19}$$

This transformation is the isotropy pre-rank function proposed in Allen et al. (2024). The isotropy transformation considers the orthogonal directions formed by the abscissa and ordinate axes and evaluates how the variogram changes between these directions. The transformation leads to negative or zero quantities with values close to zero characterizing isotropy and negative values corresponding to the anisotropy of the variograms in the directions and at the scale involved.
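The directed variogram estimator and the isotropy transformation (19) can be sketched for a single gridded field as follows. This is an illustrative sketch: the default order $p=2$, the restriction to non-negative integer lags and the function names are assumptions, and the transformation is undefined when both directed variograms vanish (e.g., for a constant field).

```python
import numpy as np

def directed_variogram(x, h, p=2):
    """Empirical variogram of a 2-D field x for the lag h = (h1, h2).

    Assumes non-negative integer lags. Returns the estimate and |D(h)|.
    """
    h1, h2 = h
    n1, n2 = x.shape
    base = x[: n1 - h1, : n2 - h2]      # grid points i with i + h in the grid
    shifted = x[h1:, h2:]               # corresponding points i + h
    n_pairs = base.size
    gamma = np.sum(np.abs(base - shifted) ** p) / (2 * n_pairs)
    return float(gamma), n_pairs

def isotropy_transform(x, h, p=2):
    """Isotropy transformation comparing the two grid-axis directions."""
    gamma_1, n_1 = directed_variogram(x, (h, 0), p)
    gamma_2, n_2 = directed_variogram(x, (0, h), p)
    denominator = 2 * gamma_1**2 / n_1 + 2 * gamma_2**2 / n_2
    return float(-((gamma_1 - gamma_2) ** 2) / denominator)

# Toy fields on a 4 x 4 grid (assumption)
rows, cols = np.meshgrid(np.arange(4.0), np.arange(4.0), indexing="ij")
field_isotropic = rows + cols      # same variation along both axes
field_anisotropic = rows           # variation along the first axis only

t_isotropic = isotropy_transform(field_isotropic, h=1)
t_anisotropic = isotropy_transform(field_anisotropic, h=1)
```

Applying this transformation to each ensemble member and to the observation, and then comparing the results with a univariate scoring rule, yields a transformation-based score targeting anisotropy.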

4.3 Other transformations

Transformations other than projections or summary statistics can be used to target forecast characteristics. For example, a transformation in the form of a change of coordinates or a change of scale (e.g., a logarithmic scale) can be used to obtain proper scoring rules. We highlight two families of scoring rules that can be seen as transformation-based scoring rules: wavelet-based scoring rules and threshold-weighted scoring rules.

Generally speaking, wavelet-based scoring rules are built thanks to a projection of forecast and observation fields onto a wavelet basis. Based on the wavelet coefficients, dimension reduction might be performed to target specific characteristics such as the dependence structure or the location. The resulting coefficients of the forecast fields are compared to the coefficients of the observation fields using scoring rules (e.g., squared error (SE) or energy score (ES)). Wavelet transformations are (complex) transformations, and thus, the associated scoring rules fall within the scope of Proposition 1. In particular, Buschow et al. (2019) used a dimension reduction procedure yielding a mean spectrum and a scale spectrum and used scoring rules to compare forecast and observation spectra. For example, the ES of the mean spectrum is used and shows good discrimination ability when the scale structure is misspecified.

Note that Buschow et al. (2019) proposed two other wavelet-based scoring rules: one based on the earth mover’s distance (EMD) of the scale histograms and one based on the distance in the scale histograms’ center of mass. The EMD-based scoring rules are not proper since the EMD is not a proper scoring rule (Thorarinsdottir et al., 2013) and the so-called distance between centers of mass is not a distance but rather a difference of position leading to an improper scoring rule. However, the ES-based scoring rules are proper and could be derived from scale histograms. Despite their apparent complexity, wavelet transformations make it possible to target interpretable characteristics such as the location (Buschow, 2022), the scale structure (Buschow et al., 2019; Buschow and Friederichs, 2020) or the anisotropy (Buschow and Friederichs, 2021). The transformations proposed for the deterministic forecast setting in most of these articles could be used as foundations for future work aiming to propose wavelet-based proper scoring rules targeting the location, the scale structure or the anisotropy.

As showcased in Heinrich-Mertsching et al. (2021) for a specific example and hinted at in Allen et al. (2024), transformations can also be used to emphasize certain outputs. Threshold weighting is one of the three main types of weighting conserving the propriety of scoring rules. Its name comes from the fact that it corresponds to a weighting over different thresholds in the case of the CRPS (7) (Gneiting, 2011). Recall that given a conditionally negative definite kernel $\rho$, the associated kernel score $\mathrm{S}_{\rho}$ (15) is proper relative to $\mathcal{P}_{\rho}$. Many popular scoring rules are kernel scores, such as the BS (5), the CRPS (6), the ES (13) and the VS (14). By definition (Allen et al., 2023b, Definition 4), threshold-weighted kernel scores are constructed as

$$\begin{aligned}\mathrm{tw}\mathrm{S}_{\rho}(F,\bm{y};v)&=\mathbb{E}_{F}[\rho(v(\bm{X}),v(\bm{y}))]-\frac{1}{2}\mathbb{E}_{F}[\rho(v(\bm{X}),v(\bm{X}^{\prime}))]-\frac{1}{2}\rho(v(\bm{y}),v(\bm{y}))\\ &=\mathrm{S}_{\rho}(v(F),v(\bm{y})),\end{aligned}$$

where $v$ is the chaining function capturing how the emphasis is put on certain outputs. With this explicit definition, it is clear that threshold-weighted kernel scores are covered by the framework of Proposition 1. It can be noted that Proposition 4 in Allen et al. (2023b) states that strict propriety of the kernel scoring rule is preserved by the chaining function $v$ if and only if $v$ is injective. Weighted scoring rules make it possible to emphasize particular outcomes: when studying extreme events, it is often of particular interest to focus on values larger than a given threshold $t$ and this can be achieved using the chaining function $v(x)=\mathds{1}_{x\geq t}$. Threshold-weighted scoring rules have been used in verification procedures in the literature; we illustrate their use through three different studies. Lerch and Thorarinsdottir (2013) aggregated the twCRPS across stations to compare the upper tail performance of different daily maximum wind speed forecasts. Chapman et al. (2022) aggregated the threshold-weighted CRPS across locations to study the improvement of statistical postprocessing techniques, the importance of predictors and the influence of the size of the training set on the performance. Allen et al. (2023a) used threshold-weighted versions of the CRPS, the ES, and the VS to compare the predictive performance of forecasts regarding heatwave severity; the scoring rules were aggregated across stations. Readers may refer to Allen et al. (2023a) and Allen et al. (2023b) for insightful reviews of weighted scoring rules in both univariate and multivariate settings.
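The construction of a threshold-weighted score as a kernel score applied after a chaining function can be sketched for the CRPS as follows. This is a minimal sketch assuming an ensemble forecast; the function names are hypothetical. With the indicator chaining function used in the text, the resulting score coincides with the Brier score of the exceedance event; the alternative chaining function $v(x)=\max(x,t)$ shown at the end is an additional illustrative choice.

```python
import numpy as np

def crps_ensemble(ens, y):
    """Ensemble estimator of the CRPS: E|X - y| - 0.5 * E|X - X'|."""
    ens = np.asarray(ens, dtype=float)
    term_obs = np.mean(np.abs(ens - y))
    term_spread = 0.5 * np.mean(np.abs(ens[:, None] - ens[None, :]))
    return float(term_obs - term_spread)

def tw_crps(ens, y, chaining):
    """CRPS applied to the forecast and observation transformed by v."""
    return crps_ensemble(chaining(np.asarray(ens, dtype=float)), chaining(y))

def indicator_chaining(t):
    """Chaining function v(x) = 1 if x >= t, 0 otherwise."""
    return lambda x: (np.asarray(x) >= t).astype(float)

def max_chaining(t):
    """Alternative chaining function v(x) = max(x, t) (illustrative)."""
    return lambda x: np.maximum(x, t)

ens = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0
t = 1.5

score_indicator = tw_crps(ens, y, indicator_chaining(t))
# Brier score of the event {X >= t}: (forecast probability - outcome)^2
brier = (np.mean(ens >= t) - float(y >= t)) ** 2
score_max = tw_crps(ens, y, max_chaining(t))
```

Since neither chaining function is injective, strict propriety is lost in both cases, in line with the result recalled above.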

5 Simulation study

This section provides simulated examples to showcase the different uses of the framework introduced in Section 3 to construct interpretable proper scoring rules for multivariate forecasts. Four examples are developed. Firstly, a setup where the emphasis is put on $1$-marginal verification is proposed. This setup serves as a means of recalling and showing the limitations of strictly proper scoring rules and the benefits of interpretable scoring rules in a concrete setting. Secondly, a standard multivariate setup is studied where popular multivariate scoring rules (i.e., VS and ES) are compared to a multivariate scoring rule aggregated over patches and an aggregation-and-transformation-based scoring rule in their discrimination ability regarding the dependence structure. Thirdly, a setup with anisotropy in both observations and forecasts is introduced. The anisotropic score is constructed based on the transformation principle with the goal of discriminating differences of anisotropy in the dependence structure between forecasts and observations. Fourthly, we propose a setup to test the sensitivity of scoring rules to the double-penalty effect and we introduce scoring rules that can be built to be resilient to some manifestations of the double-penalty effect.

In these four numerical experiments, the spatial field is observed and predicted on a regular $20\times 20$ grid $\mathcal{D}=\{1,\ldots,20\}\times\{1,\ldots,20\}$. Observations are realizations of a Gaussian random field $(G(s))_{s\in\mathcal{D}}$ with zero mean and power-exponential covariance defined as

$$\mathrm{cov}(G(s),G(s^{\prime}))=\sigma_{0}^{2}\exp\left(-\left(\frac{\lVert s-s^{\prime}\rVert}{\lambda_{0}}\right)^{\beta_{0}}\right),\quad s,s^{\prime}\in\mathcal{D}. \tag{20}$$

The parameters are set to $\sigma_{0}=1$, $\lambda_{0}=3$ and $\beta_{0}=1$.
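Observations from the model (20) can be simulated, for instance, through a Cholesky factorization of the covariance matrix. The sketch below is one possible way to do so and is not taken from the code accompanying this study; the function name, the random seed and the sampling method are assumptions.

```python
import numpy as np

def power_exponential_cov(coords, sigma=1.0, lam=3.0, beta=1.0):
    """Covariance matrix of the power-exponential model (20)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma**2 * np.exp(-((dist / lam) ** beta))

n = 20
# Grid points (i, j) with i, j in {1, ..., 20}, stored as a (400, 2) array
grid = np.array(
    [(i, j) for i in range(1, n + 1) for j in range(1, n + 1)], dtype=float
)
cov = power_exponential_cov(grid)

# If cov = L L^T and Z is standard Gaussian, then L Z has covariance cov
chol = np.linalg.cholesky(cov)
rng = np.random.default_rng(42)
n_obs = 500
samples = chol @ rng.standard_normal((n * n, n_obs))
observations = samples.T.reshape(n_obs, n, n)   # 500 fields on the grid
```

Forecast ensembles from misspecified models can be drawn in the same way by changing `sigma`, `lam` or `beta`, or by adding a constant mean.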

In each numerical experiment, we compare a few predictive distributions, including the distribution generating observations and other ones deviating from the generative distributions in a specific way. These different predictive distributions are evaluated with different scoring rules and the aim is to illustrate the discriminatory ability of the different scoring rules.

The simulation study uses 500 observations of the random field $(G(s))_{s\in\mathcal{D}}$. The scoring rules are computed using exact formulas when possible (see Appendix E), and, when exact formulas are not available, they are computed based on a sample of size 100 (i.e., ensemble forecasts) of the probabilistic forecast. Estimated expectations over the 500 observations are computed and the experiment is repeated 10 times. The corresponding results are represented by boxplots. The units of the scoring rules are rescaled by the average expected score of the true distribution (i.e., the ideal forecast). The statistical significance of the ranking between forecasts is tested using a Diebold-Mariano test (Diebold and Mariano, 1995). When deemed necessary, statistical significance is mentioned for a confidence level of 95%.
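The rescaling by the ideal forecast and the comparison of two forecasts can be sketched as follows. This is a simplified sketch and not the procedure implemented in the accompanying code: the Diebold-Mariano statistic is computed here under the assumption of serially uncorrelated score differences (plausible for independent simulated observations) with a normal approximation, and the simulated scores and function names are hypothetical.

```python
import math
import numpy as np

def diebold_mariano(scores_a, scores_b):
    """Two-sided Diebold-Mariano test for equal expected scores.

    Assumes serially uncorrelated score differences, so the variance of
    the mean difference is estimated by the sample variance divided by n.
    """
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    stat = diff.mean() / (diff.std(ddof=1) / math.sqrt(diff.size))
    # Two-sided p-value under the standard normal approximation
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return float(stat), float(p_value)

# Hypothetical per-observation scores for the ideal and a competing forecast
rng = np.random.default_rng(3)
scores_ideal = rng.gamma(shape=2.0, scale=1.0, size=500)
scores_other = scores_ideal + rng.normal(loc=0.3, scale=1.0, size=500)

# Rescaling by the average score of the ideal forecast
relative_score = scores_other.mean() / scores_ideal.mean()

stat, p_value = diebold_mariano(scores_other, scores_ideal)
significant = p_value < 0.05   # 95% confidence level
```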

The code used for the different numerical experiments is publicly available at https://github.com/pic-romain/aggregation-transformation.

5.1 Marginals

Figure 1: Expectation of aggregated univariate scoring rules: (a) the CRPS, (b) the quantile score, (c) the Brier score, and (d) the squared error and the Dawid-Sebastiani score, for the ideal forecast (light violet), a biased forecast (orange), an under-dispersed forecast (lighter blue), an over-dispersed forecast (darker blue) and a location-scale Student forecast (green). More details are available in the main text.

This first numerical experiment focuses on the prediction of the 1-dimensional marginal distributions and the aggregation of univariate scoring rules. For simplicity, we consider only stationary random fields so that the 1-dimensional marginal distribution is the same at all grid points. Although similar conclusions could be drawn from a univariate framework (i.e., with independent 1-dimensional rather than spatial observations), this example aims to clarify the notion of interpretability and presents notions that will be reused in the following examples. The verification of marginals, along with other simple quantities, is usually one of the first steps of any multivariate forecast verification process.

Observations follow the model of (20) and multiple competing forecasts are considered:

  • the ideal forecast is the Gaussian distribution generating the observations and is used as a reference;

  • the biased forecast is a Gaussian predictive distribution with the same covariance structure as the observations but a different mean $\mathbb{E}[F_{\mathrm{bias}}(s)]=c=0.255$;

  • the overdispersed forecast and the underdispersed forecast are Gaussian predictive distributions from the same model as the observations except for an overestimation ($\sigma=1.4$) and an underestimation ($\sigma=2/3$) of the variance, respectively;

  • the location-scale Student forecast has marginals that follow location-scale Student-$t$ distributions with parameters $\mu=0$, $\mathrm{df}=5$ and $\tau$ such that the standard deviation is $0.745$, and has the same covariance structure as in (20).

In order to compare the predictive performance of forecasts, we use scoring rules constructed by aggregating univariate scoring rules. Here, the aggregation is done with uniform weights since there is no prior knowledge on the locations. The univariate scoring rules considered are the continuous ranked probability score (CRPS), the Brier score (BS), the quantile score (QS), the squared error (SE) and the Dawid-Sebastiani score (DSS). Figure 1(a) compares five different forecasts based on their expected CRPS. It can be seen that all forecasts except for the ideal one have similar expected values and no sub-efficient forecast is significantly better than the others. In order to gain more insight into the predictive performance of the forecast, it is necessary to use other scoring rules. In practice, the distribution is unknown; thus, it is impossible to know if a forecast is efficient; it is only possible to provide a ranking linked to the closeness of the forecast with respect to the observations. The definition of closeness depends on the scoring rule used: for example, the CRPS defines closeness in terms of the integrated quadratic distance between the two cumulative distribution functions (see, e.g., Thorarinsdottir and Schuhen 2018).
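For Gaussian marginals, the CRPS has a well-known closed form, so the aggregated CRPS with uniform weights can be sketched as below. The function names are hypothetical, the observations in the usage lines are independent standard normal draws used as a stand-in for the spatial field, and the formulas actually used in the experiments are those of Appendix E.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)   # element-wise error function without SciPy

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of the forecast N(mu, sigma^2) at observation(s) y."""
    z = (np.asarray(y, dtype=float) - mu) / sigma
    pdf = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def aggregated_crps(mu, sigma, obs):
    """Aggregation over grid points with uniform weights (a plain average)."""
    return float(np.mean(crps_gaussian(mu, sigma, obs)))

# Stand-in observations: independent N(0, 1) draws instead of the spatial field.
obs = np.random.default_rng(0).standard_normal(20000)
score_ideal = aggregated_crps(0.0, 1.0, obs)
score_biased = aggregated_crps(0.255, 1.0, obs)
```

Since the CRPS is strictly proper, `score_ideal` is expected to be the smaller of the two for a large enough sample.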

If the quantity of interest is the value of a quantile of a certain level $\alpha$, the aggregated QS is an appropriate scoring rule. Figure 1(b) shows the expected aggregated QS for three different levels: $\alpha=0.5$, $\alpha=0.75$ and $\alpha=0.95$. The level $\alpha=0.5$ is associated with the prediction of the median and, since all the forecasts are symmetric and only the biased forecast is not centered on zero, the other forecasts are equally the best and efficient forecasts. If the third quartile is of interest ($\alpha=0.75$), the location-scale Student forecast appears as significantly the best (among the non-ideal forecasts). For the higher level $\alpha=0.95$, the biased forecast is significantly the best since its bias error seems to be compensated by its correct prediction of the variance. Depending on the level of interest, the best forecast varies; the only forecast that would appear to be the best regardless of the level $\alpha$ is the ideal forecast, as implied by (8).
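The aggregated QS of a Gaussian forecast can be sketched as follows. The sketch assumes the common pinball-loss form $\mathrm{QS}_{\alpha}(F,y)=(\mathbb{1}\{y\leq q\}-\alpha)(q-y)$ with $q$ the forecast quantile of level $\alpha$; the exact convention of the quantile score in this paper may differ by a constant factor, and the function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np

def quantile_score(q, y, alpha):
    """Pinball loss of the quantile forecast q at observation(s) y."""
    y = np.asarray(y, dtype=float)
    return ((y <= q).astype(float) - alpha) * (q - y)

def aggregated_qs_gaussian(mu, sigma, obs, alpha):
    """Uniformly aggregated QS of the Gaussian forecast N(mu, sigma^2)."""
    q = mu + sigma * NormalDist().inv_cdf(alpha)   # Gaussian quantile of level alpha
    return float(np.mean(quantile_score(q, obs, alpha)))
```

For $\alpha=0.5$ only the forecast median enters the score, which is why forecasts sharing the correct median are indistinguishable at that level.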

If a quantity of interest is the exceedance of a threshold $t$ at each location, then the aggregated BS is an interesting scoring rule. Figure 1(c) shows the expectation of the aggregated BS for the different forecasts and for two different thresholds ($t=0.5$ and $t=1$). Among the non-ideal forecasts, there seems to be a clearer ranking than for the CRPS. The overdispersed forecast is significantly the best regarding the prediction of the exceedance of the threshold $t=0.5$ and the biased forecast is significantly the best regarding the exceedance of $t=1$. As for the aggregated quantile score, the best forecast depends on the threshold $t$ considered and the only forecast that is the best regardless of the threshold $t$ is the ideal one (see Eq. (7)).
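A minimal sketch of the aggregated BS for a Gaussian forecast is given below. It assumes the Brier score is written as the squared difference between the predicted exceedance probability and the exceedance indicator; the function names are hypothetical.

```python
from statistics import NormalDist

import numpy as np

def brier_score_gaussian(mu, sigma, y, t):
    """Brier score of N(mu, sigma^2) for the event {Y > t} at observation(s) y."""
    p_exceed = 1.0 - NormalDist(mu, sigma).cdf(t)        # predicted P(Y > t)
    occurred = (np.asarray(y, dtype=float) > t).astype(float)
    return (p_exceed - occurred) ** 2

def aggregated_bs_gaussian(mu, sigma, obs, t):
    """Uniform aggregation of the Brier score over grid points."""
    return float(np.mean(brier_score_gaussian(mu, sigma, obs, t)))
```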

If the moments are of interest, the aggregated SE discriminates the first moment (i.e., the mean) and the aggregated DSS discriminates the first two moments (i.e., the mean and the variance). Figure 1(d) presents the expected values of these scoring rules for the different forecasts considered in this example. The aggregated SEs of all forecasts, except the biased forecast, are equal since they have the same (correct) marginal means. The aggregated DSS presents the biased forecast as significantly the best one (among non-ideal). This is caused by the combined discrimination of the first two moments of the Dawid-Sebastiani score (see Eq. (9) and Appendix A).
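Because the SE and the DSS depend on the forecast only through its mean and variance, their expectations have simple closed forms, which makes the ranking easy to check. The sketch below assumes the usual forms $\mathrm{SE}(F,y)=(\mu_F-y)^2$ and $\mathrm{DSS}(F,y)=(y-\mu_F)^2/\sigma_F^2+2\log\sigma_F$ (the convention of Eq. (9) may differ by constants) and observations with marginal mean 0 and standard deviation 1; the function names are hypothetical.

```python
import math

def expected_se(mu_f, mu_true=0.0, sigma_true=1.0):
    """E[(Y - mu_f)^2] when Y has mean mu_true and variance sigma_true^2."""
    return sigma_true**2 + (mu_true - mu_f) ** 2

def expected_dss(mu_f, sigma_f, mu_true=0.0, sigma_true=1.0):
    """Expected DSS of a forecast with mean mu_f and standard deviation sigma_f."""
    return expected_se(mu_f, mu_true, sigma_true) / sigma_f**2 + 2.0 * math.log(sigma_f)

# (mean, standard deviation) of the marginals of the forecasts of this experiment
forecasts = {
    "ideal": (0.0, 1.0),
    "biased": (0.255, 1.0),
    "overdispersed": (0.0, 1.4),
    "underdispersed": (0.0, 2.0 / 3.0),
    "Student": (0.0, 0.745),
}
dss = {name: expected_dss(mu, sd) for name, (mu, sd) in forecasts.items()}
se = {name: expected_se(mu) for name, (mu, sd) in forecasts.items()}
```

Under these assumptions, only the biased forecast has an expected SE different from the ideal one, and it has the smallest expected DSS among the non-ideal forecasts, in line with Figure 1(d).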

5.2 Multivariate scores over patches

This second numerical experiment focuses on the prediction of the dependence structure. Observations are sampled from the model of Eq. (20) and we compare forecasts that differ only in their dependence structure through misspecification of the range parameter $\lambda$ and the smoothness parameter $\beta$:

  • the ideal forecast is the Gaussian distribution generating the observations;

  • the small-range forecast and the large-range forecast are Gaussian predictive distributions from the same model (20) as the observations except for an underestimation ($\lambda=1$) and an overestimation ($\lambda=5$), respectively, of the range;

  • the under-smooth forecast and the over-smooth forecast are Gaussian predictive distributions from the same model as the observations except for an underestimation ($\beta=0.5$) and an overestimation ($\beta=2$), respectively, of the smoothness.

Since the forecasts differ only in their dependence structure, scoring rules acting on the 1-dimensional marginals would not be able to distinguish the ideal forecast from the others. We use the variogram score (VS) as a reference since it is known to discriminate misspecification of the dependence structure. We introduce the patched energy score, which results from the aggregation of the ES (with $\alpha=1$) over patches, defined as

\[
\mathrm{ES}_{\mathcal{P},\bm{w}_{\mathcal{P}}}(F,\bm{y})=\sum_{P\in\mathcal{P}}w_{P}\,\mathrm{ES}_{1}(F_{P},\bm{y}_{P}),
\]

where $\mathcal{P}$ is a set of spatial patches, $w_{P}$ is the weight associated with a patch $P\in\mathcal{P}$ and $F_{P}$ is the marginal of $F$ over the patch $P$. In order to make the scoring rule more interpretable, only square patches of a given size $s$ are considered and the weights $w_{P}$ are uniform ($w_{P}=1/|\mathcal{P}|$). Moreover, we consider the aggregated CRPS and the ES since they are limiting cases of the patched ES for $1\times 1$ patches and a single patch over the whole domain $\mathcal{D}$, respectively. Additionally, we propose the $p$-variation score ($p$VS), which is based on the $p$-variation transformation to focus on the discrimination of the regularity of the random fields,

\[
T_{p\text{-var},\bm{s}}(\bm{X})=\left|\bm{X}_{\bm{s}+(1,1)}-\bm{X}_{\bm{s}+(1,0)}-\bm{X}_{\bm{s}+(0,1)}+\bm{X}_{\bm{s}}\right|^{p},
\]
\begin{align*}
p\mathrm{VS}(F,\bm{y}) &=\sum_{\bm{s}\in\mathcal{D}^{\ast}}w_{\bm{s}}\,\mathrm{SE}_{T_{p\text{-var},\bm{s}}}(F,\bm{y})\\
&=\sum_{\bm{s}\in\mathcal{D}^{\ast}}w_{\bm{s}}\left(\mathbb{E}_{F}[T_{p\text{-var},\bm{s}}(\bm{X})]-T_{p\text{-var},\bm{s}}(\bm{y})\right)^{2},
\end{align*}

where $\mathcal{D}^{\ast}$ is the domain $\mathcal{D}$ restricted to grid points such that $T_{p\text{-var},\bm{s}}$ is defined (i.e., $\mathcal{D}^{\ast}=\{1,\ldots,19\}\times\{1,\ldots,19\}$). Note that in the literature on fractional random fields, the $p$-variation is an important quantity used to characterize the roughness of a random field and is commonly used for estimation purposes; see Benassi et al. (2004), Basse-O'Connor et al. (2021) and the references therein.
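The $p$-variation transformation and the resulting score can be sketched as below. The experiments rely on the closed formulas of Appendix E; this sketch instead approximates $\mathbb{E}_{F}[T_{p\text{-var},\bm{s}}(\bm{X})]$ by an ensemble average and uses uniform weights, both of which are assumptions, as are the function names.

```python
import numpy as np

def p_variation(field, p):
    """|X[s+(1,1)] - X[s+(1,0)] - X[s+(0,1)] + X[s]|^p for every valid grid point s.

    `field` has shape (..., nx, ny); the result has shape (..., nx - 1, ny - 1).
    """
    increment = (field[..., 1:, 1:] - field[..., 1:, :-1]
                 - field[..., :-1, 1:] + field[..., :-1, :-1])
    return np.abs(increment) ** p

def p_variation_score(ensemble, obs, p):
    """Ensemble-based p-variation score with uniform weights.

    `ensemble` has shape (m, nx, ny) and `obs` has shape (nx, ny).
    """
    expected = p_variation(ensemble, p).mean(axis=0)   # estimate of E_F[T(X)]
    return float(np.mean((expected - p_variation(obs, p)) ** 2))
```

On a 20 × 20 grid the transformation returns a 19 × 19 array, which matches the restricted domain $\mathcal{D}^{\ast}$.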

Figure 2: Expectation of scoring rules focused on the dependence structure: (a) the variogram score, (b) the $p$-variation score and (c) the patched energy score (and its limiting cases: the aggregated CRPS and the energy score), for the ideal forecast (violet), the small-range forecast (lighter blue), the large-range forecast (darker blue), the under-smooth forecast (lighter orange), and the over-smooth forecast (darker orange). More details are available in the main text.

In Figure 2, the ES and the patched ES were computed using samples from the forecasts since closed expressions could not be derived. However, closed formulas for the VS and the $p$VS were derived and are available in Appendix E. As already shown in Scheuerer and Hamill (2015), the VS is able to significantly discriminate misspecification of the dependence structure induced by the range parameter $\lambda$ (see Fig. 2(a)). Smaller orders $p$ (such as $p=0.5$) appear more informative than higher ones. Moreover, the VS is able to discriminate misspecification induced by the smoothness parameter $\beta$ (significantly for all orders $p$ considered), even if this is less marked than for the misspecification of the range $\lambda$.

Figure 2(b) compares the forecasts using the $p$-variation score with $p\in\{0.5,1,2\}$. Note that the forecasts are provided in the same order as in the other sub-figures. The $p$VS is able to (significantly) discriminate all four sub-efficient forecasts from the ideal forecast at all orders $p$. In the cases considered, the $p$VS has a stronger discriminating ability than the VS, in particular for misspecification of the smoothness parameter $\beta$. The overall improvement in the discrimination ability of the $p$VS compared to the VS is due to the fact that it only considers local pair interactions between grid points, which, in the experimental setup considered, greatly improves the signal-to-noise ratio compared to the VS. This locality has a cost: for example, the $p$VS would be incapable of differentiating two forecasts that differ only in their longer-range dependence structure, whereas the VS should discriminate them.

Figure 2(c) shows that the patched ESs have a better discrimination ability than the ES. As expected from the clear analogy between the variogram score weights and the selection of valid patches, focusing on smaller patches improves the signal-to-noise ratio. For all patch sizes $s$ considered, the patched ES significantly discriminates the ideal forecast from the others, whereas the ES does not significantly discriminate the misspecification of smoothness of the under-smooth and over-smooth forecasts. Nonetheless, the patched ES remains less sensitive than the VS to misspecifications in the dependence structure through the range parameter $\lambda$ or the smoothness parameter $\beta$.
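The patched energy score can be estimated from an ensemble as sketched below. The sketch uses the standard ensemble estimator of the ES with $\alpha=1$ and, as an assumption, non-overlapping square patches by default (the `stride` argument allows overlapping ones); how the patches are laid out in the experiments is not specified here, and the function names are hypothetical.

```python
import numpy as np

def energy_score(ensemble, obs):
    """Ensemble estimate of ES_1: E||X - y|| - 0.5 E||X - X'||.

    `ensemble` has shape (m, d) and `obs` has shape (d,).
    """
    to_obs = np.linalg.norm(ensemble - obs, axis=1).mean()
    pairwise = np.linalg.norm(ensemble[:, None, :] - ensemble[None, :, :], axis=-1)
    return float(to_obs - 0.5 * pairwise.mean())

def patched_energy_score(ensemble, obs, size, stride=None):
    """Uniform average of the ES over square patches of side `size`.

    `ensemble` has shape (m, nx, ny) and `obs` has shape (nx, ny).
    """
    stride = size if stride is None else stride      # default: non-overlapping patches
    m, nx, ny = ensemble.shape
    scores = []
    for i in range(0, nx - size + 1, stride):
        for j in range(0, ny - size + 1, stride):
            patch_ens = ensemble[:, i:i + size, j:j + size].reshape(m, -1)
            patch_obs = obs[i:i + size, j:j + size].ravel()
            scores.append(energy_score(patch_ens, patch_obs))
    return float(np.mean(scores))
```

With `size=1` the function reduces to the aggregated (ensemble) CRPS, and with a single patch covering the whole grid it reduces to the ES, which are the two limiting cases mentioned above.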

The VS relies on the aggregation and transformation principles and is able to discriminate the dependence structure. Similarly, the $p$VS is able to discriminate misspecifications of the dependence structure. Being based on more local transformations (i.e., the $p$-variation transformation instead of the variogram transformation), it has a greater discrimination ability than the VS in this experimental setup. In addition to this known application of the aggregation and transformation principles, it has been shown that multivariate transformations can be used to obtain patched scores that, in the case of the ES, lead to an improvement in the signal-to-noise ratio with respect to the original scoring rule.

5.3 Anisotropy

In this example, we focus on the anisotropy of the dependence structure. We introduce geometric anisotropy in observations and forecasts via the covariance function in the following way

\[
\mathrm{cov}(G(s),G(s^{\prime}))=\exp\left(-\left(\frac{\lVert s-s^{\prime}\rVert_{A}}{\lambda_{0}}\right)\right)
\]

with $\lVert s-s^{\prime}\rVert_{A}=(s-s^{\prime})^{T}A(s-s^{\prime})$. The matrix $A$ has the following form:

\[
A=\begin{bmatrix}\cos\theta&-\sin\theta\\ \rho\sin\theta&\rho\cos\theta\end{bmatrix}
\]

with $\theta\in[-\pi/2,\pi/2]$ the direction of the anisotropy and $\rho$ the ratio between the axes.
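A geometrically anisotropic covariance matrix can be sketched as follows. This is a minimal sketch under one important assumption: the anisotropic distance is read here as the Euclidean norm of $A(s-s^{\prime})$, the usual convention for geometric anisotropy, which keeps the distance non-negative and symmetric. The function names are hypothetical.

```python
import numpy as np

def anisotropy_matrix(theta, rho):
    """Matrix A combining a rotation by theta and a rescaling of one axis by rho."""
    return np.array([[np.cos(theta), -np.sin(theta)],
                     [rho * np.sin(theta), rho * np.cos(theta)]])

def anisotropic_exponential_cov(coords, theta, rho, lam=3.0):
    """exp(-||A (s - s')|| / lam) for an (n, 2) array of coordinates.

    Assumption: the anisotropic distance is the Euclidean norm of A (s - s').
    """
    A = anisotropy_matrix(theta, rho)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff @ A.T, axis=-1)     # row-wise A (s - s')
    return np.exp(-dist / lam)
```

With $\rho=1$ the matrix $A$ is a rotation, so the distance reduces to the Euclidean one and the covariance is isotropic, whatever the value of $\theta$.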

The observations follow the anisotropic version of the model in Eq. (20), where the covariance function presents the geometric anisotropy introduced above with $\lambda_{0}=3$ (as previously), $\rho_{0}=2$ and $\theta_{0}=\pi/4$. Multiple forecasts are considered that only differ in their prediction of the anisotropy in the model:

  • the ideal forecast has the same distribution as the observations and is used as a reference;

  • the small-angle forecast and the large-angle forecast have a correct ratio $\rho$ but an under- and over-estimation of the angle, respectively (i.e., $\theta_{\mathrm{small}}=0$ and $\theta_{\mathrm{large}}=\pi/2$);

  • the isotropic forecast and the over-anisotropic forecast have a ratio $\rho=1$ and $\rho=3$, respectively, but a correct angle $\theta$.

Figure 3: Expectation of interpretable proper scoring rules focused on the dependence structure: (a) the variogram score and (b) the anisotropic score for different scales $h$ and aggregated across scales ($w_{h}=1/h$), for the ideal forecast (violet), the small-angle forecast (lighter blue), the large-angle forecast (darker blue), the isotropic forecast (lighter orange) and the over-anisotropic forecast (darker orange). More details are available in the main text.

Since these forecasts differ only in the anisotropy of their dependence structure, scoring rules not suited to discriminate the dependence structure would not be able to differentiate them. We compare two proper scoring rules: the variogram score and the anisotropic scoring rule. The variogram score is studied in two different settings: one where the weights are proportional to the inverse of the distance and one where the weights are proportional to the inverse of the anisotropic distance $\lVert\cdot\rVert_{A}$, which is supposed to be more informed since it is the quantity present in the covariance function. The anisotropic score (AS) is a scoring rule based on the framework introduced in Section 3 and, in its general form, it is defined as

\[
\mathrm{AS}(F,\bm{y})=\sum_{h}w_{h}\,\mathrm{S}_{T_{\mathrm{iso},h}}(F,\bm{y})=\sum_{h}w_{h}\,\mathrm{S}(T_{\mathrm{iso},h}(F),T_{\mathrm{iso},h}(\bm{y})), \tag{21}
\]

where $T_{\mathrm{iso},h}$ is a transformation summarizing the anisotropy of a field such as the one introduced in (19). Additionally, we use a special case of this scoring rule where we do not aggregate across the scales $h$ and where $\mathrm{S}$ is the squared error:

\[
\mathrm{S}_{T_{\mathrm{iso},h}}(F,\bm{y})=\mathrm{SE}(T_{\mathrm{iso},h}(F),T_{\mathrm{iso},h}(\bm{y}))=\left(\mathbb{E}_{T_{\mathrm{iso},h}(F)}[X]-T_{\mathrm{iso},h}(\bm{y})\right)^{2}. \tag{22}
\]

We use a transformation similar to the one of (19) where instead the axes are the first and second bisectors. This leads to the following formula:

\[
T_{\mathrm{iso},h}(\bm{X})=-\cfrac{\big(\gamma_{X}((h,h))-\gamma_{X}((-h,h))\big)^{2}}{\cfrac{2\gamma_{X}((h,h))^{2}}{|\mathcal{D}((h,h))|}+\cfrac{2\gamma_{X}((-h,h))^{2}}{|\mathcal{D}((-h,h))|}}.
\]

The choice of this transformation instead of the transformation based on the anisotropy along the abscissa and ordinate is motivated by the fact that it leads to a clearer differentiation of the forecasts (not shown).
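The transformation along the two bisectors and the resulting score (22) can be sketched as below. The sketch assumes that $\gamma_{X}(\bm{h})$ is the empirical variogram $\frac{1}{2|\mathcal{D}(\bm{h})|}\sum_{\bm{s}\in\mathcal{D}(\bm{h})}(X_{\bm{s}+\bm{h}}-X_{\bm{s}})^{2}$, with $\mathcal{D}(\bm{h})$ the set of grid points $\bm{s}$ such that $\bm{s}+\bm{h}$ is also on the grid, and that the expectation in (22) is approximated by an ensemble average. The function names are hypothetical and the degenerate case where both variograms vanish is not handled.

```python
import numpy as np

def empirical_variogram(field, lag):
    """Return (gamma_X(lag), |D(lag)|) for a 2-D field and an integer lag (hx, hy)."""
    nx, ny = field.shape
    hx, hy = lag
    # Index ranges of the points s such that both s and s + lag lie on the grid.
    i0, i1 = max(0, -hx), min(nx, nx - hx)
    j0, j1 = max(0, -hy), min(ny, ny - hy)
    base = field[i0:i1, j0:j1]
    shifted = field[i0 + hx:i1 + hx, j0 + hy:j1 + hy]
    return 0.5 * np.mean((shifted - base) ** 2), base.size

def t_iso(field, h):
    """Anisotropy summary comparing the variograms along the two bisectors."""
    g1, n1 = empirical_variogram(field, (h, h))
    g2, n2 = empirical_variogram(field, (-h, h))
    return -((g1 - g2) ** 2) / (2.0 * g1**2 / n1 + 2.0 * g2**2 / n2)

def anisotropic_score(ensemble, obs, h):
    """Squared error between the ensemble mean of T_iso,h and its observed value."""
    expected = np.mean([t_iso(member, h) for member in ensemble])
    return float((expected - t_iso(obs, h)) ** 2)
```

The transformation is non-positive by construction and equals zero when the two directional variograms coincide.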

Figure 3(a) presents the variogram score of order $p=0.5$ in its two variants. Both the standard VS and the informed VS are able to significantly discriminate the ideal forecast from the other forecasts, but they have a weak sensitivity to misspecification of the geometric anisotropy. Even though the informed VS is supposed to increase the signal-to-noise ratio compared to the standard VS, it is not more sensitive to misspecifications in the experimental setup considered. Other orders of variograms were tested and worsened the discrimination ability of both the standard and the informed VS (not shown).

Figure 3(b) shows the AS (22) with scales $1\leq h\leq 5$ for the different forecasts and the aggregated AS (21), where the scales are aggregated with weights $w_{h}=1/h$. The anisotropic scores were computed using samples drawn from the forecasts; this explains why the ideal forecast may appear sub-efficient for some values of $h$ (e.g., $h=4$). As intended by its construction, the AS is able to significantly distinguish the correct anisotropy behavior in the dependence structure for values of $h$ up to $h=3$ included. For $h=4$, the AS does not significantly discriminate the isotropic forecast and the over-anisotropic forecast from the ideal one. The fact that $h=1$ is the most sensitive to misspecifications is probably caused by the fact that the strength of the dependence structure decays with the distance (i.e., with $h$). This shows that the hyperparameter $h$ plays an important role in having an informative AS (as do the weights and the order $p$ for the variogram score). For $h=2$ in particular, it can be seen that the AS is not sensitive to the misspecification of the ratio $\rho$ and the angle $\theta$ in the same manner. This depends on the degree of misspecification but also on the hyperparameters of the AS. The aggregated AS allows us to avoid the selection of a scale $h$ while maintaining the discrimination ability of the lower values of $h$.

The anisotropic score is an interpretable scoring rule targeting the anisotropy of the dependence structure. However, it has the limitation of introducing hyperparameters in the form of the scale $h$ and the axes along which the anisotropy is measured. Aggregation across scales and axes can circumvent the selection of these hyperparameters; however, a clever choice of weights will be required to maintain the signal-to-noise ratio.

5.4 Double-penalty effect

In this example, we illustrate in a simple setting how scoring rules over patches can be robust to the double-penalty effect (see Section 2.4). The double-penalty effect is introduced in the form of forecasts that deviate from the ideal forecast by an additive or multiplicative noise term (i.e., a nugget effect). The noise terms are centered uniform random variables, such that the forecasts are correct on average but incorrect at each grid point.

Observations follow the model of (20) with the parameters $\sigma_{0}=1$, $\lambda_{0}=3$ and $\beta_{0}=1$. As usual, the ideal forecast, which has the same distribution as the observations, is used as a reference. Additive-noised forecasts are the first type of forecast introduced to test the sensitivity of scoring rules to the form of the double-penalty effect presented above. They differ from the ideal forecast through their marginals in the following way:

\[
F_{\mathrm{add}}(s)=\mathcal{N}(\epsilon_{s},\sigma_{0}^{2}),
\]

where $\epsilon_{s}\sim\mathrm{Unif}([-r,r])$ is a spatial white noise independent at each location $s\in\mathcal{D}$. This has an effect on the mean of the marginals at each grid point. Three different noise range values are tested: $r\in\{0.1,0.25,0.5\}$. Similarly, multiplicative-noised forecasts that differ from the ideal forecast through their marginals are introduced:

\[
F_{\mathrm{mul}}(s)=\mathcal{N}(0,\sigma^{2}(1+\eta_{s})^{2}),
\]

where $\eta_{s}\sim\mathrm{Unif}([-r,r])$ and $s\in\mathcal{D}$. This has an effect on the variance of the marginals at each grid point and, thus, on the covariance. The same noise range values are tested: $r\in\{0.1,0.25,0.5\}$.
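One way to obtain ensemble members with these marginals is to perturb samples of the ideal forecast, as sketched below. This construction is an assumption: a single noise field is drawn per forecast and shared by all members, the input samples are assumed to come from the ideal forecast, and the function names are hypothetical.

```python
import numpy as np

def additive_noised(ideal_samples, r, rng):
    """Shift every grid point s by eps_s ~ Unif([-r, r]); marginals become N(eps_s, sigma_0^2).

    `ideal_samples` has shape (m, nx, ny); one noise field is shared by all members.
    """
    eps = rng.uniform(-r, r, size=ideal_samples.shape[-2:])
    return ideal_samples + eps, eps

def multiplicative_noised(ideal_samples, r, rng):
    """Rescale every grid point s by (1 + eta_s) with eta_s ~ Unif([-r, r]).

    Zero-mean marginals with standard deviation sigma become N(0, sigma^2 (1 + eta_s)^2).
    """
    eta = rng.uniform(-r, r, size=ideal_samples.shape[-2:])
    return ideal_samples * (1.0 + eta), eta
```

Because the noise is centered, averaging either perturbation over a large enough patch largely cancels it, which is the mechanism that scores computed over patches exploit.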

The aggregated CRPS is a naive scoring rule that is sensitive to the double-penalty effect. We propose the aggregated CRPS of spatial mean which is defined as

\begin{align*}
\mathrm{CRPS}_{\mathrm{mean}_{\mathcal{P}},\bm{w}_{\mathcal{P}}}(F,\bm{y}) &=\sum_{P\in\mathcal{P}}w_{P}\,\mathrm{CRPS}_{\mathrm{mean}_{P}}(F,\bm{y})\\
&=\sum_{P\in\mathcal{P}}w_{P}\,\mathrm{CRPS}(\mathrm{mean}_{P}(F),\mathrm{mean}_{P}(\bm{y})),
\end{align*}

where $\mathcal{P}$ is an ensemble of spatial patches, $w_{P}$ is the weight associated with a patch $P\in\mathcal{P}$ and $\mathrm{mean}_{P}$ the spatial mean over the patch $P$ (17). It is a proper scoring rule, and it has an interpretation similar to the aggregated CRPS, but the forecasts are only evaluated on the performance of their spatial mean. In order to make the scoring more interpretable, only square patches of a given size $s$ are considered and the weights $w_{P}$ are uniform. The size of the patches $s$ can be determined by multiple factors such as the physics of the problem, the constraints of the verification in the case of models on different scales, or hypotheses on a different behavior below and above the scale of the patch (e.g., independent and identically distributed; Taillardat and Mestre 2020). Note that the aggregated CRPS of spatial mean is equal to the aggregated CRPS when patches of size $s=1$ are considered.
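The definition above can be sketched for an ensemble forecast as follows. This is an illustration under stated assumptions, not the authors' implementation: the patches are taken to be non-overlapping $s\times s$ blocks, the uniform weights are normalized to sum to one, and the CRPS is estimated with the standard ensemble formula $\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X'|$, with expectations taken as averages over members. All function names are hypothetical.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of a 1-D ensemble: mean|X - y| - 0.5 * mean|X - X'|."""
    members = np.asarray(members, dtype=float)
    term_obs = np.mean(np.abs(members - obs))
    term_spread = np.mean(np.abs(members[:, None] - members[None, :]))
    return term_obs - 0.5 * term_spread

def patch_means(field, s):
    """Means over non-overlapping s x s patches; the last two axes are the grid."""
    *lead, h, w = field.shape
    if h % s or w % s:
        raise ValueError("grid dimensions must be multiples of the patch size")
    blocks = field.reshape(*lead, h // s, s, w // s, s)
    return blocks.mean(axis=(-3, -1))

def aggregated_crps_spatial_mean(ensemble, obs, s):
    """ensemble: (n_members, H, W); obs: (H, W); uniform weights over patches."""
    ens_means = patch_means(ensemble, s)            # (n_members, H/s, W/s)
    obs_means = patch_means(obs, s)                 # (H/s, W/s)
    ens_flat = ens_means.reshape(ens_means.shape[0], -1)
    obs_flat = obs_means.ravel()
    scores = [crps_ensemble(ens_flat[:, k], obs_flat[k])
              for k in range(obs_flat.size)]
    return float(np.mean(scores))

# A one-member forecast that misplaces a feature by one grid point.
forecast = np.array([[[0.0, 1.0], [0.0, 0.0]]])
observation = np.array([[1.0, 0.0], [0.0, 0.0]])
print(aggregated_crps_spatial_mean(forecast, observation, 1))
print(aggregated_crps_spatial_mean(forecast, observation, 2))
```

In this toy case the displaced feature is penalized twice with $s=1$ (score 0.5), whereas with $s=2$ the patch means coincide and the score is 0.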

Figure 4: Expectation of scoring rules tested on their sensitivity to the double-penalty effect: (a) the aggregated CRPS and the aggregated CRPS of spatial mean, and (b) the aggregated Brier score and the aggregated squared error of fraction of threshold exceedances, for the ideal forecast (violet), the additive-noised forecasts (shades of blue), and the multiplicative-noised forecasts (shades of orange). For the noised forecasts, darker colors correspond to larger values of the range $r\in\{0.1, 0.25, 0.5\}$. More details are available in the main text.

If a quantity of interest is the exceedance of a threshold $t$, the associated scoring rule is the Brier score (5). We compare the aggregated BS with its counterpart over patches: the aggregated SE of the FTE. It is defined as

$$\begin{aligned}
\mathrm{SE}_{\mathrm{FTE}_{\mathcal{P},t},\bm{w_{\mathcal{P}}}}(F,\bm{y}) &= \sum_{P\in\mathcal{P}} w_{P}\,\mathrm{SE}_{\mathrm{FTE}_{P,t}}(F,\bm{y}) \\
&= \sum_{P\in\mathcal{P}} w_{P}\,\mathrm{SE}\big(\mathrm{FTE}_{P,t}(F),\mathrm{FTE}_{P,t}(\bm{y})\big) \\
&= \sum_{P\in\mathcal{P}} w_{P}\,\big(\mathbb{E}_{F}[\mathrm{FTE}_{P,t}(X)]-\mathrm{FTE}_{P,t}(\bm{y})\big)^{2},
\end{aligned}$$

where $\mathcal{P}$ is an ensemble of spatial patches, $w_{P}$ is the weight associated with a patch $P\in\mathcal{P}$ and $\mathrm{FTE}_{P,t}$ the fraction of threshold exceedance over the patch $P$ and for the threshold $t$ (18). This scoring rule is proper and focuses on the prediction of the exceedance of a threshold $t$ via the fraction of locations over a patch $P$ exceeding said threshold. The resemblance with the Brier score is clear and the aggregated SE of FTE becomes the aggregated BS when patches of size $s=1$ are considered.
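A sampling-based estimate of the aggregated SE of FTE can be sketched as below. The assumptions are the same as for any illustration of this kind and are not taken from the paper: non-overlapping $s\times s$ patches, uniform weights normalized to sum to one, a strict exceedance ($>t$), and the expectation $\mathbb{E}_{F}[\mathrm{FTE}_{P,t}(X)]$ replaced by an average over samples drawn from $F$. The function names are hypothetical.

```python
import numpy as np

def fte(field, s, t):
    """Fraction of grid points exceeding t in each non-overlapping s x s patch."""
    exceed = (np.asarray(field) > t).astype(float)
    *lead, h, w = exceed.shape
    if h % s or w % s:
        raise ValueError("grid dimensions must be multiples of the patch size")
    blocks = exceed.reshape(*lead, h // s, s, w // s, s)
    return blocks.mean(axis=(-3, -1))

def aggregated_se_fte(samples, obs, s, t):
    """samples: (n_samples, H, W) drawn from the forecast; obs: (H, W)."""
    # Monte Carlo estimate of E_F[FTE_{P,t}(X)] for every patch P.
    forecast_fte = fte(samples, s, t).mean(axis=0)
    obs_fte = fte(obs, s, t)
    # Uniform weights: average the squared errors over patches.
    return float(np.mean((forecast_fte - obs_fte) ** 2))

# A one-sample forecast that misplaces an exceedance by one grid point.
forecast = np.array([[[0.0, 1.0], [0.0, 0.0]]])
observation = np.array([[1.0, 0.0], [0.0, 0.0]])
print(aggregated_se_fte(forecast, observation, s=1, t=0.5))
print(aggregated_se_fte(forecast, observation, s=2, t=0.5))
```

With $s=1$ the per-patch fraction is the estimated exceedance probability at a single grid point, so the quantity is an estimate of the aggregated Brier score (0.5 in this toy case). With $s=2$ the forecast and observed fractions agree and the score is 0.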

In Figure 4, the values of the aggregated SE of FTE have been obtained by sampling the forecasts’ distribution. Figure 4(a) compares the aggregated CRPS and the aggregated CRPS of spatial mean for different patch sizes $s$. For all the scoring rules, the expected value increases with the range $r$ of the noise. As expected, the aggregated CRPS is very sensitive to noise in the mean or the variance and, thus, is prone to the double-penalty effect. The aggregated CRPS of spatial mean is less sensitive to noise on the mean or the variance. Moreover, different patch sizes allow us to select the spatial scale below which we want to avoid a double penalty. Given that the distribution of the noise is fixed in this simulation (i.e., uniform), patch size is related to the level of random fluctuations (i.e., the range $r$) tolerated by the scoring rule before significant discrimination with respect to the ideal forecast. It is worth noting that the range $r$ of the noise leads to a stronger increase in the values of these CRPS-related scoring rules when the noise is on the mean rather than on the variance.

Figure 4(b) compares the aggregated BS and the aggregated squared error of fraction of threshold exceedances. For simplicity, we fix the threshold $t=1$. The aggregated BS is, as expected, sensitive to noise in the mean or the variance, and an increase in the range of the noise leads to an increase in the expected value of the score. The aggregated SE of FTE acts as a natural extension of the aggregated BS to patches and provides scoring rules that are less sensitive to noise on the mean or the variance. The sensitivity evolves differently with the increase of the patch size $s$ compared to the aggregated CRPS of spatial mean since the aggregated SE of FTE measures the effect on the average exceedance over a patch. The range $r$ of the noise apparently leads to a comparable increase in the values of the aggregated SE of FTE when the noise is additive or multiplicative.

The use of transformations over patches is similar to neighborhood-based methods in the spatial verification tools framework. Even though avoiding the double-penalty effect is not restricted to tools that do not penalize forecasts below a certain scale, this simulation setup presents a type of test relevant to probabilistic forecasts. The patch-based scoring rules proposed here are not by themselves a sufficient verification tool since they are insensitive to some unrealistic forecasts (e.g., if the mean value over a patch is correct, deviations within the patch may be arbitrarily large and still leave the value of the scoring rule unchanged). As for any other scoring rule, they should be used alongside other scoring rules.
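The insensitivity mentioned above can be illustrated with a toy example. It assumes deterministic forecasts, for which the CRPS reduces to the absolute error, and a single $2\times 2$ patch covering the whole grid; the fields and function names are made up for illustration.

```python
import numpy as np

observation = np.ones((2, 2))
realistic = np.ones((2, 2))
# Same spatial mean as the observation, but with very large local deviations.
unrealistic = np.array([[101.0, -99.0], [-99.0, 101.0]])

def pointwise_error(forecast, obs):
    """Aggregated absolute error over grid points (patch size s = 1)."""
    return float(np.mean(np.abs(forecast - obs)))

def patch_mean_error(forecast, obs):
    """Absolute error of the spatial mean over one patch covering the grid."""
    return float(abs(forecast.mean() - obs.mean()))

for name, field in (("realistic", realistic), ("unrealistic", unrealistic)):
    print(name, pointwise_error(field, observation),
          patch_mean_error(field, observation))
```

Both forecasts obtain a patch-mean error of 0, while the pointwise error separates them (0 against 100), which is why a patch-based score is best reported together with other scoring rules.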

6 Conclusion

Verification of probabilistic forecasts is an essential but complex step of all forecasting procedures. Scoring rules may appear as the perfect tool to compare forecast performance since, when proper, they can simultaneously assess calibration and sharpness. However, propriety, even if strict, does not ensure that a scoring rule is relevant to the problem at hand. With that in mind, we agree with the recommendation of Scheuerer and Hamill (2015) that "several different scores be always considered before drawing conclusions". This is even more important in a multivariate setting where forecasts are characterized by more complex objects.

We proposed a framework to construct proper scoring rules in a multivariate setting using aggregation and transformation principles. Aggregation-and-transformation-based scoring rules can improve the conclusions drawn since they enable the verification of specific aspects of the forecast (e.g., anisotropy of the dependence structure). This has been illustrated both using examples from the literature and numerical experiments showcasing different settings. Moreover, we showed that the aggregation and transformation principles can be used to construct scoring rules that are proper, interpretable, and not affected by the double-penalty effect. This could be a starting point to help bridge the gap between the proper scoring rule community and the spatial verification tools community.

As interest in machine learning-based weather forecasting is increasing (see, e.g., Ben Bouallègue et al. 2024a), multiple approaches have reached performance comparable to ECMWF deterministic high-resolution forecasts (Keisler, 2022; Pathak et al., 2022; Bi et al., 2023; Lam et al., 2022; Chen et al., 2023). The natural extension to probabilistic forecasts is already developing and is enabled by publicly available benchmark datasets such as WeatherBench 2 (Rasp et al., 2024). Aggregation-and-transformation-based methods can help ensure that parameter inference does not hedge certain important aspects of the multivariate probabilistic forecasts.

There seems to be a trade-off between discrimination ability and strict propriety. Discrimination ability comes from the ability of scoring rules to differentiate misspecification of certain characteristics. By definition, the expectation of strictly proper scoring rules is minimized when the probabilistic forecast is the true distribution. Nonetheless, it does not guarantee that this global minimum is steep in any misspecification direction. However, interpretable scoring rules can discriminate the misspecification of their target characteristic. Should scoring rules discriminating any misspecification be pursued? Or should interpretable scoring rules discriminating a specific type of misspecification be used instead?

Acknowledgments

The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-20-CE40-0025-01 (T-REX project) and the Energy-oriented Centre of Excellence II (EoCoE-II), Grant Agreement 824158, funded within the Horizon2020 framework of the European Union. Part of this work was also supported by the ExtremesLearning grant from 80 PRIME CNRS-INSU and this study has received funding from Agence Nationale de la Recherche - France 2030 as part of the PEPR TRACCS program under grant number ANR-22-EXTR-0005 and the ANR EXSTA.
Sam Allen is thanked for fruitful discussions during the preparation of this manuscript.

References

  • Agnolucci et al. (2020) Paolo Agnolucci, Chrysanthi Rapti, Peter Alexander, Vincenzo De Lipsis, Robert A. Holland, Felix Eigenbrod, and Paul Ekins. Impacts of rising temperatures and farm management practices on global yields of 18 crops. Nature Food, 1(9):562–571, September 2020. ISSN 2662-1355. https://doi.org/10.1038/s43016-020-00148-x.
  • Al Masry et al. (2023) Zeina Al Masry, Romain Pic, Clément Dombry, and Christine Devalland. A new methodology to predict the oncotype scores based on clinico-pathological data with similar tumor profiles. Breast Cancer Research and Treatment, 2023. ISSN 1573-7217. https://doi.org/10.1007/s10549-023-07141-5.
  • Alexander et al. (2022) Carol Alexander, Michael Coulon, Y. Han, and Xiaochun Meng. Evaluating the discrimination ability of proper multi-variate scoring rules. Annals of Operations Research, March 2022. ISSN 1572-9338. https://doi.org/10.1007/s10479-022-04611-9.
  • Allen et al. (2023a) Sam Allen, Jonas Bhend, Olivia Martius, and Johanna Ziegel. Weighted verification tools to evaluate univariate and multivariate probabilistic forecasts for high-impact weather events. Weather and Forecasting, 38(3):499–516, March 2023a. ISSN 1520-0434. https://doi.org/10.1175/waf-d-22-0161.1.
  • Allen et al. (2023b) Sam Allen, David Ginsbourger, and Johanna Ziegel. Evaluating forecasts for high-impact events using transformed kernel scores. SIAM/ASA Journal on Uncertainty Quantification, 11(3):906–940, August 2023b. ISSN 2166-2525. https://doi.org/10.1137/22m1532184.
  • Allen et al. (2024) Sam Allen, Johanna Ziegel, and David Ginsbourger. Assessing the calibration of multivariate probabilistic forecasts. Quarterly Journal of the Royal Meteorological Society, 150(760):1315–1335, February 2024. ISSN 1477-870X. https://doi.org/10.1002/qj.4647.
  • Anderson (1996) Jeffrey L. Anderson. A method for producing and evaluating probabilistic forecasts from ensemble model integrations. Journal of Climate, 9(7):1518–1530, July 1996. ISSN 1520-0442. https://doi.org/10.1175/1520-0442(1996)009<1518:amfpae>2.0.co;2.
  • Basse-O’Connor et al. (2021) Andreas Basse-O’Connor, Vytautė Pilipauskaitė, and Mark Podolskij. Power variations for fractional type infinitely divisible random fields. Electronic Journal of Probability, 26:1–35, 2021. https://doi.org/10.1214/21-EJP617.
  • Ben Bouallègue et al. (2024a) Zied Ben Bouallègue, Mariana C. A. Clare, Linus Magnusson, Estibaliz Gascón, Michael Maier-Gerber, Martin Janoušek, Mark Rodwell, Florian Pinault, Jesper S. Dramsch, Simon T. K. Lang, Baudouin Raoult, Florence Rabier, Matthieu Chevallier, Irina Sandu, Peter Dueben, Matthew Chantry, and Florian Pappenberger. The rise of data-driven weather forecasting: A first statistical assessment of machine learning–based weather forecasts in an operational-like context. Bulletin of the American Meteorological Society, 105(6):E864–E883, June 2024a. ISSN 1520-0477. https://doi.org/10.1175/bams-d-23-0162.1.
  • Ben Bouallègue et al. (2024b) Zied Ben Bouallègue, Jonathan A. Weyn, Mariana C. A. Clare, Jesper Dramsch, Peter Dueben, and Matthew Chantry. Improving medium-range ensemble weather forecasts with hierarchical ensemble transformers. Artificial Intelligence for the Earth Systems, 3(1), January 2024b. ISSN 2769-7525. https://doi.org/10.1175/aies-d-23-0027.1.
  • Benassi et al. (2004) Albert Benassi, Serge Cohen, and Jacques Istas. On roughness indices for fractional fields. Bernoulli, 10(2):357–373, 2004. https://doi.org/10.3150/bj/1082380223.
  • Berlinet and Thomas-Agnan (2004) Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic Publishers, Boston, MA, 2004. ISBN 1-4020-7679-7. https://doi.org/10.1007/978-1-4419-9096-9. With a preface by Persi Diaconis.
  • Bi et al. (2023) Kaifeng Bi, Lingxi Xie, Hengheng Zhang, Xin Chen, Xiaotao Gu, and Qi Tian. Accurate medium-range global weather forecasting with 3d neural networks. Nature, 619(7970):533–538, July 2023. ISSN 1476-4687. https://doi.org/10.1038/s41586-023-06185-3.
  • Bjerregård et al. (2021) Mathias Blicher Bjerregård, Jan Kloppenborg Møller, and Henrik Madsen. An introduction to multivariate probabilistic forecast evaluation. Energy and AI, 4:100058, June 2021. ISSN 2666-5468. https://doi.org/10.1016/j.egyai.2021.100058.
  • Bolin and Wallin (2023) David Bolin and Jonas Wallin. Local scale invariance and robustness of proper scoring rules. Statistical Science, 38(1), February 2023. https://doi.org/10.1214/22-sts864.
  • Bosse et al. (2023) Nikos I. Bosse, Sam Abbott, Anne Cori, Edwin van Leeuwen, Johannes Bracher, and Sebastian Funk. Scoring epidemiological forecasts on transformed scales. PLOS Computational Biology, 19(8):e1011393, August 2023. ISSN 1553-7358. https://doi.org/10.1371/journal.pcbi.1011393.
  • Brehmer (2017) Jonas Brehmer. Elicitability and its application in risk management. July 2017. https://doi.org/10.48550/ARXIV.1707.09604.
  • Brehmer and Strokorb (2019) Jonas R. Brehmer and Kirstin Strokorb. Why scoring functions cannot assess tail properties. Electronic Journal of Statistics, 13(2), January 2019. ISSN 1935-7524. https://doi.org/10.1214/19-ejs1622.
  • Bremnes (2019) John Bjørnar Bremnes. Ensemble postprocessing using quantile function regression based on neural networks and bernstein polynomials. Monthly Weather Review, 148(1):403–414, December 2019. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-19-0227.1.
  • Brier (1950) Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950. ISSN 1520-0493. https://doi.org/10.1175/1520-0493(1950)078<0001:vofeit>2.0.co;2.
  • Bröcker (2009) Jochen Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society, 135(643):1512–1519, July 2009. ISSN 1477-870X. https://doi.org/10.1002/qj.456.
  • Bröcker and Ben Bouallègue (2020) Jochen Bröcker and Zied Ben Bouallègue. Stratified rank histograms for ensemble forecast verification under serial dependence. Quarterly Journal of the Royal Meteorological Society, 146(729):1976–1990, April 2020. ISSN 1477-870X. https://doi.org/10.1002/qj.3778.
  • Bröcker and Smith (2007) Jochen Bröcker and Leonard A. Smith. Scoring probabilistic forecasts: The importance of being proper. Weather and Forecasting, 22(2):382–388, April 2007. ISSN 0882-8156. https://doi.org/10.1175/waf966.1.
  • Buschow (2022) Sebastian Buschow. Measuring displacement errors with complex wavelets. Weather and Forecasting, 37(6):953–970, June 2022. ISSN 1520-0434. https://doi.org/10.1175/waf-d-21-0180.1.
  • Buschow and Friederichs (2020) Sebastian Buschow and Petra Friederichs. Using wavelets to verify the scale structure of precipitation forecasts. Advances in Statistical Climatology, Meteorology and Oceanography, 6(1):13–30, March 2020. ISSN 2364-3587. https://doi.org/10.5194/ascmo-6-13-2020.
  • Buschow and Friederichs (2021) Sebastian Buschow and Petra Friederichs. Sad: Verifying the scale, anisotropy and direction of precipitation forecasts. Quarterly Journal of the Royal Meteorological Society, 147(735):1150–1169, January 2021. ISSN 1477-870X. https://doi.org/10.1002/qj.3964.
  • Buschow et al. (2019) Sebastian Buschow, Jakiw Pidstrigach, and Petra Friederichs. Assessment of wavelet-based spatial verification by means of a stochastic precipitation model (wv_verif v0.1.0). Geoscientific Model Development, 12(8):3401–3418, August 2019. ISSN 1991-9603. https://doi.org/10.5194/gmd-12-3401-2019.
  • Casati et al. (2022) Barbara Casati, Manfred Dorninger, Caio A. S. Coelho, Elizabeth E. Ebert, Chiara Marsigli, Marion P. Mittermaier, and Eric Gilleland. The 2020 international verification methods workshop online: Major outcomes and way forward. Bulletin of the American Meteorological Society, 103(3):E899–E910, March 2022. ISSN 1520-0477. https://doi.org/10.1175/bams-d-21-0126.1.
  • Chapman et al. (2022) William E. Chapman, Luca Delle Monache, Stefano Alessandrini, Aneesh C. Subramanian, F. Martin Ralph, Shang-Ping Xie, Sebastian Lerch, and Negin Hayatbini. Probabilistic predictions from deterministic atmospheric river forecasts with deep learning. Monthly Weather Review, 150(1):215–234, January 2022. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-21-0106.1.
  • Chen et al. (2023) Kang Chen, Tao Han, Junchao Gong, Lei Bai, Fenghua Ling, Jing-Jia Luo, Xi Chen, Leiming Ma, Tianning Zhang, Rui Su, Yuanzheng Ci, Bin Li, Xiaokang Yang, and Wanli Ouyang. Fengwu: Pushing the skillful global medium-range weather forecast beyond 10 days lead. April 2023. https://doi.org/10.48550/ARXIV.2304.02948.
  • Christensen et al. (2014) H. M. Christensen, I. M. Moroz, and T. N. Palmer. Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. Quarterly Journal of the Royal Meteorological Society, 141(687):538–549, May 2014. ISSN 1477-870X. https://doi.org/10.1002/qj.2375.
  • Dawid (1984) A. P. Dawid. Present position and potential developments: Some personal views: Statistical theory: The prequential approach. Journal of the Royal Statistical Society. Series A (General), 147(2):278, 1984. ISSN 0035-9238. https://doi.org/10.2307/2981683.
  • Dawid and Sebastiani (1999) A. Philip Dawid and Paola Sebastiani. Coherent dispersion criteria for optimal experimental design. The Annals of Statistics, 27(1), March 1999. ISSN 0090-5364. https://doi.org/10.1214/aos/1018031101.
  • Dawid et al. (2015) A. Philip Dawid, Monica Musio, and Laura Ventura. Minimum scoring rule inference. Scandinavian Journal of Statistics, 43(1):123–138, August 2015. ISSN 1467-9469. https://doi.org/10.1111/sjos.12168.
  • Dawid and Musio (2014) Alexander Philip Dawid and Monica Musio. Theory and applications of proper scoring rules. METRON, 72(2):169–183, April 2014. ISSN 2281-695X. https://doi.org/10.1007/s40300-014-0039-y.
  • Delle Monache et al. (2013) Luca Delle Monache, F. Anthony Eckel, Daran L. Rife, Badrinath Nagarajan, and Keith Searight. Probabilistic weather prediction with an analog ensemble. Monthly Weather Review, 141(10):3498–3516, September 2013. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-12-00281.1.
  • Diebold and Mariano (1995) Francis X. Diebold and Roberto S. Mariano. Comparing predictive accuracy. Journal of Business & Economic Statistics, 13(3):253–263, July 1995. ISSN 1537-2707. https://doi.org/10.1080/07350015.1995.10524599.
  • Dorninger et al. (2018) Manfred Dorninger, Eric Gilleland, Barbara Casati, Marion P. Mittermaier, Elizabeth E. Ebert, Barbara G. Brown, and Laurence J. Wilson. The setup of the mesovict project. Bulletin of the American Meteorological Society, 99(9):1887–1906, September 2018. ISSN 1520-0477. https://doi.org/10.1175/bams-d-17-0164.1.
  • Ebert (2008) Elizabeth E. Ebert. Fuzzy verification of high-resolution gridded forecasts: a review and proposed framework. Meteorological Applications, 15(1):51–64, 2008. https://doi.org/10.1002/met.25.
  • Ehm and Gneiting (2012) Werner Ehm and Tilmann Gneiting. Local proper scoring rules of order two. The Annals of Statistics, 40(1), February 2012. ISSN 0090-5364. https://doi.org/10.1214/12-aos973.
  • Ferro et al. (2008) Christopher A. T. Ferro, David S. Richardson, and Andreas P. Weigel. On the effect of ensemble size on the discrete and continuous ranked probability scores. Meteorological Applications, 15(1):19–24, March 2008. ISSN 1469-8080. https://doi.org/10.1002/met.45.
  • Friederichs and Hense (2008) Petra Friederichs and Andreas Hense. A probabilistic forecast approach for daily precipitation totals. Weather and Forecasting, 23(4):659–673, August 2008. ISSN 0882-8156. https://doi.org/10.1175/2007waf2007051.1.
  • Gilleland (2011) Eric Gilleland. Spatial forecast verification: Baddeley’s delta metric applied to the icp test cases. Weather and Forecasting, 26(3):409–415, June 2011. ISSN 1520-0434. https://doi.org/10.1175/waf-d-10-05061.1.
  • Gilleland et al. (2009) Eric Gilleland, David Ahijevych, Barbara G. Brown, Barbara Casati, and Elizabeth E. Ebert. Intercomparison of spatial forecast verification methods. Weather and Forecasting, 24(5):1416–1430, October 2009. ISSN 0882-8156. https://doi.org/10.1175/2009waf2222269.1.
  • Gneiting (2011) Tilmann Gneiting. Making and evaluating point forecasts. Journal of the American Statistical Association, 106(494):746–762, June 2011. ISSN 1537-274X. https://doi.org/10.1198/jasa.2011.r10138.
  • Gneiting and Katzfuss (2014) Tilmann Gneiting and Matthias Katzfuss. Probabilistic forecasting. Annual Review of Statistics and Its Application, 1(1):125–151, January 2014. ISSN 2326-831X. https://doi.org/10.1146/annurev-statistics-062713-085831.
  • Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, March 2007. ISSN 1537-274X. https://doi.org/10.1198/016214506000001437.
  • Gneiting et al. (2005) Tilmann Gneiting, Adrian E. Raftery, Anton H. Westveld, and Tom Goldman. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review, 133(5):1098–1118, May 2005. ISSN 0027-0644. https://doi.org/10.1175/mwr2904.1.
  • Gneiting et al. (2007) Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E. Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society Series B: Statistical Methodology, 69(2):243–268, March 2007. ISSN 1467-9868. https://doi.org/10.1111/j.1467-9868.2007.00587.x.
  • Gneiting et al. (2008) Tilmann Gneiting, Larissa I. Stanberry, Eric P. Grimit, Leonhard Held, and Nicholas A. Johnson. Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds. TEST, 17(2):211–235, July 2008. ISSN 1863-8260. https://doi.org/10.1007/s11749-008-0114-x.
  • Gneiting et al. (2023) Tilmann Gneiting, Sebastian Lerch, and Benedikt Schulz. Probabilistic solar forecasting: Benchmarks, post-processing, verification. Solar Energy, 252:72–80, March 2023. ISSN 0038-092X. https://doi.org/10.1016/j.solener.2022.12.054.
  • Good (1952) I. J. Good. Rational decisions. Journal of the Royal Statistical Society: Series B (Methodological), 14(1):107–114, January 1952. ISSN 2517-6161. https://doi.org/10.1111/j.2517-6161.1952.tb00104.x.
  • Han and Szunyogh (2018) Fan Han and Istvan Szunyogh. A technique for the verification of precipitation forecasts and its application to a problem of predictability. Monthly Weather Review, 146(5):1303–1318, April 2018. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-17-0040.1.
  • Heinrich-Mertsching et al. (2021) Claudio Heinrich-Mertsching, Thordis L. Thorarinsdottir, Peter Guttorp, and Max Schneider. Validation of point process predictions with proper scoring rules. October 2021.
  • Hersbach (2000) Hans Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction systems. Weather and Forecasting, 15(5):559–570, October 2000. ISSN 1520-0434. https://doi.org/10.1175/1520-0434(2000)015<0559:dotcrp>2.0.co;2.
  • Holzmann and Eulert (2014) Hajo Holzmann and Matthias Eulert. The role of the information set for forecasting—with applications to risk management. The Annals of Applied Statistics, 8(1), March 2014. ISSN 1932-6157. https://doi.org/10.1214/13-aoas709.
  • Hu et al. (2023) Weiming Hu, Mohammadvaghef Ghazvinian, William E. Chapman, Agniv Sengupta, Fred Martin Ralph, and Luca Delle Monache. Deep learning forecast uncertainty for precipitation over the western united states. Monthly Weather Review, 151(6):1367–1385, June 2023. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-22-0268.1.
  • Hyvärinen (2005) Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005. URL http://jmlr.org/papers/v6/hyvarinen05a.html.
  • Jolliffe and Primo (2008) Ian T. Jolliffe and Cristina Primo. Evaluating rank histograms using decompositions of the chi-square test statistic. Monthly Weather Review, 136(6):2133–2139, June 2008. ISSN 0027-0644. https://doi.org/10.1175/2007mwr2219.1.
  • Jordan et al. (2019) Alexander Jordan, Fabian Krüger, and Sebastian Lerch. Evaluating probabilistic forecasts with scoringrules. Journal of Statistical Software, 90(12), 2019. ISSN 1548-7660. https://doi.org/10.18637/jss.v090.i12.
  • Jordan et al. (2011) Thomas H. Jordan, Yun-Tai Chen, Paolo Gasparini, Raul Madariaga, Ian Main, Warner Marzocchi, Gerassimos Papadopoulos, Gennady Sobolev, Koshun Yamaoka, and Jochen Zschau. Operational earthquake forecasting. state of knowledge and guidelines for utilization. Annals of Geophysics, 54(4), August 2011. ISSN 2037-416X. https://doi.org/10.4401/ag-5350.
  • Jose (2007) Victor Richmond Jose. A characterization for the spherical scoring rule. Theory and Decision, 66(3):263–281, July 2007. ISSN 1573-7187. https://doi.org/10.1007/s11238-007-9067-x.
  • Keisler (2022) Ryan Keisler. Forecasting global weather with graph neural networks. February 2022. https://doi.org/10.48550/ARXIV.2202.07575.
  • Kullback and Leibler (1951) S. Kullback and R. A. Leibler. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79–86, March 1951. ISSN 0003-4851. https://doi.org/10.1214/aoms/1177729694.
  • Lam et al. (2022) Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, Alexander Merose, Stephan Hoyer, George Holland, Oriol Vinyals, Jacklynn Stott, Alexander Pritzel, Shakir Mohamed, and Peter Battaglia. Graphcast: Learning skillful medium-range global weather forecasting. December 2022. https://doi.org/10.48550/ARXIV.2212.12794.
  • Lerch and Polsterer (2022) Sebastian Lerch and Kai L. Polsterer. Convolutional autoencoders for spatially-informed ensemble post-processing. In ICLR 2022 - AI for Earth and Space Science Workshop, 2022.
  • Lerch and Thorarinsdottir (2013) Sebastian Lerch and Thordis L. Thorarinsdottir. Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus A: Dynamic Meteorology and Oceanography, 65(1):21206, December 2013. ISSN 1600-0870. https://doi.org/10.3402/tellusa.v65i0.21206.
  • Lerch et al. (2017) Sebastian Lerch, Thordis L. Thorarinsdottir, Francesco Ravazzolo, and Tilmann Gneiting. Forecaster’s dilemma: Extreme events and forecast evaluation. Statistical Science, 32(1), February 2017. ISSN 0883-4237. https://doi.org/10.1214/16-sts588.
  • Matheron (1963) Georges Matheron. Principles of geostatistics. Economic Geology, 58(8):1246–1266, December 1963. ISSN 0361-0128. https://doi.org/10.2113/gsecongeo.58.8.1246.
  • Matheson and Winkler (1976) James E. Matheson and Robert L. Winkler. Scoring rules for continuous probability distributions. Management Science, 22, 1976. https://doi.org/10.2307/2629907.
  • Meng et al. (2023) Xiaochun Meng, James W. Taylor, Souhaib Ben Taieb, and Siran Li. Scores for multivariate distributions and level sets. Operations Research, July 2023. ISSN 1526-5463. https://doi.org/10.1287/opre.2020.0365.
  • Murphy and Winkler (1987) Allan H. Murphy and Robert L. Winkler. A general framework for forecast verification. Monthly Weather Review, 115(7):1330–1338, July 1987. ISSN 1520-0493. https://doi.org/10.1175/1520-0493(1987)115<1330:agfffv>2.0.co;2.
  • Nowotarski and Weron (2018) Jakub Nowotarski and Rafał Weron. Recent advances in electricity price forecasting: A review of probabilistic forecasting. Renewable and Sustainable Energy Reviews, 81:1548–1568, January 2018. ISSN 1364-0321. https://doi.org/10.1016/j.rser.2017.05.234.
  • Pacchiardi et al. (2024) Lorenzo Pacchiardi, Rilwan Adewoyin, Peter Dueben, and Ritabrata Dutta. Probabilistic forecasting with generative networks via scoring rule minimization. Journal of Machine Learning Research, 25(45):1–64, 2024. URL https://jmlr.org/papers/v25/23-0038.html.
  • Palmer (2012) T. N. Palmer. Towards the probabilistic earth-system simulator: a vision for the future of climate and weather prediction. Quarterly Journal of the Royal Meteorological Society, 138(665):841–861, April 2012. ISSN 1477-870X. https://doi.org/10.1002/qj.1923.
  • Parry et al. (2012) Matthew Parry, A. Philip Dawid, and Steffen Lauritzen. Proper local scoring rules. The Annals of Statistics, 40(1), February 2012. ISSN 0090-5364. https://doi.org/10.1214/12-aos971.
  • Pathak et al. (2022) Jaideep Pathak, Shashank Subramanian, Peter Harrington, Sanjeev Raja, Ashesh Chattopadhyay, Morteza Mardani, Thorsten Kurth, David Hall, Zongyi Li, Kamyar Azizzadenesheli, Pedram Hassanzadeh, Karthik Kashinath, and Animashree Anandkumar. FourCastNet: A global data-driven high-resolution weather model using adaptive Fourier neural operators. February 2022.
  • Pinson and Girard (2012) P. Pinson and R. Girard. Evaluating the quality of scenarios of short-term wind power generation. Applied Energy, 96:12–20, August 2012. https://doi.org/10.1016/j.apenergy.2011.11.004.
  • Pinson (2013) Pierre Pinson. Wind energy: Forecasting challenges for its operational management. Statistical Science, 28(4), November 2013. ISSN 0883-4237. https://doi.org/10.1214/13-sts445.
  • Pinson and Tastu (2013) Pierre Pinson and Julija Tastu. Discrimination ability of the energy score. DTU Compute - Technical Report, 2013.
  • Radanovics et al. (2018) Sabine Radanovics, Jean-Philippe Vidal, and Eric Sauquet. Spatial verification of ensemble precipitation: An ensemble version of SAL. Weather and Forecasting, 33(4):1001–1020, July 2018. ISSN 1520-0434. https://doi.org/10.1175/waf-d-17-0162.1.
  • Rasp and Lerch (2018) Stephan Rasp and Sebastian Lerch. Neural networks for postprocessing ensemble weather forecasts. Monthly Weather Review, 146(11):3885–3900, October 2018. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-18-0187.1.
  • Rasp et al. (2024) Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russel, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallègue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, and Fei Sha. Weatherbench 2: A benchmark for the next generation of data-driven global weather models. 2024. https://doi.org/10.48550/ARXIV.2308.15560.
  • Rivoire et al. (2023) Pauline Rivoire, Olivia Martius, Philippe Naveau, and Alexandre Tuel. Assessment of subseasonal-to-seasonal (S2S) ensemble extreme precipitation forecast skill over Europe. Natural Hazards and Earth System Sciences, 23(8):2857–2871, August 2023. ISSN 1684-9981. https://doi.org/10.5194/nhess-23-2857-2023.
  • Roberts and Lean (2008) Nigel M. Roberts and Humphrey W. Lean. Scale-selective verification of rainfall accumulations from high-resolution forecasts of convective events. Monthly Weather Review, 136(1):78–97, January 2008. ISSN 0027-0644. https://doi.org/10.1175/2007mwr2123.1.
  • Roulston and Smith (2002) Mark S. Roulston and Leonard A. Smith. Evaluating probabilistic forecasts using information theory. Monthly Weather Review, 130(6):1653–1660, June 2002. ISSN 1520-0493. https://doi.org/10.1175/1520-0493(2002)130<1653:epfuit>2.0.co;2.
  • Scheuerer and Hamill (2015) Michael Scheuerer and Thomas M. Hamill. Variogram-based proper scoring rules for probabilistic forecasts of multivariate quantities. Monthly Weather Review, 143(4):1321–1334, 2015. https://doi.org/10.1175/mwr-d-14-00269.1.
  • Schorlemmer et al. (2018) Danijel Schorlemmer, Maximilian J. Werner, Warner Marzocchi, Thomas H. Jordan, Yosihiko Ogata, David D. Jackson, Sum Mak, David A. Rhoades, Matthew C. Gerstenberger, Naoshi Hirata, Maria Liukis, Philip J. Maechling, Anne Strader, Matteo Taroni, Stefan Wiemer, Jeremy D. Zechar, and Jiancang Zhuang. The collaboratory for the study of earthquake predictability: Achievements and priorities. Seismological Research Letters, 89(4):1305–1313, June 2018. ISSN 1938-2057. https://doi.org/10.1785/0220180053.
  • Schulz and Lerch (2022) Benedikt Schulz and Sebastian Lerch. Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. Monthly Weather Review, 150(1):235–257, January 2022. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-21-0150.1.
  • Shannon (1948) C. E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27(4):623–656, October 1948. ISSN 0005-8580. https://doi.org/10.1002/j.1538-7305.1948.tb00917.x.
  • Smola et al. (2007) Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Marcus Hutter, Rocco A. Servedio, and Eiji Takimoto, editors, Algorithmic Learning Theory, pages 13–31, Berlin, Heidelberg, 2007. Springer Berlin Heidelberg. ISBN 978-3-540-75225-7.
  • Stein and Stoop (2022) Joël Stein and Fabien Stoop. Neighborhood-based ensemble evaluation using the CRPS. Monthly Weather Review, 150(8):1901–1914, August 2022. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-21-0224.1.
  • Steinwart and Christmann (2008) Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, New York, 2008. ISBN 978-0-387-77241-7.
  • Steinwart and Ziegel (2021) Ingo Steinwart and Johanna F. Ziegel. Strictly proper kernel scores and characteristic kernels on compact spaces. Applied and Computational Harmonic Analysis, 51:510–542, 2021. ISSN 1063-5203. https://doi.org/10.1016/j.acha.2019.11.005. URL https://www.sciencedirect.com/science/article/pii/S1063520317301483.
  • Székely (2003) Gábor Székely. E-statistics: The energy of statistical samples. Technical report, Bowling Green State University, 2003.
  • Taillardat (2021) Maxime Taillardat. Skewed and mixture of Gaussian distributions for ensemble postprocessing. Atmosphere, 12(8):966, July 2021. ISSN 2073-4433. https://doi.org/10.3390/atmos12080966.
  • Taillardat and Mestre (2020) Maxime Taillardat and Olivier Mestre. From research to applications – examples of operational ensemble post-processing in France using machine learning. Nonlinear Processes in Geophysics, 27(2):329–347, May 2020. ISSN 1607-7946. https://doi.org/10.5194/npg-27-329-2020.
  • Taillardat et al. (2016) Maxime Taillardat, Olivier Mestre, Michaël Zamo, and Philippe Naveau. Calibrated ensemble forecasts using quantile regression forests and ensemble model output statistics. Monthly Weather Review, 144(6):2375–2393, June 2016. ISSN 1520-0493. https://doi.org/10.1175/mwr-d-15-0260.1.
  • Talagrand et al. (1997) O. Talagrand, R. Vautard, and B Strauss. Evaluation of probabilistic prediction systems. In Workshop on Predictability, 20-22 October 1997, pages 1–26, Shinfield Park, Reading, 1997. ECMWF.
  • Thorarinsdottir and Schuhen (2018) Thordis L. Thorarinsdottir and Nina Schuhen. Verification: Assessment of Calibration and Accuracy, pages 155–186. Elsevier, 2018. https://doi.org/10.1016/b978-0-12-812372-0.00006-6.
  • Thorarinsdottir et al. (2013) Thordis L. Thorarinsdottir, Tilmann Gneiting, and Nadine Gissibl. Using proper divergence functions to evaluate climate models. SIAM/ASA Journal on Uncertainty Quantification, 1(1):522–534, January 2013. ISSN 2166-2525. https://doi.org/10.1137/130907550.
  • Tsyplakov (2011) Alexander Tsyplakov. Evaluating density forecasts: A comment. SSRN Electronic Journal, 2011. ISSN 1556-5068. https://doi.org/10.2139/ssrn.1907799.
  • Tsyplakov (2013) Alexander Tsyplakov. Evaluation of probabilistic forecasts: Proper scoring rules and moments. SSRN Electronic Journal, 2013. ISSN 1556-5068. https://doi.org/10.2139/ssrn.2236605.
  • Tsyplakov (2020) Alexander Tsyplakov. Evaluation of probabilistic forecasts: Conditional auto-calibration, 2020. URL https://www.sas.upenn.edu/~fdiebold/papers2/Tsyplakov_Auto_calibration_sent_eswc2020.pdf.
  • Vannitsem et al. (2021) Stéphane Vannitsem, John Bjørnar Bremnes, Jonathan Demaeyer, Gavin R. Evans, Jonathan Flowerdew, Stephan Hemri, Sebastian Lerch, Nigel Roberts, Susanne Theis, Aitor Atencia, Zied Ben Bouallègue, Jonas Bhend, Markus Dabernig, Lesley De Cruz, Leila Hieta, Olivier Mestre, Lionel Moret, Iris Odak Plenković, Maurice Schmeits, Maxime Taillardat, Joris Van den Bergh, Bert Van Schaeybroeck, Kirien Whan, and Jussi Ylhaisi. Statistical postprocessing for weather forecasts: Review, challenges, and avenues in a big data world. Bulletin of the American Meteorological Society, 102(3):E681–E699, March 2021. ISSN 1520-0477. https://doi.org/10.1175/bams-d-19-0308.1.
  • Wernli et al. (2008) Heini Wernli, Marcus Paulat, Martin Hagen, and Christoph Frei. Sal—a novel quality measure for the verification of quantitative precipitation forecasts. Monthly Weather Review, 136(11):4470–4487, November 2008. ISSN 0027-0644. https://doi.org/10.1175/2008mwr2415.1.
  • Winkelbauer (2014) Andreas Winkelbauer. Moments and absolute moments of the normal distribution. September 2014. https://doi.org/10.48550/ARXIV.1209.4340.
  • Winkler et al. (1996) R. L. Winkler, Javier Muñoz, José L. Cervera, José M. Bernardo, Gail Blattenberger, Joseph B. Kadane, Dennis V. Lindley, Allan H. Murphy, Robert M Oliver, and David Ríos-Insua. Scoring rules and the evaluation of probabilities. Test, 5(1):1–60, June 1996. ISSN 1863-8260. https://doi.org/10.1007/bf02562681.
  • Winkler (1977) Robert L. Winkler. Rewarding Expertise in Probability Assessment, pages 127–140. Springer Netherlands, 1977. ISBN 9789401012768. https://doi.org/10.1007/978-94-010-1276-8_10.
  • Zamo and Naveau (2017) Michaël Zamo and Philippe Naveau. Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Mathematical Geosciences, 50(2):209–234, November 2017. ISSN 1874-8953. https://doi.org/10.1007/s11004-017-9709-7.
  • Ziel and Berk (2019) Florian Ziel and Kevin Berk. Multivariate forecasting evaluation: On sensitive and strictly proper scoring rules. 2019.

Appendix A Expected univariate scoring rules

A.1 Squared Error

For any $F,G\in\mathcal{P}_{2}(\mathbb{R})$, the expectation of the squared error (2) is:

$$\mathbb{E}_{G}[\mathrm{SE}(F,Y)]=(\mu_{F}-\mu_{G})^{2}+\sigma_{G}^{2},$$

where $\mu_{F}$ is the mean of the distribution $F$, and $\mu_{G}$ and $\sigma_{G}^{2}$ are the mean and the variance of the distribution $G$.

Proof.

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{SE}(F,Y)] &= \mathbb{E}_{G}[(\mu_{F}-Y)^{2}]\\
&= \mu_{F}^{2}-2\,\mu_{F}\,\mathbb{E}_{G}[Y]+\mathbb{E}_{G}[Y^{2}].
\end{aligned}$$

Using the fact that $\mathbb{E}[X^{2}]=\mathrm{Var}(X)+\mathbb{E}[X]^{2}$,

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{SE}(F,Y)] &= \mu_{F}^{2}-2\,\mu_{F}\,\mu_{G}+\sigma_{G}^{2}+\mu_{G}^{2}\\
&= (\mu_{F}-\mu_{G})^{2}+\sigma_{G}^{2}.
\end{aligned}$$
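The identity of A.1 can be checked numerically. The following is a minimal sketch, assuming that $G$ is the empirical distribution of a small hypothetical sample (so that $\sigma_{G}^{2}$ is the population variance of that sample) and that the forecast $F$ enters only through its mean `mu_f`; the function names and values are illustrative and not taken from the paper.

```python
from statistics import fmean, pvariance


def expected_squared_error(mu_f, sample_g):
    """Mean of (mu_f - y)^2 when Y follows the empirical distribution of sample_g."""
    return fmean((mu_f - y) ** 2 for y in sample_g)


def bias_variance_form(mu_f, sample_g):
    """(mu_F - mu_G)^2 + sigma_G^2, with sigma_G^2 the population variance of sample_g."""
    mu_g = fmean(sample_g)
    return (mu_f - mu_g) ** 2 + pvariance(sample_g, mu=mu_g)


# Hypothetical values: G is the empirical distribution of four points.
sample_g = [0.5, 1.0, 2.5, 4.0]
mu_f = 1.2  # mean of the forecast F

lhs = expected_squared_error(mu_f, sample_g)
rhs = bias_variance_form(mu_f, sample_g)
print(lhs, rhs)  # both sides agree up to floating-point rounding
```

Because the variance term does not depend on $F$, the expected score is minimized exactly when $\mu_{F}=\mu_{G}$, which is why the squared error only assesses the mean of the forecast.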

A.2 Quantile Score

For any $F,G\in\mathcal{P}_{1}(\mathbb{R})$, the expectation of the quantile score of level $\alpha$ (4) is:

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{QS}_{\alpha}(F,Y)] &= \int_{-\infty}^{F^{-1}(\alpha)}(F^{-1}(\alpha)-y)\,G(\mathrm{d}y)-\alpha\int_{\mathbb{R}}(F^{-1}(\alpha)-y)\,G(\mathrm{d}y);\\
&= \mathbb{E}_{G}[\mathrm{QS}_{\alpha}(G,Y)]+\left\{(G(F^{-1}(\alpha))-\alpha)(F^{-1}(\alpha)-G^{-1}(\alpha))-\int_{G^{-1}(\alpha)}^{F^{-1}(\alpha)}(y-G^{-1}(\alpha))\,G(\mathrm{d}y)\right\}.
\end{aligned}$$

Proof.

The argument is inspired by the proof of the propriety of the quantile score in Friederichs and Hense (2008).

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{QS}_{\alpha}(F,Y)] &= \int_{\mathbb{R}}(\mathds{1}_{y\leq F^{-1}(\alpha)}-\alpha)(F^{-1}(\alpha)-y)\,G(\mathrm{d}y)\\
&= \int_{-\infty}^{F^{-1}(\alpha)}(1-\alpha)(F^{-1}(\alpha)-y)\,G(\mathrm{d}y)+\int_{F^{-1}(\alpha)}^{+\infty}(-\alpha)(F^{-1}(\alpha)-y)\,G(\mathrm{d}y)\\
&= \int_{-\infty}^{F^{-1}(\alpha)}(F^{-1}(\alpha)-y)\,G(\mathrm{d}y)-\alpha\int_{\mathbb{R}}(F^{-1}(\alpha)-y)\,G(\mathrm{d}y).
\end{aligned}$$

Then, using $F^{-1}(\alpha)-y=(F^{-1}(\alpha)-G^{-1}(\alpha))+(G^{-1}(\alpha)-y)$,

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{QS}_{\alpha}(F,Y)] &= \int_{-\infty}^{F^{-1}(\alpha)}(F^{-1}(\alpha)-G^{-1}(\alpha))\,G(\mathrm{d}y)-\alpha\int_{\mathbb{R}}(F^{-1}(\alpha)-G^{-1}(\alpha))\,G(\mathrm{d}y)\\
&\quad+\int_{-\infty}^{F^{-1}(\alpha)}(G^{-1}(\alpha)-y)\,G(\mathrm{d}y)-\alpha\int_{\mathbb{R}}(G^{-1}(\alpha)-y)\,G(\mathrm{d}y)\\
&= (G(F^{-1}(\alpha))-\alpha)(F^{-1}(\alpha)-G^{-1}(\alpha))\\
&\quad+\int_{-\infty}^{F^{-1}(\alpha)}(G^{-1}(\alpha)-y)\,G(\mathrm{d}y)-\alpha\int_{\mathbb{R}}(G^{-1}(\alpha)-y)\,G(\mathrm{d}y)\\
&= (G(F^{-1}(\alpha))-\alpha)(F^{-1}(\alpha)-G^{-1}(\alpha))\\
&\quad+\int_{-\infty}^{G^{-1}(\alpha)}(G^{-1}(\alpha)-y)\,G(\mathrm{d}y)+\int_{G^{-1}(\alpha)}^{F^{-1}(\alpha)}(G^{-1}(\alpha)-y)\,G(\mathrm{d}y)-\alpha\int_{\mathbb{R}}(G^{-1}(\alpha)-y)\,G(\mathrm{d}y)\\
&= (G(F^{-1}(\alpha))-\alpha)(F^{-1}(\alpha)-G^{-1}(\alpha))+\mathbb{E}_{G}[\mathrm{QS}_{\alpha}(G,Y)]-\int_{G^{-1}(\alpha)}^{F^{-1}(\alpha)}(y-G^{-1}(\alpha))\,G(\mathrm{d}y).
\end{aligned}$$

A.3 Absolute Error

First of all, for $F\in\mathcal{P}_{1}(\mathbb{R})$ and $y\in\mathbb{R}$, the absolute error (3) is equal to twice the quantile score of level $\alpha=0.5$:

$$\mathrm{AE}(F,y)=|\mathrm{med}(F)-y|=2\,\mathrm{QS}_{0.5}(F,y),$$

where $\mathrm{med}(F)$ is the median of the distribution $F$.

Applying the result of A.2 with $\alpha=0.5$, it follows that, for any $F,G\in\mathcal{P}_{1}(\mathbb{R})$, the expectation of the absolute error is:

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{AE}(F,Y)] &= \mathbb{E}_{G}[|\mathrm{med}(F)-Y|];\\
&= 2\int_{-\infty}^{\mathrm{med}(F)}(\mathrm{med}(F)-y)\,G(\mathrm{d}y)-2\alpha\int_{\mathbb{R}}(\mathrm{med}(F)-y)\,G(\mathrm{d}y);\\
&= \mathbb{E}_{G}[\mathrm{AE}(G,Y)]+2\left\{(G(\mathrm{med}(F))-\alpha)(\mathrm{med}(F)-\mathrm{med}(G))-\int_{\mathrm{med}(G)}^{\mathrm{med}(F)}(y-\mathrm{med}(G))\,G(\mathrm{d}y)\right\},
\end{aligned}$$

where $\alpha=0.5$ throughout.
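The pointwise identity $\mathrm{AE}(F,y)=2\,\mathrm{QS}_{0.5}(F,y)$ is easy to confirm numerically. The snippet below is a small sketch with hypothetical values, in which `median_f` stands for $\mathrm{med}(F)$; it is meant as an illustration rather than as part of the derivation.

```python
def quantile_score(q, y, alpha):
    """QS_alpha for a forecast quantile q and an observation y."""
    return ((1.0 if y <= q else 0.0) - alpha) * (q - y)


def absolute_error(median_f, y):
    """AE for a forecast with median median_f and an observation y."""
    return abs(median_f - y)


# Hypothetical forecast median and observations on both sides of it.
median_f = 1.3
observations = [-2.0, 0.0, 1.3, 1.7, 5.0]

gaps = [
    absolute_error(median_f, y) - 2 * quantile_score(median_f, y, 0.5)
    for y in observations
]
max_gap = max(abs(gap) for gap in gaps)
print(max_gap)
```

Since the identity holds for every observation, it carries over to expectations, which is how the formula for the expected absolute error follows from the one for the expected quantile score.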

A.4 Brier score

For any $F,G\in\mathcal{P}(\mathbb{R})$, the expectation of the Brier score (5) is:

$$\mathbb{E}_{G}[\mathrm{BS}_{t}(F,Y)]=(F(t)-G(t))^{2}+G(t)(1-G(t)).$$

Proof.

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{BS}_{t}(F,Y)] &= \mathbb{E}_{G}[(F(t)-\mathds{1}_{Y\leq t})^{2}]\\
&= F(t)^{2}-2F(t)\,\mathbb{E}_{G}[\mathds{1}_{Y\leq t}]+\mathbb{E}_{G}[{\mathds{1}_{Y\leq t}}^{2}]\\
&= F(t)^{2}-2F(t)G(t)+G(t)\\
&= F(t)^{2}-2F(t)G(t)+G(t)^{2}-G(t)^{2}+G(t)\\
&= (F(t)-G(t))^{2}+G(t)(1-G(t)).
\end{aligned}$$
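As a quick sanity check of the A.4 formula, the sketch below evaluates both sides when $G$ is the empirical distribution of a hypothetical sample. The forecast enters only through `p_f`, which stands for $F(t)$; the threshold and all values are assumptions chosen for the illustration.

```python
from statistics import fmean


def brier_score(p, y, t):
    """BS_t for a forecast probability p = F(t) of the event {Y <= t}."""
    return (p - (1.0 if y <= t else 0.0)) ** 2


# Hypothetical values: two of the five points fall below t, so G(t) = 0.4.
sample_g = [0.2, 0.9, 1.4, 2.0, 3.1]
t = 1.0
p_f = 0.7  # forecast probability F(t)

g_t = fmean(1.0 if y <= t else 0.0 for y in sample_g)  # G(t)
lhs = fmean(brier_score(p_f, y, t) for y in sample_g)  # E_G[BS_t(F, Y)]
rhs = (p_f - g_t) ** 2 + g_t * (1.0 - g_t)
print(lhs, rhs)  # both sides agree up to floating-point rounding
```

The second term, $G(t)(1-G(t))$, is the variance of the indicator $\mathds{1}_{Y\leq t}$ and does not depend on the forecast, so only the first term penalizes a miscalibrated $F(t)$.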

A.5 Continuous Ranked Probability Score

For any $F,G\in\mathcal{P}_{1}(\mathbb{R})$, the expectation of the CRPS (7) is:

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{CRPS}(F,Y)] &= \mathbb{E}_{F,G}|X-Y|-\frac{1}{2}\mathbb{E}_{F}|X-X^{\prime}|;\\
&= \int_{\mathbb{R}}(F(z)-G(z))^{2}\,\mathrm{d}z+\int_{\mathbb{R}}G(z)(1-G(z))\,\mathrm{d}z,
\end{aligned}$$

where the second term of the last line is the entropy of the CRPS.

Proof.

$$
\begin{aligned}
\mathbb{E}_{G}[\mathrm{CRPS}(F,Y)] &= \mathbb{E}_{G}\left[\int_{\mathbb{R}}(F(z)-\mathds{1}_{Y\leq z})^{2}\,\mathrm{d}z\right]\\
&= \int_{\mathbb{R}}\mathbb{E}_{G}\left[(F(z)-\mathds{1}_{Y\leq z})^{2}\right]\mathrm{d}z\\
&= \int_{\mathbb{R}}\mathbb{E}_{G}\left[F(z)^{2}-2F(z)\mathds{1}_{Y\leq z}+\mathds{1}_{Y\leq z}^{2}\right]\mathrm{d}z\\
&= \int_{\mathbb{R}}\left\{F(z)^{2}-2F(z)\mathbb{E}_{G}\left[\mathds{1}_{Y\leq z}\right]+\mathbb{E}_{G}\left[\mathds{1}_{Y\leq z}\right]\right\}\mathrm{d}z\\
&= \int_{\mathbb{R}}\left\{F(z)^{2}-2F(z)G(z)+G(z)\right\}\mathrm{d}z\\
&= \int_{\mathbb{R}}\left\{F(z)^{2}-2F(z)G(z)+G(z)^{2}-G(z)^{2}+G(z)\right\}\mathrm{d}z\\
&= \int_{\mathbb{R}}(F(z)-G(z))^{2}\,\mathrm{d}z+\int_{\mathbb{R}}G(z)(1-G(z))\,\mathrm{d}z
\end{aligned}
$$

The exchange of expectation and integral in the second line follows from Tonelli's theorem since the integrand is non-negative. The subsequent lines use $\mathds{1}_{Y\leq z}^{2}=\mathds{1}_{Y\leq z}$ and $\mathbb{E}_{G}[\mathds{1}_{Y\leq z}]=G(z)$.
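The decomposition can be checked numerically. The sketch below is an illustration only, not part of the original derivation: it assumes that both $F$ and $G$ are normal distributions, integrates the two terms with a plain trapezoidal rule, and compares their sum with the expected CRPS obtained from the well-known kernel representation $\mathrm{CRPS}(F,y)=\mathbb{E}_{F}|X-y|-\frac{1}{2}\mathbb{E}_{F}|X-X'|$. All function names and parameter values are hypothetical.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def trapezoid(func, lower, upper, n_steps=20000):
    """Composite trapezoidal rule for func on [lower, upper]."""
    step = (upper - lower) / n_steps
    total = 0.5 * (func(lower) + func(upper))
    for k in range(1, n_steps):
        total += func(lower + k * step)
    return total * step


def crps_decomposition(mu_f, sigma_f, mu_g, sigma_g, n_sd=12.0):
    """Return the (divergence, entropy) terms of the expected CRPS for normal F and G."""
    # Truncate the real line far enough in the tails of both distributions.
    lower = min(mu_f - n_sd * sigma_f, mu_g - n_sd * sigma_g)
    upper = max(mu_f + n_sd * sigma_f, mu_g + n_sd * sigma_g)

    def cdf_f(z):
        return normal_cdf(z, mu_f, sigma_f)

    def cdf_g(z):
        return normal_cdf(z, mu_g, sigma_g)

    divergence = trapezoid(lambda z: (cdf_f(z) - cdf_g(z)) ** 2, lower, upper)
    entropy = trapezoid(lambda z: cdf_g(z) * (1.0 - cdf_g(z)), lower, upper)
    return divergence, entropy


def mean_abs_normal(mean, sd):
    """E|Z| for Z ~ N(mean, sd^2), i.e. the mean of a folded normal distribution."""
    return (sd * math.sqrt(2.0 / math.pi) * math.exp(-mean ** 2 / (2.0 * sd ** 2))
            + mean * math.erf(mean / (sd * math.sqrt(2.0))))


def expected_crps_kernel(mu_f, sigma_f, mu_g, sigma_g):
    """E_G[CRPS(F, Y)] from the kernel form E|X - Y| - 0.5 E|X - X'|."""
    cross = mean_abs_normal(mu_f - mu_g, math.hypot(sigma_f, sigma_g))
    spread = mean_abs_normal(0.0, math.sqrt(2.0) * sigma_f)
    return cross - 0.5 * spread


divergence, entropy = crps_decomposition(0.0, 1.0, 0.5, 2.0)
print(divergence + entropy, expected_crps_kernel(0.0, 1.0, 0.5, 2.0))
```

The two printed values should agree up to the quadrature error. The entropy term depends on $G$ only, which is why it does not affect the ranking of competing forecasts.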

A.6 Dawid-Sebastiani score

For any $F,G\in\mathcal{P}_{2}(\mathbb{R})$, the expectation of the Dawid-Sebastiani score (9) is:

$$\mathbb{E}_{G}[\mathrm{DSS}(F,Y)]=\frac{(\mu_{F}-\mu_{G})^{2}}{\sigma_{F}^{2}}+\frac{\sigma_{G}^{2}}{\sigma_{F}^{2}}+2\log\sigma_{F}.$$

Proof.

$$
\begin{aligned}
\mathbb{E}_{G}[\mathrm{DSS}(F,Y)] &= \mathbb{E}_{G}\left[\frac{(Y-\mu_{F})^{2}}{\sigma_{F}^{2}}+2\log\sigma_{F}\right]\\
&= \frac{\mathbb{E}_{G}\left[(Y-\mu_{F})^{2}\right]}{\sigma_{F}^{2}}+2\log\sigma_{F}
\end{aligned}
$$

Noticing that $\mathbb{E}_{G}\left[(Y-\mu_{F})^{2}\right]=\mathbb{E}_{G}\left[\mathrm{SE}(F,Y)\right]=(\mu_{F}-\mu_{G})^{2}+\sigma_{G}^{2}$,

$$\mathbb{E}_{G}[\mathrm{DSS}(F,Y)]=\frac{(\mu_{F}-\mu_{G})^{2}+\sigma_{G}^{2}}{\sigma_{F}^{2}}+2\log\sigma_{F}.$$
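As a quick sanity check, the closed form can be compared with a direct average of the score. The sketch below is illustrative only: it takes $G$ to be the empirical distribution of a handful of made-up observations, so that the expectation under $G$ is a plain average and $\sigma_{G}^{2}$ is the population variance. Function names and numerical values are hypothetical.

```python
import math
from statistics import fmean, pvariance


def dss(mu_f, sigma_f, y):
    """Univariate Dawid-Sebastiani score, in the form used in the proof."""
    return (y - mu_f) ** 2 / sigma_f ** 2 + 2.0 * math.log(sigma_f)


def expected_dss(mu_f, sigma_f, mu_g, sigma_g):
    """Closed-form expectation of the DSS when Y has mean mu_g and sd sigma_g."""
    return ((mu_f - mu_g) ** 2 + sigma_g ** 2) / sigma_f ** 2 + 2.0 * math.log(sigma_f)


# Hypothetical observations; their empirical distribution plays the role of G.
observations = [-1.0, 0.5, 1.0, 2.5, 4.0]
mu_g = fmean(observations)
sigma_g = math.sqrt(pvariance(observations))

# Hypothetical forecast, summarized by its mean and standard deviation.
mu_f, sigma_f = 1.0, 1.5

average_score = fmean(dss(mu_f, sigma_f, y) for y in observations)
closed_form = expected_dss(mu_f, sigma_f, mu_g, sigma_g)
print(average_score, closed_form)
```

Since the expected score depends on $F$ only through $\mu_{F}$ and $\sigma_{F}$, any forecast sharing the first two moments of $G$ attains the minimum.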

A.7 Error-spread score

For any $F,G\in\mathcal{P}_{4}(\mathbb{R})$, the expectation of the error-spread score (10) is:

$$
\begin{aligned}
\mathbb{E}_{G}[\mathrm{ESS}(F,Y)] &= \left[(\sigma_{G}^{2}-\sigma_{F}^{2})+(\mu_{G}-\mu_{F})^{2}-\sigma_{F}\gamma_{F}(\mu_{G}-\mu_{F})\right]^{2}\\
&\quad+\sigma_{G}^{2}\left[2(\mu_{G}-\mu_{F})+(\sigma_{G}\gamma_{G}-\sigma_{F}\gamma_{F})\right]^{2}\\
&\quad+\sigma_{G}^{4}(\beta_{G}-\gamma_{G}^{2}-1),
\end{aligned}
$$

where $\mu_{F}$, $\sigma_{F}^{2}$ and $\gamma_{F}$ are the mean, the variance and the skewness of the probabilistic forecast $F$. Similarly, $\mu_{G}$, $\sigma_{G}^{2}$, $\gamma_{G}$ and $\beta_{G}$ are the mean, the variance, the skewness and the kurtosis of the distribution $G$. The proof is available in Appendix B of Christensen et al. (2014).
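The three-term expression can be verified on an empirical distribution, for which the expectation is a finite average. The sketch below is an illustration under an explicit assumption: the error-spread score (10) is taken to be $\mathrm{ESS}(F,y)=\left(\sigma_{F}^{2}-e^{2}-e\,\sigma_{F}\gamma_{F}\right)^{2}$ with error $e=\mu_{F}-y$, which is the sign convention consistent with the expectation above. Function names and data are hypothetical.

```python
from statistics import fmean


def ess(mu_f, sigma_f, gamma_f, y):
    """Error-spread score, assuming the error is defined as e = mu_f - y."""
    error = mu_f - y
    return (sigma_f ** 2 - error ** 2 - error * sigma_f * gamma_f) ** 2


def standardized_moments(sample):
    """Mean, standard deviation, skewness and kurtosis of an empirical distribution."""
    mean = fmean(sample)
    sd = fmean((y - mean) ** 2 for y in sample) ** 0.5
    skewness = fmean(((y - mean) / sd) ** 3 for y in sample)
    kurtosis = fmean(((y - mean) / sd) ** 4 for y in sample)
    return mean, sd, skewness, kurtosis


def expected_ess(mu_f, sigma_f, gamma_f, mu_g, sigma_g, gamma_g, beta_g):
    """Closed-form expectation of the ESS from the three-term expression."""
    shift = mu_g - mu_f
    first = (sigma_g ** 2 - sigma_f ** 2 + shift ** 2 - sigma_f * gamma_f * shift) ** 2
    second = sigma_g ** 2 * (2.0 * shift + sigma_g * gamma_g - sigma_f * gamma_f) ** 2
    third = sigma_g ** 4 * (beta_g - gamma_g ** 2 - 1.0)
    return first + second + third


# Hypothetical observations; their empirical distribution plays the role of G.
observations = [-1.0, 0.5, 1.0, 2.5, 4.0]
# Hypothetical forecast summarized by its mean, standard deviation and skewness.
forecast = (0.8, 1.2, 0.3)

average_score = fmean(ess(*forecast, y) for y in observations)
closed_form = expected_ess(*forecast, *standardized_moments(observations))
print(average_score, closed_form)
```

Only the first two terms involve the forecast. The last term depends on $G$ alone and is non-negative because the kurtosis of any distribution is at least its squared skewness plus one.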

A.8 Logarithmic score

For any $F,G\in\mathcal{P}(\mathbb{R})$ such that $F$ and $G$ have probability density functions $f$ and $g$ in the class $\mathcal{L}_{1}(\mathbb{R})$, the expectation of the logarithmic score (11) is:

$$\mathbb{E}_{G}[\mathrm{LogS}(F,Y)]=D_{\mathrm{KL}}(G\|F)+\mathrm{H}(G),$$

where $D_{\mathrm{KL}}(G\|F)$ is the Kullback-Leibler divergence from $F$ to $G$ and $\mathrm{H}(G)$ is the Shannon entropy of $G$. The proof is straightforward given that the Kullback-Leibler divergence and Shannon entropy are defined as

$$
\begin{aligned}
D_{\mathrm{KL}}(G\|F) &= \int_{\mathbb{R}}g(y)\log\left(\frac{g(y)}{f(y)}\right)\mathrm{d}y;\\
\mathrm{H}(G) &= -\int_{\mathbb{R}}g(y)\log(g(y))\,\mathrm{d}y.
\end{aligned}
$$
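A numerical illustration may help fix the roles of the two terms. The sketch below assumes that the logarithmic score (11) is $\mathrm{LogS}(F,y)=-\log f(y)$ and that $F$ and $G$ are normal distributions. It integrates the expected score $-\int g\log f$, the Kullback-Leibler divergence $\int g\log(g/f)$ and the entropy of the observation distribution $-\int g\log g$ with a trapezoidal rule. Function names and parameter values are hypothetical.

```python
import math


def normal_logpdf(y, mu, sigma):
    """Log-density of the normal distribution N(mu, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)


def trapezoid(func, lower, upper, n_steps=20000):
    """Composite trapezoidal rule for func on [lower, upper]."""
    step = (upper - lower) / n_steps
    total = 0.5 * (func(lower) + func(upper))
    for k in range(1, n_steps):
        total += func(lower + k * step)
    return total * step


def logs_terms(mu_f, sigma_f, mu_g, sigma_g, n_sd=12.0):
    """Return (expected score, KL divergence, entropy of G) for normal F and G."""
    # Every integrand is weighted by g, so truncating around G is sufficient.
    lower, upper = mu_g - n_sd * sigma_g, mu_g + n_sd * sigma_g

    def log_f(y):
        return normal_logpdf(y, mu_f, sigma_f)

    def log_g(y):
        return normal_logpdf(y, mu_g, sigma_g)

    expected_score = trapezoid(lambda y: -math.exp(log_g(y)) * log_f(y), lower, upper)
    divergence = trapezoid(
        lambda y: math.exp(log_g(y)) * (log_g(y) - log_f(y)), lower, upper)
    entropy_g = trapezoid(lambda y: -math.exp(log_g(y)) * log_g(y), lower, upper)
    return expected_score, divergence, entropy_g


expected_score, divergence, entropy_g = logs_terms(0.0, 1.5, 0.5, 1.0)
print(expected_score, divergence + entropy_g)
```

The divergence vanishes when $F=G$, in which case the expected score reduces to the entropy term.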

A.9 Hyvärinen score

For $F,G$ such that their densities $f$ and $g$ exist, are twice continuously differentiable and satisfy $f'(x)/f(x)\to 0$ as $|x|\to\infty$ and $g'(x)/g(x)\to 0$ as $|x|\to\infty$, the expectation of the Hyvärinen score is:

$$
\begin{aligned}
\mathbb{E}_{G}[\mathrm{HS}(F,Y)] &= \int_{\mathbb{R}}\left(\frac{f'(y)^{2}}{f(y)^{2}}-2\frac{f'(y)g'(y)}{f(y)g(y)}\right)g(y)\,\mathrm{d}y\\
&= \int_{\mathbb{R}}\left(\frac{f'(y)}{f(y)}-\frac{g'(y)}{g(y)}\right)^{2}g(y)\,\mathrm{d}y-\int_{\mathbb{R}}\frac{g'(y)^{2}}{g(y)^{2}}g(y)\,\mathrm{d}y,
\end{aligned}
$$

where the last formula shows the entropy of the Hyvärinen score (second term on the right-hand side).

Proof.

$$
\begin{aligned}
\mathbb{E}_{G}[\mathrm{HS}(F,Y)] &= \mathbb{E}_{G}\left[2\frac{f''(Y)}{f(Y)}-\frac{f'(Y)^{2}}{f(Y)^{2}}\right]\\
&= 2\int_{\mathbb{R}}\frac{f''(y)}{f(y)}g(y)\,\mathrm{d}y-\int_{\mathbb{R}}\frac{f'(y)^{2}}{f(y)^{2}}g(y)\,\mathrm{d}y
\end{aligned}
$$

Integrating the first term on the right-hand side by parts leads to:

$$
\begin{aligned}
\int_{\mathbb{R}}\frac{f''(y)}{f(y)}g(y)\,\mathrm{d}y &= \left[\frac{f'(y)}{f(y)}g(y)\right]_{-\infty}^{+\infty}-\int_{\mathbb{R}}f'(y)\frac{g'(y)f(y)-g(y)f'(y)}{f(y)^{2}}\,\mathrm{d}y\\
&= -\int_{\mathbb{R}}\frac{f'(y)g'(y)}{f(y)g(y)}g(y)\,\mathrm{d}y+\int_{\mathbb{R}}\frac{f'(y)^{2}}{f(y)^{2}}g(y)\,\mathrm{d}y
\end{aligned}
$$

The boundary term is null since $f'(x)/f(x)\to 0$ as $|x|\to\infty$ and $g$ is a probability density function.
Thus,

$$
\begin{aligned}
\mathbb{E}_{G}[\mathrm{HS}(F,Y)] &= -2\int_{\mathbb{R}}\frac{f'(y)g'(y)}{f(y)g(y)}g(y)\,\mathrm{d}y+2\int_{\mathbb{R}}\frac{f'(y)^{2}}{f(y)^{2}}g(y)\,\mathrm{d}y-\int_{\mathbb{R}}\frac{f'(y)^{2}}{f(y)^{2}}g(y)\,\mathrm{d}y\\
&= -2\int_{\mathbb{R}}\frac{f'(y)g'(y)}{f(y)g(y)}g(y)\,\mathrm{d}y+\int_{\mathbb{R}}\frac{f'(y)^{2}}{f(y)^{2}}g(y)\,\mathrm{d}y\\
&= \int_{\mathbb{R}}\left(\frac{f'(y)^{2}}{f(y)^{2}}-2\frac{f'(y)g'(y)}{f(y)g(y)}\right)g(y)\,\mathrm{d}y
\end{aligned}
$$
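The identity obtained after integration by parts can be checked numerically. The sketch below uses normal densities purely for convenience, which is an assumption of this illustration: their ratio $f'/f$ is linear rather than vanishing, but the boundary term $f'(y)g(y)/f(y)$ still tends to zero, which is all the integration by parts requires. It compares the direct expectation of $2f''/f-(f'/f)^{2}$ with the final expression of the proof. Function names and parameter values are hypothetical.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def log_derivative(y, mu, sigma):
    """f'(y) / f(y) for the normal density N(mu, sigma^2)."""
    return -(y - mu) / sigma ** 2


def hyvarinen_score(y, mu, sigma):
    """HS(F, y) = 2 f''(y)/f(y) - (f'(y)/f(y))^2 for a normal forecast F."""
    second_ratio = (y - mu) ** 2 / sigma ** 4 - 1.0 / sigma ** 2  # f''(y) / f(y)
    return 2.0 * second_ratio - log_derivative(y, mu, sigma) ** 2


def trapezoid(func, lower, upper, n_steps=20000):
    """Composite trapezoidal rule for func on [lower, upper]."""
    step = (upper - lower) / n_steps
    total = 0.5 * (func(lower) + func(upper))
    for k in range(1, n_steps):
        total += func(lower + k * step)
    return total * step


def expected_hs(mu_f, sigma_f, mu_g, sigma_g, n_sd=12.0):
    """Return E_G[HS(F, Y)] computed directly and via the integrated-by-parts form."""
    lower, upper = mu_g - n_sd * sigma_g, mu_g + n_sd * sigma_g

    def g(y):
        return normal_pdf(y, mu_g, sigma_g)

    def by_parts_integrand(y):
        ratio_f = log_derivative(y, mu_f, sigma_f)
        ratio_g = log_derivative(y, mu_g, sigma_g)
        return (ratio_f ** 2 - 2.0 * ratio_f * ratio_g) * g(y)

    direct = trapezoid(lambda y: hyvarinen_score(y, mu_f, sigma_f) * g(y), lower, upper)
    by_parts = trapezoid(by_parts_integrand, lower, upper)
    return direct, by_parts


direct, by_parts = expected_hs(0.0, 1.5, 0.5, 1.0)
print(direct, by_parts)
```

For normal distributions both quantities equal $\left((\mu_{F}-\mu_{G})^{2}+\sigma_{G}^{2}\right)/\sigma_{F}^{4}-2/\sigma_{F}^{2}$, which gives a third, closed-form point of comparison.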

A.10 Quadratic score

For any $F,G\in\mathcal{L}_{2}(\mathbb{R})$, the expectation of the quadratic score is:

$$\mathbb{E}_{G}[\mathrm{QuadS}(F,Y)]=\lVert f\rVert_{2}^{2}-2\langle f,g\rangle,$$

where $\langle f,g\rangle=\int_{\mathbb{R}}f(y)g(y)\,\mathrm{d}y$.

A.11 Pseudospherical score

For any $F,G\in\mathcal{L}_{\alpha}(\mathbb{R})$, the expectation of the pseudospherical score is:

$$\mathbb{E}_{G}[\mathrm{PseudoS}(F,Y)]=-\frac{\langle f^{\alpha-1},g\rangle}{\lVert f\rVert_{\alpha}^{\alpha-1}},$$

where $\langle f^{\alpha-1},g\rangle=\int_{\mathbb{R}}f(y)^{\alpha-1}g(y)\,\mathrm{d}y$.

Appendix B Expected multivariate scoring rules

B.1 Squared error

For any $F,G\in\mathcal{P}_{2}(\mathbb{R}^{d})$, the expectation of the squared error (12) is:

$$\mathbb{E}_{G}[\mathrm{SE}(F,\bm{Y})]=\lVert\bm{\mu}_{F}-\bm{\mu}_{G}\rVert_{2}^{2}+\mathrm{tr}(\Sigma_{G}),$$

where $\bm{\mu}_{F}$ is the mean vector of the distribution $F$, and $\bm{\mu}_{G}$ and $\Sigma_{G}$ are the mean vector and the covariance matrix of the distribution $G$.

Proof.

Let $T_{i}$ denote the projection on the $i$-th margin.

$$
\begin{aligned}
\mathbb{E}_{G}[\mathrm{SE}(F,\bm{Y})] &= \mathbb{E}_{G}\left[\lVert\bm{\mu}_{F}-\bm{Y}\rVert_{2}^{2}\right]\\
&= \mathbb{E}_{G}\left[\sum_{i=1}^{d}\left(\mu_{T_{i}(F)}-T_{i}(\bm{Y})\right)^{2}\right]\\
&= \sum_{i=1}^{d}\mathbb{E}_{T_{i}(G)}\left[\mathrm{SE}(T_{i}(F),Y)\right]\\
&= \sum_{i=1}^{d}\left((\mu_{T_{i}(F)}-\mu_{T_{i}(G)})^{2}+\sigma_{T_{i}(G)}^{2}\right)\\
&= \lVert\bm{\mu}_{F}-\bm{\mu}_{G}\rVert_{2}^{2}+\mathrm{tr}(\Sigma_{G})
\end{aligned}
$$
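The multivariate identity can be illustrated on a small empirical distribution, for which the expectation under $G$ is an average over observations and $\mathrm{tr}(\Sigma_{G})$ is the sum of the marginal population variances. The sketch below is illustrative only; the data, the forecast mean and the function names are hypothetical.

```python
from statistics import fmean


def squared_error(mu_f, y):
    """Squared Euclidean distance between the forecast mean vector and y."""
    return sum((m - v) ** 2 for m, v in zip(mu_f, y))


def mean_vector(sample):
    """Componentwise mean of a list of d-dimensional points."""
    return [fmean(margin) for margin in zip(*sample)]


def covariance_trace(sample):
    """Trace of the population covariance matrix, i.e. the sum of marginal variances."""
    means = mean_vector(sample)
    return sum(
        fmean((v - m) ** 2 for v in margin)
        for margin, m in zip(zip(*sample), means)
    )


# Hypothetical observations in dimension d = 3; their empirical distribution is G.
observations = [
    (0.0, 1.0, 2.0),
    (1.5, -0.5, 2.5),
    (0.5, 0.5, 1.0),
    (2.0, 1.0, 3.5),
]
# Hypothetical forecast mean vector.
mu_f = (1.0, 0.0, 2.0)

average_score = fmean(squared_error(mu_f, y) for y in observations)
closed_form = squared_error(mu_f, mean_vector(observations)) + covariance_trace(observations)
print(average_score, closed_form)
```

Only the diagonal of $\Sigma_{G}$ enters the result, which reflects the fact that the squared error is a sum of marginal scores and is therefore insensitive to the dependence structure.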

B.2 Dawid-Sebastiani score

For any $F,G\in\mathcal{P}_{2}(\mathbb{R}^{d})$, the expectation of the Dawid-Sebastiani score is:

$$\mathbb{E}_{G}[\mathrm{DSS}(F,\bm{Y})]=\log(\det\Sigma_{F})+(\bm{\mu}_{F}-\bm{\mu}_{G})^{T}\Sigma_{F}^{-1}(\bm{\mu}_{F}-\bm{\mu}_{G})+\mathrm{tr}(\Sigma_{G}\Sigma_{F}^{-1}).$$

The proof is available in the original article (Dawid and Sebastiani, 1999).

B.3 Energy score

In a general setting, the expected energy score does not simplify. For any $F,G\in\mathcal{P}_{\beta}(\mathbb{R}^{d})$, the expected energy score (13) is:

$$\mathbb{E}_{G}[\mathrm{ES}_{\beta}(F,\bm{Y})]=\mathbb{E}_{F,G}\lVert\bm{X}-\bm{Y}\rVert_{2}^{\beta}-\frac{1}{2}\mathbb{E}_{F}\lVert\bm{X}-\bm{X}'\rVert_{2}^{\beta}.$$
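Although no closed form is available in general, both expectations are straightforward to estimate from samples. The sketch below is a minimal illustration that treats a forecast ensemble and a set of observations as the empirical distributions $F$ and $G$, and replaces each expectation by an average over all pairs (including identical pairs, as implied by the empirical distribution). Function names, data and the choice $\beta=1$ are hypothetical.

```python
import math
from statistics import fmean


def mean_pairwise_distance(sample_a, sample_b, beta):
    """Average of ||a - b||_2^beta over all pairs (a, b)."""
    return fmean(math.dist(a, b) ** beta for a in sample_a for b in sample_b)


def energy_score(ensemble, y, beta=1.0):
    """Energy score of the empirical distribution of `ensemble` at observation y."""
    return (mean_pairwise_distance(ensemble, [y], beta)
            - 0.5 * mean_pairwise_distance(ensemble, ensemble, beta))


def expected_energy_score(ensemble, observations, beta=1.0):
    """Expected energy score when Y follows the empirical distribution of `observations`."""
    return (mean_pairwise_distance(ensemble, observations, beta)
            - 0.5 * mean_pairwise_distance(ensemble, ensemble, beta))


# Hypothetical two-dimensional forecast ensemble and observations.
forecast = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.5), (2.0, 1.0)]
observations = [(0.5, 0.0), (1.5, 1.0), (1.0, 2.0)]

average_score = fmean(energy_score(forecast, y) for y in observations)
expected_score = expected_energy_score(forecast, observations)
ideal_score = expected_energy_score(observations, observations)
print(average_score, expected_score, ideal_score)
```

Averaging the score over observations reproduces the two-term expression, and issuing $G$ itself as the forecast yields an expected score no larger than that of any other forecast, as expected from propriety.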

B.4 Variogram score

For any F,G𝒫(d)𝐹𝐺𝒫superscript𝑑F,G\in\mathcal{P}(\mathbb{R}^{d})italic_F , italic_G ∈ caligraphic_P ( blackboard_R start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT ) such that the 2p2𝑝2p2 italic_p-th moments of all their univariate margins are finite, the expected variogram score of order p𝑝pitalic_p (14) is :

𝔼G[VSp(F,𝒀)]=i,j=1dwij(𝔼F[|XiXj|p]22𝔼F[|XiXj|p]𝔼G[|YiYj|p]+𝔼G[|YiYj|2p]).subscript𝔼𝐺delimited-[]subscriptVS𝑝𝐹𝒀superscriptsubscript𝑖𝑗1𝑑subscript𝑤𝑖𝑗subscript𝔼𝐹superscriptdelimited-[]superscriptsubscript𝑋𝑖subscript𝑋𝑗𝑝22subscript𝔼𝐹delimited-[]superscriptsubscript𝑋𝑖subscript𝑋𝑗𝑝subscript𝔼𝐺delimited-[]superscriptsubscript𝑌𝑖subscript𝑌𝑗𝑝subscript𝔼𝐺delimited-[]superscriptsubscript𝑌𝑖subscript𝑌𝑗2𝑝\displaystyle\mathbb{E}_{G}[\mathrm{VS}_{p}(F,\bm{Y})]=\sum_{i,j=1}^{d}w_{ij}% \left(\mathbb{E}_{F}\left[|X_{i}-X_{j}|^{p}\right]^{2}-2\mathbb{E}_{F}\left[|X% _{i}-X_{j}|^{p}\right]\mathbb{E}_{G}[|Y_{i}-Y_{j}|^{p}]+\mathbb{E}_{G}[|Y_{i}-% Y_{j}|^{2p}]\right).blackboard_E start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT [ roman_VS start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_F , bold_italic_Y ) ] = ∑ start_POSTSUBSCRIPT italic_i , italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_d end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( blackboard_E start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT [ | italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT ] start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - 2 blackboard_E start_POSTSUBSCRIPT italic_F end_POSTSUBSCRIPT [ | italic_X start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_X start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT ] blackboard_E start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT [ | italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT italic_p end_POSTSUPERSCRIPT ] + blackboard_E start_POSTSUBSCRIPT italic_G end_POSTSUBSCRIPT [ | italic_Y start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_Y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT | start_POSTSUPERSCRIPT 2 italic_p end_POSTSUPERSCRIPT ] ) .
Proof.

$$\begin{aligned}
\mathbb{E}_{G}[\mathrm{VS}_{p}(F,\bm{Y})] &=\mathbb{E}_{G}\left[\sum_{i,j=1}^{d}w_{ij}\left(\mathbb{E}_{F}\left[|X_{i}-X_{j}|^{p}\right]-|Y_{i}-Y_{j}|^{p}\right)^{2}\right]\\
&=\mathbb{E}_{G}\left[\sum_{i,j=1}^{d}w_{ij}\left(\mathbb{E}_{F}\left[|X_{i}-X_{j}|^{p}\right]^{2}-2\,\mathbb{E}_{F}\left[|X_{i}-X_{j}|^{p}\right]|Y_{i}-Y_{j}|^{p}+|Y_{i}-Y_{j}|^{2p}\right)\right]\\
&=\sum_{i,j=1}^{d}w_{ij}\left(\mathbb{E}_{F}\left[|X_{i}-X_{j}|^{p}\right]^{2}-2\,\mathbb{E}_{F}\left[|X_{i}-X_{j}|^{p}\right]\mathbb{E}_{G}[|Y_{i}-Y_{j}|^{p}]+\mathbb{E}_{G}[|Y_{i}-Y_{j}|^{2p}]\right).
\end{aligned}$$
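The identity can be checked numerically when both $F$ and $G$ are empirical (ensemble) distributions, since the expectation under $G$ is then an exact average over its members. The following is a minimal sketch under that assumption; the function names `variogram_score` and `expected_variogram_score` and the choice of unit weights are illustrative and not taken from the paper.

```python
import numpy as np

def variogram_score(ens, y, w, p=0.5):
    """VS_p(F, y) for an ensemble forecast ens of shape (M, d) and observation y of shape (d,)."""
    # E_F[|X_i - X_j|^p], estimated as the ensemble mean, shape (d, d)
    ex = (np.abs(ens[:, :, None] - ens[:, None, :]) ** p).mean(axis=0)
    # |y_i - y_j|^p, shape (d, d)
    vy = np.abs(y[:, None] - y[None, :]) ** p
    return float(np.sum(w * (ex - vy) ** 2))

def expected_variogram_score(ens_f, ens_g, w, p=0.5):
    """Right-hand side of the identity, with G the empirical distribution of ens_g."""
    ex = (np.abs(ens_f[:, :, None] - ens_f[:, None, :]) ** p).mean(axis=0)
    dy = np.abs(ens_g[:, :, None] - ens_g[:, None, :])
    ey_p = (dy ** p).mean(axis=0)          # E_G[|Y_i - Y_j|^p]
    ey_2p = (dy ** (2 * p)).mean(axis=0)   # E_G[|Y_i - Y_j|^{2p}]
    return float(np.sum(w * (ex ** 2 - 2 * ex * ey_p + ey_2p)))
```

Averaging `variogram_score(ens_f, y, w)` over the members `y` of `ens_g` should agree with `expected_variogram_score(ens_f, ens_g, w)` up to floating-point error.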

B.5 Logarithmic score

For any $F,G\in\mathcal{P}(\mathbb{R}^{d})$ such that $F$ and $G$ have probability density functions that belong to $\mathcal{L}_{1}(\mathbb{R}^{d})$, the expectation of the logarithmic score is analogous to its univariate version:

$$\mathbb{E}_{G}[\mathrm{LogS}(F,\bm{Y})]=D_{\mathrm{KL}}(G||F)+\mathrm{H}(G),$$

where $D_{\mathrm{KL}}(G||F)$ is the Kullback-Leibler divergence from $F$ to $G$ and $\mathrm{H}(G)$ is the Shannon entropy of $G$:

$$\begin{aligned}
D_{\mathrm{KL}}(G||F) &=\int_{\mathbb{R}^{d}}g(\bm{y})\log\left(\frac{g(\bm{y})}{f(\bm{y})}\right)\mathrm{d}\bm{y},\\
\mathrm{H}(G) &=-\int_{\mathbb{R}^{d}}g(\bm{y})\log(g(\bm{y}))\,\mathrm{d}\bm{y}.
\end{aligned}$$
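As a numerical sanity check, the sketch below uses Gaussian distributions, for which the Kullback-Leibler divergence and the Shannon entropy have closed forms, and compares a Monte Carlo estimate of $\mathbb{E}_{G}[-\log f(\bm{Y})]$ with the sum of $D_{\mathrm{KL}}(G||F)$ and the entropy term $-\int g\log g$. It assumes the logarithmic score is $\mathrm{LogS}(F,\bm{y})=-\log f(\bm{y})$; all function names and the parameter values are illustrative.

```python
import numpy as np

def gaussian_logpdf(y, mean, cov):
    """Log-density of N(mean, cov) evaluated at the rows of y, shape (N, d)."""
    d = mean.size
    diff = y - mean
    quad = np.sum(diff * np.linalg.solve(cov, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def gaussian_kl(mean_g, cov_g, mean_f, cov_f):
    """Closed-form D_KL(G || F) for two multivariate Gaussians."""
    d = mean_g.size
    diff = mean_f - mean_g
    trace = np.trace(np.linalg.solve(cov_f, cov_g))
    quad = diff @ np.linalg.solve(cov_f, diff)
    _, logdet_f = np.linalg.slogdet(cov_f)
    _, logdet_g = np.linalg.slogdet(cov_g)
    return 0.5 * (trace + quad - d + logdet_f - logdet_g)

def gaussian_entropy(cov):
    """Closed-form Shannon entropy -int g log g of a Gaussian with covariance cov."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2 * np.pi * np.e) + logdet)

def check_log_score_decomposition(n_samples=200_000, seed=0):
    """Return (Monte Carlo expected log score, KL + entropy) for one Gaussian pair."""
    mean_f, cov_f = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
    mean_g, cov_g = np.array([0.5, -0.2]), np.array([[1.5, -0.2], [-0.2, 0.8]])
    rng = np.random.default_rng(seed)
    y = rng.multivariate_normal(mean_g, cov_g, size=n_samples)
    monte_carlo = float(-gaussian_logpdf(y, mean_f, cov_f).mean())
    closed_form = float(gaussian_kl(mean_g, cov_g, mean_f, cov_f) + gaussian_entropy(cov_g))
    return monte_carlo, closed_form
```

The two returned values should agree up to Monte Carlo error, which shrinks as `n_samples` grows.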

B.6 Hyvärinen score

For $F,G\in\mathcal{P}(\mathbb{R}^{d})$ whose probability density functions $f$ and $g$ are twice continuously differentiable and satisfy $\nabla f(x)\to 0$ and $\nabla g(x)\to 0$ as $\lVert x\rVert\to\infty$, the expectation of the Hyvärinen score is:

$$\mathbb{E}_{G}[\mathrm{HS}(F,\bm{Y})]=\int_{\mathbb{R}^{d}}\langle\nabla\log(f(y))-2\nabla\log(g(y)),\nabla\log(f(y))\rangle\, g(y)\,\mathrm{d}y$$

where $\nabla$ is the gradient operator and $\langle\cdot,\cdot\rangle$ is the scalar product. The proof is similar to the proof for the univariate case, using integration by parts and Stokes' theorem (Parry et al., 2012).

B.7 Quadratic score

For any $F,G\in\mathcal{L}_{2}(\mathbb{R}^{d})$, the expectation of the quadratic score is analogous to its univariate version:

$$\mathbb{E}_{G}[\mathrm{QuadS}(F,\bm{Y})]=\lVert f\rVert_{2}^{2}-2\langle f,g\rangle,$$

where $\langle f,g\rangle=\int_{\mathbb{R}^{d}}f(\bm{y})g(\bm{y})\,\mathrm{d}\bm{y}$.

B.8 Pseudospherical score

For any $F,G\in\mathcal{L}_{\alpha}(\mathbb{R}^{d})$, the expectation of the pseudospherical score is analogous to its univariate version:

$$\mathbb{E}_{G}[\mathrm{PseudoS}(F,\bm{Y})]=-\frac{\langle f^{\alpha-1},g\rangle}{\lVert f\rVert_{\alpha}^{\alpha-1}},$$

where $\langle f^{\alpha-1},g\rangle=\int_{\mathbb{R}^{d}}f(\bm{y})^{\alpha-1}g(\bm{y})\,\mathrm{d}\bm{y}$.

Appendix C Proofs

C.1 Proposition 1

Proof of Proposition 1.

Let $\mathcal{F}\subset\mathcal{P}(\mathbb{R}^{d})$ be a class of Borel probability measures on $\mathbb{R}^{d}$, and let $F\in\mathcal{F}$ be a forecast and $\bm{y}\in\mathbb{R}^{d}$ an observation. Let $T:\mathbb{R}^{d}\to\mathbb{R}^{k}$ be a transformation and let $\mathrm{S}$ be a scoring rule on $\mathbb{R}^{k}$ that is proper relative to $T(\mathcal{F})=\{\mathcal{L}(T(\bm{X})),\ \bm{X}\sim F\in\mathcal{F}\}$. Then, for any $G\in\mathcal{F}$,

$$\begin{aligned}
\mathbb{E}_{G}\left[\mathrm{S}_{T}(F,\bm{Y})\right] &=\mathbb{E}_{G}\left[\mathrm{S}(T(F),T(\bm{Y}))\right]\\
&=\mathbb{E}_{T(G)}\left[\mathrm{S}(T(F),\bm{Y})\right].
\end{aligned}$$

Given that $T(F),T(G)\in T(\mathcal{F})$ and $\mathrm{S}$ is proper relative to $T(\mathcal{F})$,

$$\begin{aligned}
&\mathbb{E}_{T(G)}\left[\mathrm{S}(T(G),\bm{Y})\right]\leq\mathbb{E}_{T(G)}\left[\mathrm{S}(T(F),\bm{Y})\right]\\
\Leftrightarrow\ \ &\mathbb{E}_{G}\left[\mathrm{S}_{T}(G,\bm{Y})\right]\leq\mathbb{E}_{G}\left[\mathrm{S}_{T}(F,\bm{Y})\right].
\end{aligned}\tag{23}$$

Proof of the strict propriety case in Proposition 1.

The notation is the same as in the proof above, except for the following. Let $T:\mathbb{R}^{d}\to\mathbb{R}^{k}$ be an injective transformation and let $\mathrm{S}$ be a scoring rule on $\mathbb{R}^{k}$ that is strictly proper relative to $T(\mathcal{F})=\{\mathcal{L}(T(\bm{X})),\ \bm{X}\sim F\in\mathcal{F}\}$.

Equality in Equation (23) leads to:

$$\begin{aligned}
&\mathbb{E}_{G}\left[\mathrm{S}_{T}(G,\bm{Y})\right]=\mathbb{E}_{G}\left[\mathrm{S}_{T}(F,\bm{Y})\right]\\
\Leftrightarrow\ \ &\mathbb{E}_{G}\left[\mathrm{S}(T(G),T(\bm{Y}))\right]=\mathbb{E}_{G}\left[\mathrm{S}(T(F),T(\bm{Y}))\right]\\
\Leftrightarrow\ \ &\mathbb{E}_{T(G)}\left[\mathrm{S}(T(G),\bm{Y})\right]=\mathbb{E}_{T(G)}\left[\mathrm{S}(T(F),\bm{Y})\right]
\end{aligned}$$

The fact that $\mathrm{S}$ is strictly proper relative to $T(\mathcal{F})$ leads to $T(F)=T(G)$, and finally, since $T$ is injective, we have $F=G$. ∎
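Proposition 1 can be illustrated with ensembles. The sketch below assumes the base scoring rule $\mathrm{S}$ is the CRPS of an empirical distribution and the transformation $T$ is the componentwise mean, so that $k=1$; both choices, as well as the function names, are illustrative and not prescribed by the proof. Because $G$ is taken to be an empirical distribution, expectations under $G$ are exact averages over its members.

```python
import numpy as np

def crps_ensemble(ens, y):
    """CRPS of the empirical distribution of a 1-D ensemble at observation y."""
    ens = np.asarray(ens, dtype=float)
    spread = np.abs(ens[:, None] - ens[None, :]).mean()
    return np.abs(ens - y).mean() - 0.5 * spread

def transformed_score(score, transform, ens, y):
    """S_T(F, y) = S(T(F), T(y)) for an ensemble forecast F on R^d."""
    # T(F) is the distribution of T(X), X ~ F: transform each member
    return score(np.array([transform(x) for x in ens]), transform(y))

def expected_score(score, transform, ens_f, ens_g):
    """E_G[S_T(F, Y)] when G is the empirical distribution of ens_g."""
    return float(np.mean([transformed_score(score, transform, ens_f, y) for y in ens_g]))
```

With these helpers, `expected_score(crps_ensemble, np.mean, ens_g, ens_g)` should never exceed `expected_score(crps_ensemble, np.mean, ens_f, ens_g)`, which is Equation (23). Since the mean is not injective, distinct forecasts can attain the same expected score: reversing the order of the components of every member of `ens_g` leaves the transformed ensemble unchanged. This is why injectivity of $T$ is required for strict propriety.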

C.2 Proposition 3

Proof of Proposition 3.

The proof relies on the reproducing kernel Hilbert space (RKHS) representation of the kernel scoring rule $\mathrm{S}_{\rho}$. For background on kernel scoring rules, maximum mean discrepancies and RKHS, we refer to Smola et al. (2007) or Steinwart and Christmann (2008, Section 4).

Let $\mathcal{H}_{\rho}$ denote the RKHS associated with $\rho$. We recall that $\mathcal{H}_{\rho}$ contains all the functions $\rho(\bm{x},\cdot)$ and that the inner product on $\mathcal{H}_{\rho}$ satisfies the property

$$\langle\rho(\bm{x}_{1},\cdot),\rho(\bm{x}_{2},\cdot)\rangle_{\mathcal{H}_{\rho}}=\rho(\bm{x}_{1},\bm{x}_{2}).$$

The kernel mean embedding is a linear map $\Psi_{\rho}:\mathcal{P}_{\rho}\to\mathcal{H}_{\rho}$ mapping an admissible distribution $F\in\mathcal{P}_{\rho}$ into a function $\Psi_{\rho}(F)$ in the RKHS and such that the image of the point measure $\delta_{\bm{x}}$ is $\rho(\bm{x},\cdot)$. Equation (16) giving the kernel scoring rule for an ensemble prediction $F=\frac{1}{M}\sum_{m=1}^{M}\delta_{\bm{x}_{m}}$ can be written as

$$\begin{aligned}
\mathrm{S}_{\rho}(F,\bm{y}) &=\frac{1}{2}\langle\Psi_{\rho}(F)-\Psi_{\rho}(\delta_{\bm{y}}),\Psi_{\rho}(F)-\Psi_{\rho}(\delta_{\bm{y}})\rangle_{\mathcal{H}_{\rho}}\\
&=\frac{1}{2}\|\Psi_{\rho}(F-\delta_{\bm{y}})\|_{\mathcal{H}_{\rho}}^{2}.
\end{aligned}$$

The properties of the kernel mean embedding ensure that this relation still holds for all $F\in\mathcal{P}_{\rho}$. As a consequence, if $(T_{l})_{l\geq 1}$ is a Hilbert basis of $\mathcal{H}_{\rho}$, we have

$$\begin{aligned}
\mathrm{S}_{\rho}(F,\bm{y}) &=\frac{1}{2}\|\Psi_{\rho}(F-\delta_{\bm{y}})\|_{\mathcal{H}_{\rho}}^{2}\\
&=\frac{1}{2}\sum_{l\geq 1}\langle\Psi_{\rho}(F-\delta_{\bm{y}}),T_{l}\rangle_{\mathcal{H}_{\rho}}^{2}.
\end{aligned}$$

Finally, the properties of the kernel mean embedding ensure that, for all $T\in\mathcal{H}_{\rho}$,

$$\langle\Psi_{\rho}(F-\delta_{\bm{y}}),T\rangle_{\mathcal{H}_{\rho}}=\int_{\mathbb{R}^{d}}T(\bm{x})(F-\delta_{\bm{y}})(\mathrm{d}\bm{x})=\mathbb{E}_{F}[T(\bm{X})]-T(\bm{y}),$$

whence the result follows. ∎

C.3 Proof of examples illustrating Proposition 3

Next, we illustrate Proposition 3 and provide some computations in two cases: the Gaussian kernel scoring rule and the continuous ranked probability score (CRPS).

Gaussian Kernel Scoring Rule. This is the scoring rule related to the Gaussian kernel

$$\rho(x_{1},x_{2})=\exp(-(x_{1}-x_{2})^{2}/2),\quad x_{1},x_{2}\in\mathbb{R}.$$

Using a series expansion of the exponential function, we have

$$\rho(x_{1},x_{2})=\mathrm{e}^{-x_{1}^{2}/2}\mathrm{e}^{-x_{2}^{2}/2}\sum_{l\geq 0}\frac{(x_{1}x_{2})^{l}}{l!}=\sum_{l\geq 0}T_{l}(x_{1})T_{l}(x_{2})$$

with $T_{l}$ the transformation defined, for $l\geq 0$, by

$$T_{l}(x)=\frac{1}{\sqrt{l!}}\mathrm{e}^{-x^{2}/2}x^{l}.$$

As a consequence, the Gaussian kernel scoring rule can be written, for all $F\in\mathcal{P}(\mathbb{R})$ and $y\in\mathbb{R}$, as

$$\begin{aligned}
\mathrm{S}_{\rho}(F,y) &=\frac{1}{2}\int_{\mathbb{R}\times\mathbb{R}}\rho(x_{1},x_{2})(F-\delta_{y})(\mathrm{d}x_{1})(F-\delta_{y})(\mathrm{d}x_{2})\\
&=\frac{1}{2}\int_{\mathbb{R}\times\mathbb{R}}\Big(\sum_{l\geq 0}T_{l}(x_{1})T_{l}(x_{2})\Big)(F-\delta_{y})(\mathrm{d}x_{1})(F-\delta_{y})(\mathrm{d}x_{2})\\
&=\frac{1}{2}\sum_{l\geq 0}\Big(\int_{\mathbb{R}}T_{l}(x)(F-\delta_{y})(\mathrm{d}x)\Big)^{2}\\
&=\frac{1}{2}\sum_{l\geq 0}\Big(\mathbb{E}_{F}[T_{l}(X)]-T_{l}(y)\Big)^{2}.
\end{aligned}$$
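The series representation can be checked numerically for an ensemble forecast by truncating the sum over $l$. The sketch below is a minimal illustration under the assumption that the ensemble members and the observation lie in a bounded range, so that a moderate number of terms suffices; the function names and the default truncation level `n_terms` are illustrative choices.

```python
import math
import numpy as np

def gaussian_kernel(x1, x2):
    """rho(x1, x2) = exp(-(x1 - x2)^2 / 2)."""
    return np.exp(-0.5 * (x1 - x2) ** 2)

def kernel_score(ens, y):
    """S_rho(F, y) as one half of the double integral of rho against (F - delta_y)."""
    ens = np.asarray(ens, dtype=float)
    e_xx = gaussian_kernel(ens[:, None], ens[None, :]).mean()  # E rho(X, X')
    e_xy = gaussian_kernel(ens, y).mean()                      # E rho(X, y)
    return 0.5 * (e_xx - 2.0 * e_xy + gaussian_kernel(y, y))

def feature(l, x):
    """T_l(x) = exp(-x^2 / 2) x^l / sqrt(l!)."""
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * x ** 2) * x ** l / math.sqrt(math.factorial(l))

def series_score(ens, y, n_terms=60):
    """Truncated series 1/2 * sum_l (E_F[T_l(X)] - T_l(y))^2."""
    ens = np.asarray(ens, dtype=float)
    return 0.5 * sum((feature(l, ens).mean() - feature(l, y)) ** 2
                     for l in range(n_terms))
```

For members and observations of moderate magnitude, `kernel_score(ens, y)` and `series_score(ens, y)` should agree closely, since the neglected terms decay like $(x_{1}x_{2})^{l}/l!$.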

Continuous Ranked Probability Score. The CRPS is the scoring rule with kernel $\rho(x_{1},x_{2})=|x_{1}|+|x_{2}|-|x_{1}-x_{2}|$. This kernel is the covariance of the Brownian motion on $\mathbb{R}$ and its RKHS is known to be the Sobolev space $\mathrm{H}^{1}=\mathrm{H}^{1}(\mathbb{R})$, see Berlinet and Thomas-Agnan (2004). We recall the definition of the Sobolev space

$$\mathrm{H}^{1}=\left\{f\in\mathcal{C}(\mathbb{R},\mathbb{R})\colon f(0)=0,\ \dot{f}\in L^{2}(\mathbb{R})\right\},$$

where $\dot{f}$ denotes the derivative of $f$, assumed to be defined almost everywhere and square-integrable. The inner product on $\mathrm{H}^{1}$ is defined by

$$\langle f_{1},f_{2}\rangle_{H^{1}}=\int_{\mathbb{R}}\dot{f}_{1}(x)\dot{f}_{2}(x)\,\mathrm{d}x$$

and one can easily check the fundamental relation

$$\langle\rho(x_{1},\cdot),\rho(x_{2},\cdot)\rangle_{H^{1}}=\int_{\mathbb{R}}\dot{\rho}(x_{1},x)\dot{\rho}(x_{2},x)\,\mathrm{d}x=\rho(x_{1},x_{2}).$$

Here the derivative $\dot{\rho}(x_{1},x)=\mathds{1}_{[0,x_{1}]}(x)$ is taken with respect to the second variable $x$. Then, we consider the Haar system defined as the collection of functions

$$H^{0}_{l}(x)=H^{0}(x-l)\quad\mbox{and}\quad H^{1}_{l,m}(x)=2^{m/2}H^{1}(2^{m}x-l),\quad l\in\mathbb{Z},\,m\geq 0,$$

with $H^{0}(x)=\mathds{1}_{[0,1)}(x)$ and $H^{1}(x)=\mathds{1}_{[0,1/2)}(x)-\mathds{1}_{[1/2,1)}(x)$. Since the Haar system is an orthonormal basis of the space $L^{2}(\mathbb{R})$ and the map $f\in H^{1}\mapsto\dot{f}\in L^{2}$ is an isomorphism between Hilbert spaces, we obtain an orthonormal basis of $\mathrm{H}^{1}(\mathbb{R})$ by considering the primitives vanishing at $0$ of the Haar basis functions.
Setting $T^{0}(x)=x\mathds{1}_{[0,1)}(x)+\mathds{1}_{[1,+\infty)}(x)$ and $T^{1}(x)=\big(1/2-|x-1/2|\big)\mathds{1}_{[0,1]}(x)$ the primitive functions of $H^{0}$ and $H^{1}$ respectively, we obtain the system

$$T^{0}_{l}(x)=T^{0}(x-l),\quad T^{1}_{l,m}(x)=2^{-m/2}T^{1}(2^{m}x-l),\quad l\in\mathbb{Z},\,m\geq 0.$$

The series representation of the CRPS is then deduced from Proposition 3 and its proof, since the collection $\{T_{l,m}\colon l\in\mathbb{Z},m\geq 0\}$ is an orthonormal basis of the RKHS associated with the kernel $\rho$ of the CRPS.
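The construction above can be checked numerically. The following is a minimal sketch, assuming a dyadic grid on $[-4,4)$ and a small hand-picked subset of indices $(l,m)$ (both are illustrative choices, not taken from the paper): it verifies that the selected Haar functions are orthonormal in $L^{2}$ and that $T^{1}_{l,m}$ is the primitive of $H^{1}_{l,m}$.

```python
import numpy as np

def H0(x):
    # Indicator of [0, 1)
    return ((x >= 0) & (x < 1)).astype(float)

def H1(x):
    # Haar mother wavelet: +1 on [0, 1/2), -1 on [1/2, 1)
    return ((x >= 0) & (x < 0.5)).astype(float) - ((x >= 0.5) & (x < 1)).astype(float)

def T1(x):
    # Primitive of H1: tent function on [0, 1]
    return np.where((x >= 0) & (x <= 1), 0.5 - np.abs(x - 0.5), 0.0)

def H1_lm(x, l, m):
    return 2.0 ** (m / 2) * H1(2.0 ** m * x - l)

def T1_lm(x, l, m):
    return 2.0 ** (-m / 2) * T1(2.0 ** m * x - l)

# Dyadic grid (assumption): every jump of the selected functions falls on a grid
# point, so left Riemann sums of these piecewise-constant functions are exact.
dx = 2.0 ** -10
x = np.arange(-4.0, 4.0, dx)

# Gram matrix of a few Haar functions for the L2 inner product
haar = [H0(x), H0(x - 1.0), H1_lm(x, 0, 0), H1_lm(x, 0, 1), H1_lm(x, 1, 1), H1_lm(x, -3, 2)]
gram = np.array([[np.sum(f * g) * dx for g in haar] for f in haar])

def primitive_error(l, m):
    # Compare the running integral of H1_lm with the claimed primitive T1_lm
    h = H1_lm(x, l, m)
    prim = np.concatenate(([0.0], np.cumsum(h[:-1]) * dx))
    return float(np.max(np.abs(prim - T1_lm(x, l, m))))

max_err = max(primitive_error(l, m) for (l, m) in [(0, 0), (0, 1), (1, 1), (-3, 2)])
```

Under these assumptions `gram` should be the identity matrix up to rounding and `max_err` should be negligible, which is the discrete counterpart of the isomorphism $f\mapsto\dot{f}$ used in the argument.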

Appendix D General form of Corollary 1

Corollary 2.

Let $\mathcal{T}=\{T_{i}\}_{1\leq i\leq m}$ be a set of transformations from $\mathbb{R}^{d}$ to $\mathbb{R}^{k}$. Let $\mathcal{S}=\{\mathrm{S}_{i}\}_{1\leq i\leq m}$ be a set of proper scoring rules such that $\mathrm{S}_{i}$ is proper relative to $T_{i}(\mathcal{F})$, for all $1\leq i\leq m$. Let $\bm{w}=\{w_{i}\}_{1\leq i\leq m}$ be nonnegative weights. Then the scoring rule

$$\mathrm{S}_{\mathcal{S}_{\mathcal{T}},\bm{w}}(F,\bm{y})=\sum_{i=1}^{m}w_{i}{\mathrm{S}_{i}}_{T_{i}}(F,\bm{y})=\sum_{i=1}^{m}w_{i}\mathrm{S}_{i}(T_{i}(F),T_{i}(\bm{y}))$$

is proper relative to $\mathcal{F}$.
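The aggregation in Corollary 2 can be sketched in code. This is an illustrative sketch only, under assumptions that are not part of the corollary: the forecast $F$ is represented by a finite ensemble, every transformation is scalar-valued ($k=1$), and the component scores are the ensemble CRPS estimator and the squared error of the ensemble mean. The function names are hypothetical.

```python
import numpy as np

def crps_ensemble(members, y):
    """CRPS of the empirical distribution of a scalar ensemble."""
    members = np.asarray(members, dtype=float)
    term1 = np.mean(np.abs(members - y))
    term2 = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term1 - term2

def squared_error(members, y):
    """Squared error between the ensemble mean and the observation."""
    members = np.asarray(members, dtype=float)
    return (np.mean(members) - y) ** 2

def aggregated_score(ensemble, y, transforms, scores, weights):
    """Weighted sum of S_i(T_i(F), T_i(y)), with F given as an ensemble."""
    total = 0.0
    for T, S, w in zip(transforms, scores, weights):
        if w < 0:
            raise ValueError("weights must be nonnegative")
        # T_i(F): apply the transformation to each ensemble member
        transformed = np.array([T(member) for member in ensemble])
        total += w * S(transformed, T(y))
    return total

# Usage: 50 members in dimension d = 4, scored on the mean and on the maximum
rng = np.random.default_rng(0)
ensemble = rng.normal(size=(50, 4))
y = rng.normal(size=4)
score = aggregated_score(ensemble, y, [np.mean, np.max],
                         [crps_ensemble, squared_error], [0.5, 0.5])
```

Each pair $(T_i,\mathrm{S}_i)$ targets one feature of the forecast (here the spatial mean and the maximum), and the weights control their relative contribution.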

Appendix E Scoring rules of the simulation study

The following formulas are deduced for a probabilistic forecast $F$ taking the form of the Gaussian random field model of Equation (20). The formulas of the aggregated univariate scoring rules can be obtained from the formulas in Gneiting and Raftery (2007) and Jordan et al. (2019) and, thus, are not presented here. We focus on the expressions of the variogram score, the p-variation score and the CRPS of spatial mean.

Variogram Score

$$\mathrm{VS}_{p}(F,\bm{y})=\sum_{s,s'\in\mathcal{D}}w_{ss'}\left(\mathbb{E}_{F}[|X_{s}-X_{s'}|^{p}]-|y_{s}-y_{s'}|^{p}\right)^{2}$$

For $X\sim\mathcal{N}(\mu,\sigma^{2})$, the absolute moment is (Winkelbauer, 2014):

$$\mathbb{E}[|X|^{\nu}]=\sigma^{\nu}2^{\nu/2}\frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\pi}}\,{}_{1}F_{1}\left(-\nu/2,1/2;-\frac{\mu^{2}}{2\sigma^{2}}\right),\tag{24}$$

where ${}_{1}F_{1}$ is the confluent hypergeometric function of the first kind. For $X\sim F$,

$$\begin{aligned}
X_{s}-X_{s'}&\sim\mathcal{N}\left(\mu_{s}-\mu_{s'},\ \sigma_{s}^{2}+\sigma_{s'}^{2}-2\mathrm{cov}(F_{s},F_{s'})\right)\\
&\sim\mathcal{N}\left(0,\ 2\sigma^{2}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)\right).
\end{aligned}$$

This leads to

$$\begin{aligned}
\mathbb{E}_{F}[|X_{s}-X_{s'}|^{p}]&=\left(2\sigma^{2}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)\right)^{p/2}2^{p/2}\frac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}\,{}_{1}F_{1}\left(-p/2,1/2;-\frac{(\mu_{s}-\mu_{s'})^{2}}{4\sigma^{2}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)}\right)\\
&=2^{p}\sigma^{p}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)^{p/2}\frac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}\,{}_{1}F_{1}\left(-p/2,1/2;0\right)\\
&=2^{p}\sigma^{p}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)^{p/2}\frac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}
\end{aligned}$$
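As a quick sanity check of this closed form (a worked special case added for illustration, not part of the original derivation), write $v=2\sigma^{2}\big(1-e^{-(\lVert s-s'\rVert/\lambda)^{\beta}}\big)$ for the variance of the centered Gaussian increment $Z=X_{s}-X_{s'}$. Using $\Gamma(1)=1$ and $\Gamma(3/2)=\sqrt{\pi}/2$, the cases $p=1$ and $p=2$ recover the familiar moments of a centered Gaussian:

```latex
\begin{aligned}
p=2:\quad & 2^{2}\sigma^{2}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)\frac{\Gamma(3/2)}{\sqrt{\pi}}
  = 2\sigma^{2}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right) = v = \mathbb{E}[Z^{2}],\\
p=1:\quad & 2\sigma\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)^{1/2}\frac{\Gamma(1)}{\sqrt{\pi}}
  = \sqrt{\frac{2v}{\pi}} = \mathbb{E}[|Z|].
\end{aligned}
```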

Finally,

$$\begin{aligned}
\mathrm{VS}_{p}(F,\bm{y})&=\sum_{s,s'\in\mathcal{D}}w_{ss'}\left(\mathbb{E}_{F}[|X_{s}-X_{s'}|^{p}]-|y_{s}-y_{s'}|^{p}\right)^{2}\\
&=\sum_{s,s'\in\mathcal{D}}w_{ss'}\Biggl(\left(2\sigma^{2}\left(1-e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)\right)^{p/2}2^{p/2}\frac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}-|y_{s}-y_{s'}|^{p}\Biggr)^{2}
\end{aligned}$$
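The final expression translates directly into code. The sketch below assumes that grid points are given as coordinate rows, that the mean is constant over the domain (so the hypergeometric factor equals 1, as in the derivation), and that unit weights are used when none are supplied; the function names and the default weights are illustrative assumptions.

```python
import math
import numpy as np

def expected_abs_increment(dist, p, sigma, lam, beta):
    """Closed form of E|X_s - X_s'|^p for a centered Gaussian increment."""
    var = 2.0 * sigma ** 2 * (1.0 - np.exp(-((dist / lam) ** beta)))
    const = 2.0 ** (p / 2) * math.gamma((p + 1) / 2) / math.sqrt(math.pi)
    return var ** (p / 2) * const

def variogram_score(coords, y, p, sigma, lam, beta, weights=None):
    """Variogram score of order p for the Gaussian random field forecast."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise Euclidean distances between grid points
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    if weights is None:
        weights = np.ones_like(dist)  # assumption: unit weights by default
    expected = expected_abs_increment(dist, p, sigma, lam, beta)
    observed = np.abs(y[:, None] - y[None, :]) ** p
    return float(np.sum(weights * (expected - observed) ** 2))

# Usage on a two-point domain
vs = variogram_score([[0.0, 0.0], [1.0, 0.0]], [0.0, 1.0],
                     p=2, sigma=1.0, lam=1.0, beta=1.0)
```

The sum runs over all ordered pairs, so each unordered pair is counted twice and diagonal terms contribute zero.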

p-Variation Score

$$\begin{aligned}
\mathrm{pVS}(F,\bm{y})&=\sum_{\bm{s}\in\mathcal{D}^{\ast}}w_{\bm{s}}\mathrm{SE}_{T_{p\text{-var},\bm{s}}}(F,\bm{y})\\
&=\sum_{\bm{s}\in\mathcal{D}^{\ast}}w_{\bm{s}}\left(\mathbb{E}_{F}[T_{p\text{-var},\bm{s}}(\bm{X})]-T_{p\text{-var},\bm{s}}(\bm{y})\right)^{2}.
\end{aligned}$$

Denote $Z=\bm{X}_{\bm{s}+(1,1)}-\bm{X}_{\bm{s}+(1,0)}-\bm{X}_{\bm{s}+(0,1)}+\bm{X}_{\bm{s}}$. For $X\sim F$, we have $Z\sim\mathcal{N}(\mu_{Z},\sigma_{Z}^{2})$ with

$$\mu_{Z}=\mu_{\bm{s}+(1,1)}-\mu_{\bm{s}+(1,0)}-\mu_{\bm{s}+(0,1)}+\mu_{\bm{s}}=0$$

and

$$\begin{aligned}
\sigma_{Z}^{2}&=\sigma_{\bm{s}+(1,1)}^{2}+\sigma_{\bm{s}+(1,0)}^{2}+\sigma_{\bm{s}+(0,1)}^{2}+\sigma_{\bm{s}}^{2}\\
&\quad-2\mathrm{cov}(F(\bm{s}+(1,1)),F(\bm{s}+(1,0)))-2\mathrm{cov}(F(\bm{s}+(1,1)),F(\bm{s}+(0,1)))+2\mathrm{cov}(F(\bm{s}+(1,1)),F(\bm{s}))\\
&\quad+2\mathrm{cov}(F(\bm{s}+(1,0)),F(\bm{s}+(0,1)))-2\mathrm{cov}(F(\bm{s}+(1,0)),F(\bm{s}))\\
&\quad-2\mathrm{cov}(F(\bm{s}+(0,1)),F(\bm{s}))\\
&=4\sigma^{2}\left(1+e^{-(\sqrt{2}/\lambda)^{\beta}}-2e^{-(1/\lambda)^{\beta}}\right)
\end{aligned}$$
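The variance of $Z$ can be cross-checked by writing $Z=\bm{a}^{\top}\bm{X}$ with contrast vector $\bm{a}=(1,-1,-1,1)$ over the four corners of a unit grid cell, so that $\sigma_{Z}^{2}=\bm{a}^{\top}\Sigma\bm{a}$. The sketch below assumes the covariance $\sigma^{2}e^{-(\lVert s-s'\rVert/\lambda)^{\beta}}$ implied by the derivation above and unit grid spacing; the function names are hypothetical.

```python
import numpy as np

def exp_cov(coords, sigma, lam, beta):
    """Covariance matrix sigma^2 * exp(-(||s - s'|| / lam)^beta)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sigma ** 2 * np.exp(-((dist / lam) ** beta))

def contrast_variance(sigma, lam, beta):
    """Variance of Z computed as a^T Sigma a on one unit grid cell."""
    # Corners ordered as s+(1,1), s+(1,0), s+(0,1), s
    coords = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
    a = np.array([1.0, -1.0, -1.0, 1.0])
    return float(a @ exp_cov(coords, sigma, lam, beta) @ a)

def closed_form_variance(sigma, lam, beta):
    """Closed form 4 sigma^2 (1 + exp(-(sqrt(2)/lam)^beta) - 2 exp(-(1/lam)^beta))."""
    return float(4.0 * sigma ** 2 * (1.0 + np.exp(-((np.sqrt(2.0) / lam) ** beta))
                                     - 2.0 * np.exp(-((1.0 / lam) ** beta))))
```

The two functions should agree for any admissible parameters: the four unit-distance pairs each contribute $-2\sigma^{2}e^{-(1/\lambda)^{\beta}}$ and the two diagonal pairs each contribute $+2\sigma^{2}e^{-(\sqrt{2}/\lambda)^{\beta}}$.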

Using (24), this leads to

$$\begin{aligned}
\mathbb{E}_{F}[T_{p\text{-var},\bm{s}}(\bm{X})]&=\left(4\sigma^{2}\left(1+e^{-(\sqrt{2}/\lambda)^{\beta}}-2e^{-(1/\lambda)^{\beta}}\right)\right)^{p/2}2^{p/2}\frac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}\,{}_{1}F_{1}\left(-p/2,1/2;0\right)\\
&=\left(4\sigma^{2}\left(1+e^{-(\sqrt{2}/\lambda)^{\beta}}-2e^{-(1/\lambda)^{\beta}}\right)\right)^{p/2}2^{p/2}\frac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}
\end{aligned}$$

Finally,

$$\begin{aligned}
\mathrm{pVS}(F,\bm{y})&=\sum_{\bm{s}\in\mathcal{D}^{\ast}}w_{\bm{s}}\mathrm{SE}_{T_{p\text{-var},\bm{s}}}(F,\bm{y})\\
&=\sum_{\bm{s}\in\mathcal{D}^{\ast}}w_{\bm{s}}\Biggl(\left(4\sigma^{2}\left(1+e^{-(\sqrt{2}/\lambda)^{\beta}}-2e^{-(1/\lambda)^{\beta}}\right)\right)^{p/2}2^{p/2}\frac{\Gamma\left(\frac{p+1}{2}\right)}{\sqrt{\pi}}-|y_{\bm{s}+(1,1)}-y_{\bm{s}+(1,0)}-y_{\bm{s}+(0,1)}+y_{\bm{s}}|^{p}\Biggr)^{2}
\end{aligned}$$

CRPS of spatial mean

The CRPS of spatial mean is defined as

$$\begin{aligned}
\mathrm{CRPS}_{\mathrm{mean}_{\mathcal{P}},\bm{w_{\mathcal{P}}}}(F,\bm{y})&=\sum_{P\in\mathcal{P}}w_{P}\mathrm{CRPS}_{\mathrm{mean}_{P}}(F,\bm{y})\\
&=\sum_{P\in\mathcal{P}}w_{P}\mathrm{CRPS}(\mathrm{mean}_{P}(F),\mathrm{mean}_{P}(\bm{y})),
\end{aligned}$$

where $\mathcal{P}$ is an ensemble of spatial patches and $w_{P}$ is the weight associated with a patch $P\in\mathcal{P}$. Since $F$ is a Gaussian random field, the spatial mean over a patch follows a Gaussian distribution:

$$\mathrm{mean}_{P}(F)\sim\mathcal{N}\left(\frac{1}{|P|}\sum_{s\in P}\mu_{s},\ \frac{\sigma^{2}}{|P|^{2}}\sum_{s,s'\in P}e^{-\left(\frac{\lVert s-s'\rVert}{\lambda}\right)^{\beta}}\right)=\mathcal{N}(\mu_{P},\sigma_{P}^{2}),$$

where $|P|$ is the cardinality of the patch $P$ (i.e., the number of grid points belonging to $P$).

Finally,

$$\mathrm{CRPS}_{\mathrm{mean}_{\mathcal{P}},\bm{w_{\mathcal{P}}}}(F,\bm{y})=\sum_{P\in\mathcal{P}}w_{P}\mathrm{CRPS}(\mathcal{N}(\mu_{P},\sigma_{P}^{2}),\mathrm{mean}_{P}(\bm{y})).$$
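A minimal sketch of this score is given below. It relies on the standard closed form of the CRPS for a Gaussian forecast (Gneiting and Raftery, 2007), takes $\mu_{P}$ as the average of the $\mu_{s}$ over the patch, and represents each patch as a list of grid-point indices; the function names and this data layout are assumptions made for illustration.

```python
import math
import numpy as np

def crps_gaussian(mu, sd, y):
    """Closed-form CRPS of N(mu, sd^2) evaluated at the observation y."""
    z = (y - mu) / sd
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sd * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

def patch_mean_distribution(coords, mu, sigma, lam, beta):
    """Mean and standard deviation of the spatial mean over one patch."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Averaging over the |P|^2 entries gives (1/|P|^2) * sum over pairs
    var_p = sigma ** 2 * np.mean(np.exp(-((dist / lam) ** beta)))
    return float(np.mean(mu)), math.sqrt(var_p)

def crps_spatial_mean(patches, weights, coords, mu, y, sigma, lam, beta):
    """Weighted sum over patches of the CRPS of the patch mean."""
    coords, mu, y = (np.asarray(a, dtype=float) for a in (coords, mu, y))
    total = 0.0
    for idx, w in zip(patches, weights):
        mu_p, sd_p = patch_mean_distribution(coords[idx], mu[idx], sigma, lam, beta)
        total += w * crps_gaussian(mu_p, sd_p, float(np.mean(y[idx])))
    return total

# Usage: a single patch containing one grid point reduces to the marginal CRPS
single = crps_spatial_mean([[0]], [1.0], [[0.0, 0.0]], [0.0], [0.0],
                           sigma=1.0, lam=1.0, beta=1.0)
```

For a one-point patch, $\sigma_{P}^{2}=\sigma^{2}$ and the score coincides with the univariate Gaussian CRPS at that grid point.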