Forecasting Electricity Market Signals via Generative AI

(This work was supported in part by the National Science Foundation under Award EECS 2218110.)

Xinyi Wang, Qing Zhao, Lang Tong

Abstract

This paper presents a generative artificial intelligence approach to probabilistic forecasting of electricity market signals, such as real-time locational marginal prices and area control error signals. Inspired by the Wiener-Kallianpur innovation representation of nonparametric time series, we propose a weak innovation autoencoder architecture and a novel deep learning algorithm that extracts the canonical independent and identically distributed innovation sequence of the time series, from which samples of future time series are generated. The validity of the proposed approach is established by proving that, under ideal training conditions, the generated samples have the same conditional probability distribution as that of the ground truth. Three applications involving highly dynamic and volatile time series in real-time market operations are considered: (i) locational marginal price forecasting for self-scheduled resources such as battery storage participants, (ii) interregional price spread forecasting for virtual bidders in interchange markets, and (iii) area control error forecasting for frequency regulations. Numerical studies based on market data from multiple independent system operators demonstrate the superior performance of the proposed generative forecaster over leading classical and modern machine learning techniques under both probabilistic and point forecasting metrics.

Keywords: Probabilistic forecasting, electricity price forecasting, representation learning, generative artificial intelligence, energy and ancillary market forecasting, risk-sensitive market operations.

Journal: International Journal of Forecasting

Affiliation: School of Electrical and Computer Engineering, Cornell University, Ithaca, New York 14853, USA

1 Introduction

A defining feature of generative artificial intelligence (AI) is its ability to produce artificial samples that resemble reality. In particular, generative AI learns the underlying structure of a phenomenon from examples and then produces an arbitrarily large number of artificial samples that share the properties exhibited in the examples. Generative AI with sophisticated neural network structures and machine learning methods has achieved remarkable performance in real-world applications unmatched by conventional techniques [1].

The classical subject of forecasting has a natural connection with generative AI through probabilistic forecasting. Given past observations, probabilistic forecasting aims to obtain the conditional probability distribution of the future time series. Once such a distribution is obtained, Monte Carlo samples of the future time series can be generated. However, forecasting the conditional probability distribution faces daunting computational and sample complexities. First, nonparametric distribution forecasting is an infinite-dimensional functional estimation problem, typically requiring a finite-dimensional reduction through a histogram with a finite number of bins or quantiles with finite levels. Second, for continuously distributed random time series, there is only one realization of the future associated with the particular history on which the conditional probability is defined, which makes learning the conditional distribution from data fundamentally difficult.

This paper develops a generative probabilistic forecasting (GPF) solution based on the generative AI principle that directly generates future time series samples based on current and past observations, bypassing the modeling, computational, and sample complexity challenges of forecasting the conditional probability distribution. Currently, such nonparametric GPF techniques do not exist. At the theoretical level, it is unknown what the GPF implementation architecture would be to guarantee the generation of samples with the correct conditional distribution.

1.1 Literature Review

The literature on parametric and nonparametric probabilistic forecasting of electricity market signals is vast. Our review here will focus on short-term (real-time) forecasting techniques for wholesale electricity prices and dispatch quantities such as area imbalances. To this end, we rely on earlier surveys [2, 3, 4] and pay more attention to contemporary machine learning-based techniques over the last decade or so.

Because wholesale electricity prices and dispatch imbalances are endogenously determined by optimization-based market clearing, they tend to be highly volatile due to their sensitivity to binding constraints. Overall, real-time prices and dispatch quantities behave quite differently from exogenous (physical) processes such as wind, solar, and demand time series. For this reason, our review excludes the extensive literature on energy forecasting, even though many techniques also apply to price forecasting. See [5] and references therein for energy forecasting.

Mainstream probabilistic forecasting techniques are from parametric or nonparametric families. Parametric forecasting predicts a parameterized conditional probability distribution of future time series variables, reducing an infinite-dimensional inference problem to one of finite-dimensional estimation. Popular approaches include autoregressive and moving average models [6, 7, 8, 9, 10, 11], Gaussian models [12], Student’s $t$ -distribution [13], and others [14]. A classic benchmark with strong performance is SNARX [4], which we include in our comparison. While restricting probability distributions to be parametric results in tractable computations, it comes with the loss of accuracy from model mismatch.

Nonparametric forecasting has a long history. See [2] for a review up to 1997. These techniques estimate the underlying probability distribution or its derived properties (e.g., quantiles) without assuming a parametric form. Classical techniques [15] face formidable sample and computational complexities, further magnified when the underlying time series has arbitrary temporal dependencies. Quantile regression is one of the most popular techniques for forecasting electricity prices. By estimating multiple quantiles, one obtains a histogram approximation of the underlying probability density function. One well-recognized example using quantile regression for day-ahead LMP forecasting is [16]. Over the last decade, modern machine learning techniques have been widely adopted to compute the conditional quantiles. Examples include Extreme Learning Machine (ELM) [17], Recurrent Neural Network (RNN) [18], and attention mechanism [19]. See [20] for a comparison study among multiple quantile regression methods. These techniques, however, are limited by the resolution of the quantiles and cannot produce real-valued generative samples corresponding to the true distribution.

Most machine learning-based GPF methods involve a deep neural network (possibly with sophisticated LSTM or transformer architectures) that maps past observations and a set of exogenous random variables to samples of future time series. Distinguishing these techniques are methods used to train the neural network and ways of generating exogenous random variables. Examples include variational autoencoder (VAE) based methods [21, 22, 23, 24] and the more recent normalizing flow and denoising diffusion techniques [25, 26]. Generative adversarial network (GAN) learning has also been used [27, 28]. Missing in these methods is some form of theoretical guarantee that the generative samples should follow the desired conditional probability distribution.

Unlike point forecasting problems where the ground-truth samples are available, evaluating GPF is particularly challenging because of the lack of a ground-truth distribution. Aside from standard evaluation criteria (see Appendix B), one way is to evaluate GPF by the performance of GPF-derived point forecasting techniques, since GPF can produce almost any type of point forecast. The argument is that if a GPF method performs well, then all the GPF-derived point forecasting methods should perform well under point forecasting metrics. Thus, it is relevant to review some of the well-tested point forecasting techniques with which we compare the point forecasting techniques derived from the proposed GPF method.

One of the earliest GAN-based point forecasting techniques for real-time LMP is [29]. Very recently, the success of large language models (LLMs) in natural language processing motivated their application in time series forecasting [30, 31], including the direct application of ChatGPT to price forecasting [32]. While there has been timely discussion of LLMs' capabilities and limitations in power system applications [32], there is a lack of understanding of the rationale for using highly complex LLMs for LMP forecasting and the reasons for their improvement over more conventional machine learning techniques. We include transformer-based point forecasting techniques as benchmarks for comparison in Sec. 4, given their demonstrated superior performance for natural languages and other time series. Informer [33] and Pyraformer [34] are point forecasting techniques that adopt the popular transformer model to capture temporal dependencies. For their award-winning contributions, we include [33] and [34] in our comparative study in this paper.

1.2 Summary of Contributions

We propose Weak Innovation Autoencoder-based GPF (WIAE-GPF), a novel approach inspired by the classic Wiener-Kallianpur innovation representation of nonparametric time series [35] and a relaxation by Rosenblatt [36]. A key contribution of this work is to establish formally that the GPF architecture shown in Fig. 1 is “provably correct.” By provably correct, we mean that with an optimally trained WIAE autoencoder $(G_{\theta^{*}},H_{\eta^{*}})$, the WIAE-GPF output $\tilde{{\bm{X}}}_{t}$ at time $t$ follows the probability distribution of the future variable ${\bm{X}}_{t+T}$ given $({\bm{X}}_{0}={\bm{x}}_{0},\cdots,{\bm{X}}_{t}={\bm{x}}_{t})$, the observed time series up to time $t$.

Figure 1: Forecasting pipeline for WIAE-GPF.

Note that the generator output $\tilde{{\bm{X}}}_{t}$ is a function of the observed time series ${\bm{x}}_{0:t}$ and an independently generated exogenous random vector $\tilde{\mathcal{V}}_{t}=({\bf V}_{1},\cdots,{\bf V}_{T})$ with independent and identically distributed (IID) uniform components, making $\tilde{{\bm{X}}}_{t}$ a function of the random vector $\tilde{\mathcal{V}}_{t}$. By generating $\tilde{{\cal V}}_{t}$ from $T$ IID samples of the uniform distribution, WIAE-GPF produces realizations of $\tilde{{\bm{X}}}_{t}$ following the same conditional distribution as ${\bm{X}}_{t+T}$. The formal definition of WIAE and its learning algorithm are presented in Sec. 2. The WIAE-GPF architecture and its validity are shown in Sec. 3.

Because practical implementations of WIAE are necessarily finite-dimensional, we establish a structural convergence property: as the input dimension of the WIAE grows, the conditional distribution of the WIAE-GPF output converges to the conditional distribution of the time series. See Sec. 3.2 for details.

There have been some but limited applications of generative AI techniques in power system operations despite their accelerated advances in representing and learning time series models. Missing in particular are the validation and comparative studies using real-world market data. We fill this gap by comparing the WIAE-GPF with leading traditional and machine-learning algorithms for three applications: real-time LMP forecasting for energy markets, interregional LMP spread forecasting for interchange markets, and area control error (ACE) forecasting for regulation markets. Such comparisons are essential because these real-time market signals have characteristics not present in media signals such as video and natural language time series, where machine learning techniques have demonstrated success. Both LMP and ACE are highly dynamic time series with frequent spikes. Our comparison study offers a compelling case for WIAE-GPF across multiple performance measures for point and probabilistic forecasters.

The idea of WIAE-GPF was first presented in a preliminary version of this work [37], over which the current paper makes substantial new contributions. (Based on a Turnitin comparison, this paper exhibits less than 15% overall similarity and less than 4% similarity to the preliminary version.) In particular, the Bayesian sufficiency theorem in Sec. 3.1 is significantly stronger than that in [37]. Also new are a theorem (Theorem 1 in Sec. 3.1) that establishes formally the validity of WIAE-GPF and a theorem on the structural convergence (Theorem 2 in Sec. 3.2). We consider three specific real-time market applications, two of which were not considered in [37]. The numerical results for all three applications, as well as the analysis and discussions, are all new.

1.3 Organization and Notations

This paper is organized as follows. Sec. 2 defines a nonparametric time series model, its innovation representations, and the learning algorithm of WIAE. Sec. 3 develops WIAE-GPF, the proposed GPF algorithm. Sec. 4 presents the application of WIAE-GPF in three market operations and the comparison studies of major forecasting benchmarks.

The notations used in this paper are standard. Random variables are in capital letters and their realizations in lowercase. Boldface letters are typically used for vectors and matrices. We use $({\bm{X}}_{t})$ to denote a multivariate random time series, where the column vector ${\bm{X}}_{t}=(X_{1t},\cdots,X_{dt})$ is the time series at time $t$, and $(X_{it})$ is the $i$th sub-time series of $({\bm{X}}_{t})$. In this paper, ${\bm{X}}_{t_{1}:t_{2}}$ denotes the segment of $({\bm{X}}_{t})$ from $t_{1}$ to $t_{2}$, i.e., ${\bm{X}}_{t_{1}:t_{2}}=({\bm{X}}_{t_{1}},\cdots,{\bm{X}}_{t_{2}})$. For two random vectors ${\bm{X}}$ and ${\bf Y}$, ${\bm{X}}\stackrel{\text{a.s.}}{=}{\bf Y}$ means the two random vectors are equal almost surely, and ${\bm{X}}\stackrel{\text{d}}{=}{\bf Y}$ means the two are equal in distribution. An IID random sequence with marginal cumulative distribution function $F$ is denoted by ${\bm{X}}_{t}\stackrel{\text{IID}}{\sim}F$. Table 1 shows the major designated symbols used in the paper.

Table 1: Abbreviations and mathematical notations used in this paper.

GPF	Generative Probabilistic Forecasting.
WIAE	Weak Innovation AutoEncoder.
ACE	Area Control Error.
LLM	Large Language Model.
NMSE	Normalized Mean Square Error.
NMAE	Normalized Mean Absolute Error.
MASE	Mean Absolute Scaled Error.
sMAPE	Symmetric Mean Absolute Percentage Error.
CRPS	Continuous Ranked Probability Score.
CP	Coverage Probability.
CPE	Coverage Probability Error.
NCW	Normalized Coverage Width.
$({\bm{X}}_{t})$	The random process of predictive interest.
$({\bf V}_{t})$	The innovation sequence.
$({\bf U}_{t})$	An IID sequence of uniform distribution.
$(\hat{{\bm{X}}}_{t})$	The reconstruction sequence output by the WIAE decoder.
$\left(\hat{{\bf V}}_{t}^{(m)}\right)$	The weak innovation sequence estimated by an $m$-dimensional WIAE.
$\left(\hat{{\bm{X}}}_{t}^{(m)}\right)$	The reconstruction sequence estimated by an $m$-dimensional WIAE.
$({\bm{x}}_{t})$	A sequence of real numbers indicating the past realizations of $({\bm{X}}_{t})$.
$(\bm{\nu}_{t})$	A sequence of real numbers indicating the past realizations of $({\bf V}_{t})$.
$G$	WIAE encoder function.
$H$	WIAE decoder function.
$G_{\theta}$	A neural network approximation of $G$ parameterized by $\theta$ .
$H_{\eta}$	A neural network approximation of $H$ parameterized by $\eta$ .
$D_{\gamma}$	Innovation discriminator that measures the distance between $({\bf V}_{t})$ and $({\bf U}_{t})$ .
$D_{\omega}$	Reconstruction discriminator that measures the distance between $({\bm{X}}_{0:t+T})$ and $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$ .
${\cal U}[0,1]^{d}$	The continuous uniform distribution on the $d$-dimensional unit cube $[0,1]^{d}$.

2 Innovation Representation Learning

2.1 Strong and Weak Innovation Representations

In 1958, Wiener and Kallianpur proposed an innovation representation of scalar time series [35]. In the parlance of modern machine learning, an innovation representation is a causal autoencoder shown in Fig. 2 with the latent process $({\bf V}_{t})$ being an IID-uniform innovation sequence. In particular, ${\bf V}_{t}$ represents the new information (innovation) contained in ${\bm{X}}_{t}$ independent of the past ${\bm{X}}_{0:t-1}=({\bm{X}}_{t-1},{\bm{X}}_{t-2},\cdots)$. Mathematically, the innovation representation of the time series is defined by causal mappings $(G,H)$ and $({\bf V}_{t})$:


$\displaystyle{\bf V}_{t}=G({\bm{X}}_{t},{\bm{X}}_{t-1},\cdots),$	(1a)
$\displaystyle({\bf V}_{t})\stackrel{\text{IID}}{\sim}{\cal U}[0,1]^{d},$	(1b)
$\displaystyle\hat{{\bm{X}}}_{t}=H({\bf V}_{t},{\bf V}_{t-1},\cdots).$	(1c)

The Wiener-Kallianpur innovation autoencoder further requires that the decoder output $(\hat{{\bm{X}}}_{t})$ reconstruct the input $({\bm{X}}_{t})$ almost surely, i.e., $(\hat{{\bm{X}}}_{t})\stackrel{\text{a.s.}}{=}({\bm{X}}_{t})$, which makes the Wiener-Kallianpur autoencoder a strong innovation autoencoder. The perfect causal reconstruction implies that the innovation sequence $({\bf V}_{t})$ is a sufficient statistic for all decision-making based on $({\bm{X}}_{t})$. Therefore, using the IID-uniform $({\bf V}_{t})$ for decision-making incurs no performance loss.

However, Rosenblatt showed that the Wiener-Kallianpur (strong) innovation representation does not exist for broad classes of random processes, including some of the widely used finite-state Markov chains [36]. Rosenblatt suggested a weak innovation representation, requiring that the autoencoder output $(\hat{{\bm{X}}}_{t})$ matches its input $({\bm{X}}_{t})$ only in distribution:

\displaystyle({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})\stackrel{\text{d}}{=}({\bm{X}}_{0:t+T}),\quad\forall t.

(2)

Herein, we call the autoencoder $(G,H)$ for the weak innovation representation the Weak Innovation Auto Encoder (WIAE).
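To make the roles of $G$ and $H$ concrete, the sketch below writes out an innovation autoencoder in closed form for a scalar Gaussian AR(1) process, one of the Gaussian cases for which such a representation is known. The process $x_{t}=a\,x_{t-1}+\sigma w_{t}$, the function names, and the assumption that $a$ and $\sigma$ are known are illustrative choices rather than part of the paper's method. The encoder standardizes the one-step prediction error and passes it through the normal CDF, and the decoder inverts that map recursively. In this special case the reconstruction is exact, so the representation is strong and therefore also weak.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: cdf() and inv_cdf() map Gaussian <-> uniform


def simulate_ar1(n, a, sigma, seed=0):
    """Simulate x_0..x_n of a stationary Gaussian AR(1): x_t = a*x_{t-1} + sigma*w_t."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sigma / (1.0 - a * a) ** 0.5)]  # x_0 drawn from the stationary law
    for _ in range(n):
        x.append(a * x[-1] + rng.gauss(0.0, sigma))
    return x


def encode(x, a, sigma):
    """Toy causal encoder G: observations x_0..x_n -> innovations v_1..v_n in (0, 1)."""
    return [STD_NORMAL.cdf((x[t] - a * x[t - 1]) / sigma) for t in range(1, len(x))]


def decode(x0, v, a, sigma):
    """Toy causal decoder H: rebuild x_1..x_n from the initial value x0 and innovations v."""
    x_hat, prev = [], x0
    for v_t in v:
        prev = a * prev + sigma * STD_NORMAL.inv_cdf(v_t)
        x_hat.append(prev)
    return x_hat


if __name__ == "__main__":
    x = simulate_ar1(2000, a=0.8, sigma=1.0)
    v = encode(x, 0.8, 1.0)
    x_hat = decode(x[0], v, 0.8, 1.0)
    print("mean of innovations:", sum(v) / len(v))
    print("max reconstruction error:", max(abs(p - q) for p, q in zip(x[1:], x_hat)))
```

For processes without such a closed form, the encoder and decoder have to be learned from data.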

2.2 Innovation Representation Learning

Beyond the Gaussian and additive Gaussian models, there is no known algorithm to obtain WIAE, especially when the underlying time series is nonparametric with an unknown probability structure. In [38], the authors proposed a GAN-based learning of the strong innovation representation by jointly minimizing the Wasserstein distance of the latent process from the uniform IID process and the mean squared error ( $l_{2}$ distance) of the autoencoder output as the estimate of the input. However, strong innovation representation applies only to a restricted class of time series, and the joint optimization of autoencoder with mixed Wasserstein and $l_{2}$ distance measures can be challenging. Finally, learning a scalar innovation representation limits the ability to incorporate multiple time series observations. The WIAE learning proposed below overcomes these shortcomings.

2.3 WIAE Learning

We present a deep learning approach to learn a WIAE for the weak innovation representation defined in (2). Shown in Fig. 3 is the schematic that highlights key components of the WIAE learning.

The encoder $G_{\theta}$ and decoder $H_{\eta}$ are causal convolutional neural networks parameterized by coefficients $\theta$ and $\eta$, respectively. The weak innovation representation, at its core, matches the input-output distributions and constrains the latent process $({\bf V}_{t})$ to be IID-uniform. To this end, we introduce two neural network discriminators, the innovation discriminator $D_{\gamma}$ and the reconstruction discriminator $D_{\omega}$ with parameters $\gamma$ and $\omega$, respectively, to enforce (1b) and (2). In particular, the innovation discriminator $D_{\gamma}$ compares the distributions of the encoder output $(\hat{{\bf V}}_{t})$ and the IID-uniform sequence $({\bf U}_{t})$, and the reconstruction discriminator $D_{\omega}$ compares the joint distributions of ${\bm{X}}_{0:t+T}$ and $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$ with sufficiently large $T$. These discriminators produce error signals to update the neural network parameters $(\theta,\eta,\gamma,\omega)$. In this work, we adopt the Wasserstein discriminator proposed in [39] to compute the Wasserstein distance between distributions.

Because both discriminators are based on the Wasserstein distance, the parameters $(\theta,\eta,\omega,\gamma)$ of the four networks can be obtained via a single optimization:

L:=\min_{\theta,\eta}\max_{\gamma,\omega}\Big(\mathbb{E}[D_{\gamma}(({\bf U}_{t}))]-\mathbb{E}[D_{\gamma}((\hat{{\bf V}}_{t}))]+\lambda\big(\mathbb{E}[D_{\omega}({\bm{X}}_{0:t+T})]-\mathbb{E}[D_{\omega}(({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T}))]\big)\Big),

(3)

where $\lambda$ is a real number that scales the two Wasserstein distances. The two parts of the inner maximization of the loss function (3) regularize $(G_{\theta},H_{\eta})$ according to (1b) and (2). Minimizing the inner maximum with respect to $\theta$ and $\eta$ is equivalent to enforcing that $(\hat{{\bf V}}_{t})$ be IID uniform and that $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$ have the same distribution as ${\bm{X}}_{0:t+T}$. The training of the four neural networks is standard; we use the off-the-shelf Adam optimizer.
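As a rough illustration of how the objective in (3) could be estimated from a mini-batch, the sketch below replaces each expectation with a sample average. The function name `wiae_objective`, the argument names, and the representation of the discriminators as plain Python callables returning a scalar score are assumptions made for illustration. Any constraint on the discriminators required by the Wasserstein formulation of [39], and the gradient updates themselves, are not shown.

```python
from statistics import fmean


def wiae_objective(d_gamma, d_omega, u_batch, v_hat_batch,
                   x_real_batch, x_mixed_batch, lam=1.0):
    """Mini-batch estimate of the bracketed term in (3).

    d_gamma, d_omega : callables mapping one segment to a scalar score
                       (hypothetical stand-ins for the two discriminators).
    u_batch          : segments of the IID-uniform reference sequence (U_t).
    v_hat_batch      : segments of the encoder output (V_hat_t).
    x_real_batch     : real segments X_{0:t+T}.
    x_mixed_batch    : segments (X_{0:t}, X_hat_{t+1:t+T}) with a decoded tail.
    lam              : the weight lambda scaling the reconstruction term.
    """
    innovation_gap = (fmean(d_gamma(u) for u in u_batch)
                      - fmean(d_gamma(v) for v in v_hat_batch))
    reconstruction_gap = (fmean(d_omega(x) for x in x_real_batch)
                          - fmean(d_omega(x) for x in x_mixed_batch))
    return innovation_gap + lam * reconstruction_gap
```

In an alternating training scheme, the discriminator parameters would be updated to increase this value and the encoder and decoder parameters to decrease it.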

In a practical implementation of WIAE, finite (input) dimensional neural networks are used. The training of a finite-dimensional WIAE via (3) must also be implemented by finite segments of the random processes involved. In Sec. 3.2, we consider the implications of such practical restrictions.

3 WIAE-GPF and its Properties

In this section, we introduce WIAE-GPF, a generative probabilistic forecasting technique based on the weak innovation representation. Specifically, given past observations ${\bm{x}}_{0:t}$, WIAE-GPF produces (arbitrarily many) samples of $\tilde{{\bm{X}}}_{t}$ that have the conditional distribution of ${\bm{X}}_{t+T}$. We present next the structure of WIAE-GPF, the Bayesian sufficiency of WIAE, and a structural convergence result for finite-dimensional implementations of WIAE.

3.1 Structure of WIAE-GPF

The structure of the proposed WIAE-GPF forecaster is shown in Fig. 1. At time $t$, given the realization ${\bm{X}}_{0:t}={\bm{x}}_{0:t}$ and the autoencoder $(G_{\theta^{*}},H_{\eta^{*}})$ trained by (3), the WIAE encoder $G_{\theta^{*}}$ maps the observations ${\bm{x}}_{0:t}$ up to time $t$ to the innovation sequence $\bm{\nu}_{0:t}$. The WIAE decoder $H_{\eta^{*}}$ maps $\bm{\nu}_{0:t}$ and independently generated IID-uniform pseudo innovations $\tilde{{\cal V}}_{t}\stackrel{\text{IID}}{\sim}{\cal U}[0,1]^{T}$ to a sample $\tilde{{\bm{X}}}_{t}=\tilde{{\bm{x}}}_{t}$.

Note that when forecasting ${\bm{X}}_{t+T}$ , we do not have realizations for random samples of ${\bm{X}}_{t+1:t+T}$ . The salient feature of WIAE-GPF is to replace samples from the unknown and arbitrarily distributed ${\bm{X}}_{t+1:t+T}$ by realizations of pseudo innovations $\tilde{\mathcal{V}}_{t}$ known to be IID-uniform. Thus, once the autoencoder is trained, generating random samples with the conditional distribution of ${\bm{X}}_{t+T}$ is trivial.
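The sampling procedure described above can be sketched as follows. Since a trained WIAE is not available here, the sketch substitutes a closed-form encoder and decoder for a scalar Gaussian AR(1) process $x_{t}=a\,x_{t-1}+\sigma w_{t}$ with known $a$ and $\sigma$. The names `toy_encoder`, `toy_decoder`, and `gpf_samples` are hypothetical stand-ins for $G_{\theta^{*}}$, $H_{\eta^{*}}$, and the forecasting pipeline, not the paper's implementation.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def toy_encoder(x, a, sigma):
    """Stand-in for the trained encoder: observed x_0..x_t -> innovations nu_1..nu_t."""
    return [STD_NORMAL.cdf((x[k] - a * x[k - 1]) / sigma) for k in range(1, len(x))]


def toy_decoder(x0, v, a, sigma):
    """Stand-in for the trained decoder: return the last value decoded from innovations v."""
    value = x0
    for v_k in v:
        value = a * value + sigma * STD_NORMAL.inv_cdf(v_k)
    return value


def draw_pseudo_innovations(rng, horizon):
    """Draw `horizon` IID-uniform pseudo innovations, kept strictly inside (0, 1)."""
    eps = 1e-12
    return [min(max(rng.random(), eps), 1.0 - eps) for _ in range(horizon)]


def gpf_samples(x_obs, horizon, num_samples, a, sigma, seed=0):
    """Generate samples of the T-step-ahead value given the observed history x_obs."""
    rng = random.Random(seed)
    nu = toy_encoder(x_obs, a, sigma)  # innovations of the observed history, computed once
    samples = []
    for _ in range(num_samples):
        pseudo = draw_pseudo_innovations(rng, horizon)  # replaces the unseen future innovations
        samples.append(toy_decoder(x_obs[0], nu + pseudo, a, sigma))
    return samples


if __name__ == "__main__":
    draws = gpf_samples([0.0, 1.0, 2.0], horizon=3, num_samples=5000, a=0.8, sigma=1.0)
    print("sample mean of 3-step-ahead forecast:", sum(draws) / len(draws))
```

For this toy process the conditional distribution of $x_{t+T}$ is Gaussian with mean $a^{T}x_{t}$, which gives a simple sanity check on the generated samples.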

We now establish the validity of WIAE-GPF by showing that the WIAE-GPF output $\tilde{{\bm{X}}}_{t}$ has the same conditional distribution as ${\bm{X}}_{t+T}$ given ${\bm{X}}_{0:t}={\bm{x}}_{0:t}$ . This is not obvious because the input of $H_{\eta^{*}}$ includes an exogenous random vector $\tilde{{\cal V}}_{t}$ and the weak innovation $({\bf V}_{0:t})$ that may not be a sufficient statistic.

We first show that the weak innovation sequence is Bayesian sufficient, which implies that any stochastic decision involving the future time series ${\bm{X}}_{t+T}$ can be made without loss based on the innovations ${\bf V}_{0:t}$. Here, a statistic $T(X)$ is said to be Bayesian sufficient for the estimation of a random variable $Y$ if the posterior distribution of $Y$ given $X$ is the same as the one given $T(X)$ [40]. The same result was first presented in [37] under the more restrictive setting of $H_{\eta^{*}}$ being injective.

Lemma 1 (Bayesian Sufficiency of Multivariate Weak Innovations)

Let $({\bm{X}}_{t})$ be a stationary time series for which the weak innovation representation exists. Let $({\bf V}_{t})$ be the weak innovation representation of $({\bm{X}}_{t})$ . Then, for all ${\bm{x}}$ and ${\bm{X}}_{0:t}={\bm{x}}_{0:t}$ ,

\displaystyle\Pr[{\bm{X}}_{t+T}\leq{\bm{x}}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}]=\Pr\left[\hat{{\bm{X}}}_{t+T}\leq{\bm{x}}|{\bf V}_{0:t}=\bm{\nu}_{0:t}\right].

(4)

Proof: By the definition of weak innovation representation,

\begin{aligned}\Pr[{\bm{X}}_{t+T}\leq{\bm{x}}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}]&=\Pr[\hat{{\bm{X}}}_{t+T}\leq{\bm{x}}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}]\\ &\stackrel{(a)}{=}\Pr[\hat{{\bm{X}}}_{t+T}\leq{\bm{x}}|G_{\theta^{*}}({\bm{X}}_{0:t})=G_{\theta^{*}}({\bm{x}}_{0:t})]\\ &=\Pr[\hat{{\bm{X}}}_{t+T}\leq{\bm{x}}|{\bf V}_{0:t}=\bm{\nu}_{0:t}],\end{aligned}

(5)

where $(a)$ follows from the Markovian structure of the autoencoder, i.e., ${\bm{X}}_{0:t}\stackrel{G_{\theta^{*}}}{\rightarrow}\hat{{\bf V}}_{0:t}\stackrel{H_{\eta^{*}}}{\rightarrow}\hat{{\bm{X}}}_{0:t}$. By the definition of Bayesian sufficiency [40], ${\bf V}_{0:t}=G_{\theta^{*}}({\bm{X}}_{0:t})$ is a sufficient statistic for ${\bm{X}}_{t+T}$ for all $T>0$. $\square$

The validity of WIAE-GPF is shown next.

Theorem 1 (Validity of WIAE-GPF)

For all $T>0$, the conditional distribution of the WIAE-GPF output $\tilde{{\bm{X}}}_{t}$ given ${\bm{X}}_{0:t}={\bm{x}}_{0:t}$ is identical to that of ${\bm{X}}_{t+T}$ (given ${\bm{X}}_{0:t}={\bm{x}}_{0:t}$), i.e.,

\displaystyle\Pr[{\bm{X}}_{t+T}\leq{\bm{x}}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}]=\Pr[\tilde{{\bm{X}}}_{t}\leq{\bm{x}}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}].

(6)

Proof: By Lemma 1,

\displaystyle\Pr[{\bm{X}}_{t+T}\leq{\bm{x}}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}]=\Pr[\hat{{\bm{X}}}_{t+T}\leq{\bm{x}}|{\bf V}_{0:t}=\bm{\nu}_{0:t}],

(7)

where ${\bf V}_{0:t}=G_{\theta^{*}}({\bm{X}}_{0:t})$ and $\bm{\nu}_{0:t}=G_{\theta^{*}}({\bm{x}}_{0:t})$. Now consider

\begin{cases}\hat{{\bm{X}}}_{t+T}=H_{\eta^{*}}({\bf V}_{0:t},{\bf V}_{t+1:t+T}),\\ \tilde{{\bm{X}}}_{t}=H_{\eta^{*}}({\bf V}_{0:t},\tilde{{\cal V}}_{t}),\end{cases}

where, by definition, $({\bf V}_{0:t},{\bf V}_{t+1:t+T},\tilde{{\cal V}}_{t})$ are jointly independent IID-uniform sequences, and $\tilde{{\cal V}}_{t}\stackrel{\text{d}}{=}{\bf V}_{t+1:t+T}$. Therefore,

\displaystyle\Pr[\hat{{\bm{X}}}_{t+T}\leq{\bm{x}}|{\bf V}_{0:t}=\bm{\nu}_{0:t}]=\Pr[\tilde{{\bm{X}}}_{t}\leq{\bm{x}}|{\bf V}_{0:t}=\bm{\nu}_{0:t}].

(8)

Combining (7) and (8), we have (6). $\square$

3.2 Structural Convergence of WIAE-GPF

This section focuses on the practical issue of finite-dimensional implementations of WIAE and the discriminators in Fig. 3. No machine learning technique guarantees that a finite-dimensional implementation can extract weak innovations, even if the number of historical samples available is unbounded. Here we present a structural convergence result to show that, under ideal training conditions with unbounded training samples and global convergence of training, the innovations generated from a finite-dimensional WIAE converge in distribution to the true weak innovations.

The structural convergence analysis assumes that the training samples are unbounded, and the training algorithm converges to the global minimum. Let $G_{\theta^{*}}^{(m)}$ be the optimally trained finite (input) dimensional CNN encoder that takes time-shifted $m$ consecutive observations ${\bm{X}}_{t-m+1:t}$ and produces the latent process $\left(\hat{{\bf V}}^{(m)}_{t}\right)$ . Likewise, let $H^{(m)}_{\eta^{*}}$ be the optimally trained $m$ -dimensional CNN decoder that produces the WIAE output sequence $\left(\hat{{\bm{X}}}^{(m)}_{t}\right)$ . Similarly defined are the finite dimensional discriminators that take $n$ consecutive inputs, denoted by $(D_{\omega}^{(n)},D_{\gamma}^{(n)})$ . In this paper, we choose $n=m$ .

To analyze the asymptotic property of finite (input) dimensional WIAE-GPF, we make the following assumptions:

A1. Existence: The random process $({\bm{X}}_{t})$ has a weak innovation representation defined in (1a)-(1c) and (2), and there exists a causal WIAE with continuous $G$ and $H$.

A2. Feasibility: There exists a sequence of finite-input-dimension autoencoder functions $(G_{\bar{\theta}}^{(m)},H_{\bar{\eta}}^{(m)})$ that converges uniformly to $(G,H)$ under the mean-squared distance metric.

A3. Training: The training sample sizes are infinite. The training algorithm for all finite-dimensional WIAE using finite-dimensional training samples converges almost surely to the global optimum.

Theorem 2

Under (A1-A3),

\displaystyle({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+T}^{(m)})\stackrel{\text{d}}{\rightarrow}({\bm{X}}_{0:t},{\bm{X}}_{t+T}),\quad\forall t

(9)

as $m$ goes to infinity.

Proof: See Appendix A.

3.3 From GPF to Point and Quantile Forecasting

GPF produces samples of the conditional probability distribution, from which point and quantile forecasts can be easily computed. Here we outline how to compute several popular point and quantile forecasts from these samples. To this end, let $\left\{\tilde{{\bm{x}}}_{t}^{(k)},k=1,\cdots,K\right\}$ be the set of GPF-generated samples from the probability distribution of the time series at time $t+T$ conditioned on past observations up to time $t$. For notational simplicity, we assume that $\left\{\tilde{{\bm{x}}}_{t}^{(k)}\right\}$ is sorted in ascending order.

Minimum Mean Squared Error (MMSE) Forecast: The MMSE forecast is the mean of the conditional distribution. The MMSE forecast $\hat{{\bm{x}}}^{\mbox{\tiny MMSE}}_{t}$ by a GPF is given by the conditional sample mean

\hat{{\bm{x}}}^{\mbox{\tiny MMSE}}_{t}=\frac{1}{K}\sum_{k=1}^{K}\tilde{{\bm{x}}}_{t}^{(k)}.

Minimum Mean Absolute Error (MMAE) Forecast: The MMAE forecast is the median of the conditional distribution. The MMAE forecast $\hat{{\bm{x}}}^{\mbox{\tiny MMAE}}_{t}$ by a GPF is given by the conditional sample median

\hat{{\bm{x}}}^{\mbox{\tiny MMAE}}_{t}=\begin{cases}\tilde{{\bm{x}}}_{t}^{((K+1)/2)},&\mbox{if $K$ is odd}\\ 0.5\left(\tilde{{\bm{x}}}_{t}^{(K/2)}+\tilde{{\bm{x}}}_{t}^{(K/2+1)}\right),&\mbox{if $K$ is even}.\end{cases}

Quantile Forecast: The GPF forecast of $q$ -quantile $\hat{{\bm{x}}}^{\mbox{\tiny$q$-quantile}}_{t}$ is given by:

\hat{{\bm{x}}}^{\mbox{\tiny$q$-quantile}}_{t}=\begin{cases}\tilde{{\bm{x}}}_{t}^{(qK)},&\mbox{if $qK$ is an integer}\\ 0.5\left(\tilde{{\bm{x}}}_{t}^{([qK])}+\tilde{{\bm{x}}}_{t}^{([qK]+1)}\right),&\mbox{otherwise},\end{cases}

where $[a]$ indicates the greatest integer not exceeding $a$ .
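The three forecasts above translate directly into code. The sketch below assumes scalar samples ($d=1$). The function names, the small tolerance used to decide whether $qK$ is an integer in floating-point arithmetic, and the clamping of indices at the two ends of the sorted sample are assumptions of this illustration, not details specified in the text.

```python
import math


def mmse_forecast(samples):
    """MMSE forecast: the sample mean of the GPF samples."""
    return sum(samples) / len(samples)


def mmae_forecast(samples):
    """MMAE forecast: the sample median of the GPF samples."""
    s = sorted(samples)
    k = len(s)
    if k % 2 == 1:
        return s[(k + 1) // 2 - 1]          # 1-based index (K+1)/2
    return 0.5 * (s[k // 2 - 1] + s[k // 2])  # 1-based indices K/2 and K/2+1


def quantile_forecast(samples, q):
    """q-quantile forecast following the two-case formula, with 1-based order statistics."""
    s = sorted(samples)
    k = len(s)
    if k == 1:
        return s[0]
    qk = q * k
    if abs(qk - round(qk)) < 1e-9:           # treat qK as an integer (tolerance is an assumption)
        idx = min(max(int(round(qk)), 1), k)  # clamp to a valid 1-based index
        return s[idx - 1]
    lo = min(max(math.floor(qk), 1), k - 1)   # [qK], clamped so that lo and lo+1 both exist
    return 0.5 * (s[lo - 1] + s[lo])


if __name__ == "__main__":
    draws = [5.0, 1.0, 4.0, 2.0, 3.0]
    print(mmse_forecast(draws), mmae_forecast(draws), quantile_forecast(draws, 0.4))
```

A prediction interval can be formed from two calls to `quantile_forecast`, for example at levels 0.05 and 0.95.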

4 WIAE-GPF for Market Operations

We now apply WIAE-GPF to forecasting market signals such as locational marginal prices and market imbalances. At the outset, we recognize that the underlying random processes are not known to be stationary, whereas WIAE-GPF is derived based on a representation of stationary processes. Here, we rely on the hypothesis that these processes are approximately stationary locally within the forecasting horizon. Our evaluations based on real market data presented here provide some empirical support for this hypothesis. Brief discussions on the limitations and possible extensions can be found in Sec. 5.

We conducted extensive experiments to compare leading GPF and point forecasting techniques based on a suite of performance metrics. This section summarizes our findings for three market applications where GPF is particularly valuable to system operators and market participants: (a) LMP forecasting for the optimal bidding in NYISO’s 5-minute real-time energy market, (b) GPF for the interregional LMP spread for the Coordinated Transaction Scheduling (CTS) [41] market between NYISO and PJM, and (c) ACE forecasting for regulation services using PJM’s 15-second ACE data. Common to these applications is that the forecasted variables are endogenously determined by the market operations. In contrast to exogenous variables such as wind/solar generations and inelastic demands, the LMP and ACE values are the results of dispatch and commitment optimization, where binding constraints introduce spikes in dual variables from which LMPs are computed. They are highly dynamic as shown in Fig. 4.

4.1 Baseline Methods in Comparison

Table 2: Comparison of the baselines.

Algorithm	Forecasting Type	Time Series Model	Forecaster Output	ML Models
SNARX [4]	Probabilistic	Semiparametric AR	AR Model Parameters	Kernel Estimation
WIAE-GPF	Probabilistic	Nonparametric	Generative	CNN + WIAE
TLAE [21]	Probabilistic	Parametric	Generative	RNN + VAE
DeepVAR [11]	Probabilistic	Parametric (AR Model)	Model Parameters	LSTM
LQRA [16]	Probabilistic	Nonparametric	Forecasted Quantiles	Ensemble method + Lasso regularization
BWGVT [19]	Probabilistic	Nonparametric	Forecasted Quantiles	LLM + Quantile Regression
Pyraformer [34]	Point	Nonparametric	Point Estimate	LLM
Informer [33]	Point	Nonparametric	Point Estimate	LLM

We compared WIAE-GPF with seven leading forecasters, selected for their relevance to power system applications and their established reputations. See Table 2 for the attributes of these techniques and references. WIAE-GPF is the only nonparametric GPF forecaster in the comparison. Because nonparametric GPF techniques are scarce, we also included popular machine-learning-based parametric GPF and point-forecasting techniques.

We started with SNARX [4], a classical parametric forecasting technique based on an auto-regressive moving-average (ARMA) model. An early study showed that SNARX performed the best among a list of parametric techniques [4]. For deep-learning techniques, we compared with DeepVAR [11], a multivariate generalization of the popular DeepAR that has become a key baseline for time series forecasting in multiple applications. Temporal Latent AutoEncoder (TLAE) [21] is an autoencoder-based parametric GPF where the conditional distribution of the future time series variables is obtained through a transformed Gaussian random process with mean and variance as parameters. Once these parameters are estimated from observed realizations, Gaussian Monte Carlo samples are fed into the decoder to generate samples of the forecasted random variable. LQRA [16] builds on quantile regression averaging by applying LASSO regularization to the quantile regression. It computes conditional quantiles by linearly transforming a collection of point forecasts, which are obtained by fitting an expert model. The expert model was originally proposed for day-ahead price prediction, and thus we adapted it by adding the real-time prices of the past 5 hours to the linear autoregression.

We also included three popular forecasting techniques based on LLMs. One is the award-winning technique Informer [33]; another is Pyraformer [34], which captures temporal dependency at multiple granularities (Pyraformer was a keynote presentation at the International Conference on Learning Representations (ICLR) 2022). Pyraformer showed superior performance over a wide range of LLM-based point forecasting techniques. The third, BWGVT [19], which we have named by the first letters of its authors' last names, was developed specifically for LMP forecasting and combines quantile regression with a transformer architecture derived for LLMs.

4.2 Evaluation Metrics

Comparing probabilistic forecasting methods is difficult due to the lack of ground truth for the underlying conditional distribution. However, because a GPF can produce arbitrarily many Monte Carlo samples (we used $1000$ Monte Carlo samples and their sample average to obtain point estimates), it can be evaluated by all point forecasting metrics. More importantly, an ideal probabilistic forecaster that produces the correct conditional distribution will perform well under any point estimation metric under regularity conditions. Therefore, evaluating a GPF method under a set of point-forecasting metrics is a credible way to assess its performance.

We used four popular point forecasting metrics and three widely used probabilistic forecasting metrics; see Appendix B for their definitions. Normalized mean squared error (NMSE) measures the error associated with the mean of the estimated conditional probability distribution. Normalized mean absolute error (NMAE) measures the error associated with the median. Mean absolute scaled error (MASE) is the ratio of the NMAE of a method to that of the (naive) persistence predictor that uses the latest available observation as the forecast. Symmetric mean absolute percentage error (sMAPE) averages and symmetrizes the percentage error computed at each time stamp and is less sensitive to outliers. The probabilistic forecasting metrics were the continuous ranked probability score (CRPS) [42], Coverage Probability Error (CPE), and Normalized Coverage Width (NCW). CRPS evaluates the quadratic difference between the predicted empirical cumulative distribution function (c.d.f.) and an indicator c.d.f. based on the ground truth. CPE and NCW are often used to evaluate prediction intervals. CPE is the deviation of the coverage probability (CP) from the nominal confidence level $\beta\%$ , whereas NCW represents the width of the prediction intervals. At a similar level of CP, the method with the smaller NCW is more accurate in prediction interval estimation. In this paper, we computed the CPE and NCW of the 10%, 50%, and 90% intervals predicted by each probabilistic method.
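To make the scaled and symmetric point metrics concrete, the following Python sketch computes MASE against a persistence predictor and one common form of sMAPE. It is an illustration under stated assumptions, not the definitions used in the paper's appendix: the persistence forecast of the value at time $t+T$ is taken to be the observation at time $t$, sMAPE is taken in the form $2|y-\hat{y}|/(|y|+|\hat{y}|)$ averaged over time, and the function names `mase` and `smape` are hypothetical.

```python
import numpy as np

def mase(y_true, y_pred, horizon):
    """MAE of the forecaster divided by MAE of the persistence predictor.

    Assumption: the persistence forecast for y[t + horizon] is y[t];
    both errors are averaged over the targets y[horizon:].
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae_model = np.mean(np.abs(y_true[horizon:] - y_pred[horizon:]))
    mae_naive = np.mean(np.abs(y_true[horizon:] - y_true[:-horizon]))
    return float(mae_model / mae_naive)

def smape(y_true, y_pred):
    """Symmetric MAPE in the form mean(2|y - yhat| / (|y| + |yhat|)).

    Assumption: time stamps where both values are zero contribute zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    num = 2.0 * np.abs(y_true - y_pred)
    den = np.abs(y_true) + np.abs(y_pred)
    ratio = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return float(np.mean(ratio))
```

A MASE below one indicates that the forecaster beats the persistence predictor on average; a forecaster that simply repeats the latest observation scores exactly one.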

For probabilistic forecasting techniques, we used their conditional means as the point forecasts when evaluated by NMSE, and their conditional medians for NMAE, MASE, and sMAPE. For the quantile regression technique BWGVT, we used the estimated 0.5-quantile as its point forecast for all metrics, since it is unclear how to compute an empirical mean from quantiles. When producing interval forecasts for GPF methods, we took the empirical quantiles of the Monte Carlo forecasts of future values. In particular, the $\beta$ -coverage interval was defined as the $\beta$ -width interval symmetric around the sample median.
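The interval construction and the sample-based probabilistic scores can be sketched as follows. This Python sketch rests on several assumptions that are not taken from the paper: the $\beta$-coverage interval is read as the equal-tailed interval between the $(0.5-\beta/2)$ and $(0.5+\beta/2)$ sample quantiles, `numpy.quantile` (linear interpolation) replaces the order-statistic quantile rule, CPE is taken as empirical coverage minus $\beta$ with $\beta$ expressed as a fraction, and CRPS uses the standard sample estimator $\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X'|$. All function names are hypothetical.

```python
import numpy as np

def central_interval(samples, beta):
    """Equal-tailed beta-coverage interval per time stamp.

    samples has shape (N, K): K generated samples for each of N targets.
    """
    lo = np.quantile(samples, 0.5 - beta / 2.0, axis=1)
    hi = np.quantile(samples, 0.5 + beta / 2.0, axis=1)
    return lo, hi

def coverage_probability_error(y, samples, beta):
    """Empirical coverage minus the nominal level beta (as a fraction).

    A positive value means the intervals cover more than nominal.
    """
    y = np.asarray(y, dtype=float)
    lo, hi = central_interval(np.asarray(samples, dtype=float), beta)
    coverage = np.mean((y >= lo) & (y <= hi))
    return float(coverage - beta)

def crps_from_samples(y, samples):
    """Average sample-based CRPS: E|X - y| - 0.5 E|X - X'|.

    Note: the pairwise term uses O(N K^2) memory, so large K may need
    batching over time stamps.
    """
    y = np.asarray(y, dtype=float)
    s = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(s - y[:, None]), axis=1)
    term2 = 0.5 * np.mean(np.abs(s[:, :, None] - s[:, None, :]), axis=(1, 2))
    return float(np.mean(term1 - term2))
```

When every sample equals the same value, the CRPS estimator reduces to the absolute error of that value, which is one way to see why CRPS can be compared across probabilistic and point forecasters.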

4.3 LMP forecasting for Energy Market Participation

Table 3: Point evaluation of forecasting results for real-time price forecasting at Long Island. The numbers in parentheses are the rankings of the algorithms. The columns under the label LONGIL give the GPF performance of the 12-step forecasting of the LMP at LONGIL. The columns under LONGIL & NYC give the performance of LMP forecasting at LONGIL using both LONGIL and NYC observations.

Methods	NMSE		NMAE		MASE		sMAPE
Methods	LONGIL	LONGIL & NYC	LONGIL	LONGIL & NYC	LONGIL	LONGIL & NYC	LONGIL	LONGIL & NYC
SNARX [4]	(8) $0.9852$	(7) $0.5029$	(8) $0.9733$	(8) $0.7318$	(8) $0.7053$	(8) $0.4548$	(7) $0.4319$	(7) $0.4162$
WIAE-GPF	(1) $\mathbf{0.0585}$	(2) $0.0487$	(1) $\mathbf{0.2074}$	(1) $\mathbf{0.1186}$	(1) $\mathbf{0.1503}$	(1) $\mathbf{0.0737}$	(1) $\mathbf{0.0839}$	(1) $\mathbf{0.0316}$
TLAE [21]	(4) $0.2956$	(1) $\mathbf{0.0232}$	(6) $0.4186$	(2) $0.1366$	(6) $0.3034$	(2) $0.0849$	(3) $0.2720$	(2) $0.3190$
DeepVAR [11]	(6) $0.3919$	(6) $0.4060$	(5) $0.4088$	(7) $0.4097$	(5) $0.2963$	(7) $0.2546$	(5) $0.3629$	(3) $0.3703$
LQRA [16]	(7) $0.8287$	(8) $0.7110$	(7) $0.8472$	(4) $0.2693$	(7) $0.6139$	(4) $0.1623$	(8) $0.4952$	(8) $0.4879$
BWGVT [19]	(3) $0.2670$	(5) $0.2528$	(4) $0.3158$	(6) $0.3280$	(4) $0.2289$	(6) $0.2038$	(4) $0.2817$	(6) $0.3900$
Pyraformer [34]	(5) $0.3128$	(3) $0.1382$	(2) $0.3074$	(3) $0.2556$	(2) $0.2228$	(3) $0.1588$	(2) $0.1059$	(4) $0.3765$
Informer [33]	(2) $0.1912$	(4) $0.1829$	(3) $0.3147$	(5) $0.2830$	(3) $0.2281$	(5) $0.1759$	(6) $0.3761$	(5) $0.3827$

Table 4: Evaluation of probabilistic forecasting results of real-time prices at Long Island. The numbers inside the square brackets are the corresponding NCW.

Methods	CRPS		CPE (90%) [NCW]		CPE (50%) [NCW]		CPE (10%) [NCW]
Methods	LONGIL	LONGIL & NYC	LONGIL	LONGIL & NYC	LONGIL	LONGIL & NYC	LONGIL	LONGIL & NYC
SNARX [4]	(3) $18.2864$	(3) $10.5286$	(5) $-0.2263$ $[0.6620]$	(6) $-0.3779$ $[0.7990]$	(5) $-0.2451$ $[0.5574]$	(2) $-0.0746$ $[0.9161]$	(3) $-0.0641$ $[0.8145]$	(6) $0.1870$ $[0.8082]$
WIAE-GPF	(1) $\mathbf{4.6029}$	(1) $\mathbf{1.1519}$	(1) $~{}\mathbf{0.0245}$ $[0.9936]$	(1) $~{}\mathbf{0.0157}$ $[0.9958]$	(1) $\mathbf{0.0216}$ $[0.9681]$	(1) $\mathbf{0.0184}$ $[0.8742]$	(1) $\mathbf{0.0242}$ $[0.9836]$	(1) $\mathbf{-0.0190}$ $[0.8207]$
TLAE [21]	(2) $6.1874$	(2) ${2.8449}$	(2) $~{}0.0325$ $[0.8179]$	(2) $~{}0.0592$ $[0.8176]$	(2) $0.0423$ $[0.6357]$	(3) $0.1286$ $[0.7637]$	(5) $-0.0848$ $[0.6704]$	(4) $-0.0834$ $[0.6516]$
DeepVAR [11]	(4) $23.1460$	(5) $22.9343$	(4) $-0.0575$ $[0.9738]$	(3) $-0.0761$ $[0.8467]$	(4) $-0.1637$ $[0.5709]$	(5) $-0.1488$ $[0.6614]$	(2) $-0.0616$ $[0.5889]$	(3) $-0.0431$ $[0.4260]$
LQRA [16]	(6) $48.9460$	(4) $17.0583$	(6) $-0.2601$ $[0.7130]$	(5) $-0.0958$ $[0.8434]$	(3) $-0.0465$ $[0.6534]$	(4) $0.1364$ $[0.9082]$	(4) $0.0648$ $[0.8494]$	(5) $0.1529$ $[0.8827]$
BWGVT [19]	(5) $24.2595$	(6) $24.3065$	(3) $~{}0.0999$ $[7.3500]$	(4) $~{}0.0829$ $[3.4864]$	(6) $0.4815$ $[2.7822]$	(6) $0.4322$ $[3.0100]$	(6) $0.1034$ $[7.2325]$	(2) $0.0311$ $[1.1181]$

For a self-scheduled resource submitting a quantity bid to the energy market, the ability to forecast future prices is essential in constructing its bids and offers. With GPF generating future LMP realizations, the problem of optimal offer/bid strategies can be formulated as scenario-based stochastic optimization [43]. Our experiment was based on a use case of a merchant storage owner submitting quantity offers and bids to a deregulated wholesale market, using LMP from NYISO as the hypothetical price realizations.

The real-time market of NYISO closes sixty minutes ahead of actual delivery, which means that the forecasting horizon needs to be at least 60 minutes. Two experiments were conducted to produce probabilistic forecasts of 60-minute-ahead LMPs at Long Island (LONGIL) using (a) the day-ahead prices and the current and past real-time LMPs at LONGIL, along with the system load up to the time of submitting the bid; (b) the neighboring NYC real-time LMP in addition to the data in (a).

Fig. 4 shows the real-time LMP trajectories at both LONGIL and NYC along with the demand and the day-ahead LMP at LONGIL. The real-time LMPs at LONGIL and NYC showed apparent spatial dependencies, while the dependencies between day-ahead and real-time LMPs at LONGIL, and between load and real-time LMP, were less obvious. The real-time LMPs and load were collected every 5 minutes and day-ahead LMPs every hour. We used the first 25 days of July for training and validation, and the last 6 days of July for evaluation.

Test results are shown in Tables 3 and 4, with boldfaced numbers indicating the best performance. WIAE-GPF performed the best in all cases except for the forecasting of the LONGIL LMP using both LONGIL and NYC data under NMSE, where it was a close second. The strong performance of WIAE-GPF stems from its aim of matching the conditional distribution, with a validity guarantee that the Monte Carlo samples generated have the same conditional distribution as that of the actual time series variable. Overall, the second-best was TLAE; the VAE-trained RNN autoencoder achieved a noticeable gain by combining data from two locations, especially for the joint forecasting at LONGIL & NYC. However, under the probabilistic forecasting measure of CRPS, TLAE's score was about 34% higher than that of WIAE-GPF for LONGIL and about 147% higher for LONGIL & NYC (Table 4). Because sampling the correlated Gaussian latent process is nontrivial, TLAE used a re-parameterization heuristic that could cause an accumulation of biases.

The three LLM-based forecasters (BWGVT, Pyraformer, and Informer) performed roughly the same: worse than WIAE-GPF and TLAE but better than SNARX and DeepVAR. Note that Pyraformer and Informer are both point forecasters trained to minimize the mean squared forecasting error; they did not perform well under NMAE, MASE, and sMAPE. BWGVT is adapted to be a probabilistic forecaster, although it did not perform well under CRPS. Its quantile prediction was outlier-sensitive, especially when compared with the GPF methods that depend on a stochastic latent process. We also observed that BWGVT predicted wider intervals, as shown by its high NCW in all cases. Consequently, its CPEs were always positive, indicating that the prediction intervals it generated covered more than the nominal percentages. Hence, both its point estimation results and its CRPS were worse than those of TLAE and WIAE-GPF.

The other quantile-regression-based method, LQRA, performed the worst among nonparametric techniques. Originally proposed for hourly price forecasting, LQRA assumes a simple regression model to produce point forecasts, with nonlinearity introduced only by the one-day-ahead minimum and maximum day-ahead prices. Although this model might be sufficient for day-ahead price forecasting, the model mismatch when it is applied to real-time price forecasting aggravates the inaccuracy.

The two parametric techniques, DeepVAR and SNARX, being significantly simpler than the LLM forecasters, performed slightly worse. Neither technique performed well under point estimation metrics. As shown by the CPEs in Table 4, their prediction interval estimation was also worse than that of the other probabilistic forecasting methods. Their NCW showed that they predicted narrow intervals that failed to cover the majority of the ground truths. The inaccurate semiparametric-AR model assumption appears to be culpable: their performance was affected by model mismatch, since ARMA models typically expect smoother trajectories. For the real-time LMP, which exhibits high volatility, these methods were slow to catch the rapid changes, predicting shifted peaks that resulted in large forecasting errors. This phenomenon was corroborated by their high sMAPE, indicating a higher percentage error at each time stamp.

It is interesting to observe that using LMPs at neighboring locations improved the forecasting performance of all methods except DeepVAR and BWGVT, as shown by the LONGIL and LONGIL & NYC columns of the DeepVAR and BWGVT rows in Table 3. This suggests that these two methods did not account for spatial correlations optimally. We attribute this to the difficulty of training transformer and LSTM models.

To gain insights into the performance of WIAE-GPF and other benchmark techniques, we plotted the ground truth trajectories (black) and trajectory forecasts generated by WIAE-GPF (red) and a competing algorithm (blue) in Fig. 5. Note that the spikes were not predicted by any methods. This was not surprising given the nature of how these spikes were produced. Aside from these spikes, these figures show clearly that WIAE-GPF (red) tracked the ground truth (black) the closest, which was supported by the fact that WIAE-GPF has the smallest NMAE. We also observed that WIAE-GPF had the smallest variation, which is supported by the fact that WIAE-GPF had the smallest NMSE. Furthermore, WIAE-GPF was the least affected by the price spikes. This was because, as a GPF method, the Monte-Carlo samples used to produce the MMSE point estimate were less likely to include extreme samples. However, for SNARX, as shown in Fig. 5(c), the spikes caused significant deviations from the forecasted trajectory.

4.4 Interregional LMP spread for Interchange Markets

The interchange market aims to improve overall economic efficiency across ISOs by allowing virtual bidders to arbitrage price differences at proxy buses of two neighboring ISOs. This experiment was based on the use case of a virtual bidder bidding into the CTS market between NYISO and PJM. The proxy buses of this market were Sandy Point of NYISO and Neptune of PJM.

The CTS market closes 75 minutes ahead of delivery and is cleared every 15 minutes. A virtual bidder submits a price-quantity bid along with the direction of the virtual trade from the source of the proxy with low LMP to the destination proxy with high LMP. Once the market is cleared, the settlement is based on the actual LMP spread between the two proxies and the cleared quantity. The bidder profits if the virtual trade direction matches the direction of the real-time LMP spread. Otherwise, the bidder incurs a loss. Therefore, the ability to predict the LMP spread direction is especially important.

We performed 75-minute-ahead LMP spread forecasting using the interface power flow and LMP spread data between NYISO and PJM at the Neptune proxy, collected in February 2024. The interface power flow samples were collected every 5 minutes, and the LMP spread every 15 minutes. We used the first 24 days for training and validation, and the last 5 days of February for testing.

We added Prediction Error Rate (PER) as a measure of the accuracy of the virtual trading direction prediction, given that the sign of the spread is of great importance to profitability. PER indicates the percentage of forecasts that do not have the same direction as the ground truth. For point forecasts, we compared the signs of the forecasts with the signs of the ground truth. For probabilistic forecasting, we compared the direction of the ground truth with that of the minimum error-probability prediction of the LMP spread, which is the sign of the conditional median. For GPF, we compared the sign of the sample median with the sign of the ground truth.
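The PER computation for a generative forecaster reduces to a sign comparison, as the following Python sketch illustrates. It is a minimal example, assuming that the generated spread samples are arranged as an array with one row per forecast time; the function name `prediction_error_rate` and the treatment of an exactly zero spread (counted as its own direction) are assumptions, not details from the paper.

```python
import numpy as np

def prediction_error_rate(spread_true, spread_samples):
    """Fraction of forecasts whose predicted direction is wrong.

    spread_true: shape (N,), realized LMP spreads.
    spread_samples: shape (N, K), generated spread samples per forecast.
    The predicted direction is the sign of the sample median.
    """
    spread_true = np.asarray(spread_true, dtype=float)
    median = np.median(np.asarray(spread_samples, dtype=float), axis=1)
    return float(np.mean(np.sign(median) != np.sign(spread_true)))
```

For a point forecaster, the same function can be used by passing the point forecasts as an array with a single column.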

Table 5: Evaluation of forecasting results for spread forecasting between NYISO and PJM.

Methods	NMSE	NMAE	MASE	sMAPE	PER
SNARX [4]	(7) $2.4531$	(7) $1.3415$	(7) $1.1847$	(7) $0.4958$	(7) $0.6781$
WIAE-GPF	(1) $\mathbf{0.0098}$	(1) $\mathbf{0.2738}$	(1) $\mathbf{0.2418}$	(1) $\mathbf{0.4493}$	(1) $\mathbf{0.0606}$
TLAE [21]	(5) $0.9592$	(5) $0.9785$	(5) $0.8641$	(4) $0.4785$	(4) $0.3692$
DeepVAR [11]	(6) $1.8986$	(3) $0.7224$	(3) $0.6380$	(5) $0.4806$	(3) $0.3505$
BWGVT [19]	(3) $0.9053$	(4) $0.8525$	(4) $0.7529$	(3) $0.4674$	(2) $0.2313$
Pyraformer [34]	(4) $0.9478$	(6) $1.2674$	(6) $1.1193$	(6) $0.4909$	(6) $0.6738$
Informer [33]	(2) $0.8045$	(2) $0.4185$	(2) $0.4252$	(2) $0.4580$	(5) $0.5487$

Table 6: Probabilistic forecasting results of spread forecasting between NYISO and PJM.

Methods	CRPS	CPE (90%) [NCW]	CPE (50%) [NCW]	CPE (10%) [NCW]
SNARX [4]	(5) $120.0403$	(5) $-0.5443$ $[5.0419]$	(5) $-0.2958$ $[1.4555]$	(2) $-0.0195$ $[0.6015]$
WIAE-GPF	(1) $\mathbf{4.0329}$	(1) $\mathbf{0.0215}$ $[0.4427]$	(2) ${-0.0274}$ $[0.5255]$	(3) $0.0212$ $[0.1692]$
TLAE [21]	(2) $15.5195$	(3) $-0.0443$ $[0.7745]$	(1) $\mathbf{0.0052}$ $[0.8784]$	(5) $-0.0933$ $[0.3133]$
DeepVAR [11]	(4) $32.8296$	(2) $-0.0355$ $[2.0279]$	(3) ${-0.1480}$ $[1.4739]$	(1) $\mathbf{-0.0021}$ $[0.4198]$
BWGVT [19]	(3) $31.5660$	(4) $0.0989$ $[5.1788]$	(4) $0.1835$ $[6.0939]$	(4) $0.0656$ $[4.0473]$

As seen from Tables 5 and 6, WIAE-GPF performed the best under all point forecasting metrics, PER, CRPS, and the 90% CPE, and ranked second and third under the 50% and 10% CPE, respectively. TLAE was the second-best in CRPS ( $15.5195$ ) but slightly worse than BWGVT when evaluated under point estimation metrics. Its sequential sampling of the latent Gaussian process added to its numerical instability. BWGVT was the overall second-best probabilistic technique. Its transformer architecture, with enhanced capability of capturing long-term temporal dependency, did not offer enough gain to offset the training difficulty imposed by the increasing number of deep-learning parameters; see Sec. 4.6. BWGVT also tended to predict wide intervals covering more than the nominal percentage. The point estimation techniques, Pyraformer and Informer, were not competitive when evaluated under point estimation metrics other than NMSE. Among probabilistic methods, DeepVAR performed similarly to the LLM methods, and SNARX had the most difficulties. These (semi)parametric methods suffered from model mismatch and were sensitive to sudden changes. Thus, shifted peaks and valleys were often observed in their predictions.

The same observation can be made from Fig. 6. WIAE-GPF had the most stable prediction of interregional LMP spreads, which is corroborated by its smallest NMSE and NMAE. Pyraformer also followed the trend of LMP spreads accurately but with higher variance. The two AR-based parametric models, SNARX and DeepVAR, tended to predict shifted spikes and failed to catch the rapid and dramatic changes of the LMP spread.

4.5 Area Control Error Forecasting for Reserve Market Participants

ACE is defined as the difference between actual and scheduled load-generation imbalance, adjusted by the area frequency deviation [44]. It is the control signal for frequency regulation, and its probabilistic forecasting is especially important for the operator to procure resources and market participants to bid in the regulation ancillary service market.

In this subsection, we present the simulation results of 5-minute-ahead forecasting of ACE. We utilized the ACE data from Jan 24th to 26th, collected by PJM. The ACE signal is measured every 15 seconds and can be quite volatile, as shown by the trajectory in Fig. 7.

Table 7: Estimation Results of ACE forecasting for PJM. The prediction step is 5-minute ahead.

Methods	NMSE	NMAE	MASE	sMAPE
SNARX [4]	(6) $1.1922$	(6) $1.1303$	(6) $0.7029$	(6) $0.4605$
WIAE-GPF	(1) $\mathbf{0.5957}$	(1) $\mathbf{0.7555}$	(1) $\mathbf{0.4698}$	(1) $\mathbf{0.1059}$
TLAE [21]	(5) $1.1727$	(5) $1.0605$	(5) $0.6595$	(3) $0.2782$
DeepVAR [11]	(7) $1.4431$	(7) $1.1750$	(7) $0.7307$	(5) $0.3952$
BWGVT [19]	(3) $0.9562$	(2) $0.9793$	(2) $0.6090$	(4) $0.3168$
Pyraformer [34]	(4) $0.9783$	(4) $0.9948$	(4) $0.6186$	(7) $0.4986$
Informer [33]	(2) $0.6006$	(3) $0.9819$	(3) $0.6106$	(2) $0.2247$

Table 8: Probabilistic Estimation Results of ACE forecasting for PJM. The prediction step is 5-minute ahead.

Methods	CRPS	CPE (90%) [NCW]	CPE (50%) [NCW]	CPE (10%) [NCW]
SNARX [4]	(5) $2.1007$	(5) $-0.8260$ $[1.6759]$	(5) $-0.4797$ $[1.6803]$	(2) $0.0343$ $[5.5768]$
WIAE-GPF	(1) $\mathbf{0.0081}$	(1) $\mathbf{-0.0016}$ $[0.9199]$	(1) $\mathbf{0.0321}$ $[0.9336]$	(1) $\mathbf{-0.0132}$ $[0.8885]$
TLAE [21]	(4) $1.5541$	(4) $-0.7857$ $[0.0004]$	(4) $-0.4489$ $[0.0005]$	(4) $-0.0957$ $[0.0027]$
DeepVAR [11]	(3) $1.2947$	(3) $-0.3526$ $[0.5665]$	(3) $-0.2560$ $[0.5296]$	(3) $-0.0521$ $[0.5434]$
BWGVT [19]	(2) $1.2488$	(2) $0.0065$ $[1.8309]$	(2) $0.0754$ $[2.0385]$	(5) $0.0996$ $[2.4261]$

WIAE-GPF achieved better performance than the other methods, with CRPS less than $0.01$ and sMAPE less than $11\%$ , as shown by the WIAE-GPF rows of Tables 7 and 8. We credit the strong performance of WIAE-GPF to the simplicity of its latent process and its Bayesian sufficiency. BWGVT ranked second among all methods since the ACE data has few outliers, but its CRPS of $1.2488$ was dramatically larger than that of WIAE-GPF. Its CPE and NCW for the 10% interval prediction also showed that it could not accurately predict a narrow interval. Pyraformer and Informer, trained with NMSE, performed better under NMSE than under NMAE. With NMSE over $110\%$ , TLAE had the worst performance among GPF methods. DeepVAR and SNARX performed worse than the other forecasting methods, with NMSE and NMAE larger than $110\%$ , possibly due to model mismatch.

4.6 Discussion: On using LLM

Table 9: Statistics that model the long-range dependency of time series.

Metrics	Real Time	Interchange Spread	ACE
Hurst Exponent	$0.5257$	$0.5301$	$0.5351$
DFA	$0.6053$	$0.6614$	$0.8609$

The success of LLM-based prediction in natural language processing ignited broad interest in adopting LLMs in various applications, including electricity price forecasting with BWGVT, Pyraformer, and Informer. Our experiments showed that the innovation-based method (WIAE) performed uniformly better than the three LLM techniques, including in the real-time prediction at LONGIL & NYC under NMSE, the one case where WIAE was not the best among all forecasters (TLAE was; see Table 3). Note that the innovation representation used in WIAE can model long-range dependencies of the random process, although not explicitly. WIAE does not include attention modeling.

As LLM-based forecasting techniques, Pyraformer, Informer, and BWGVT performed better than the more conventional SNARX (see Tables 3, 5, and 7). Compared with the more straightforward deep learning method of DeepVAR, LLM-based forecasting did not show clear advantages. The same can be said when they were compared with TLAE. The authors of [20] pointed out that a simple convolutional neural network outperformed RNNs and LLMs on imbalance price forecasting because the forecasted time series are not a good fit for complicated deep learning models.

To understand if long-range dependencies matter in the probabilistic forecasting of electricity market signals, we examined the characteristics of LMP signals using the Hurst exponent and Detrended Fluctuation Analysis (DFA) as indicators for the long-range dependencies of LMP; both parameters had the range [0,1], with deviation from 0.5 indicating symptoms of long-range dependencies.

Table 9 shows the estimated Hurst exponent and DFA slope. Both displayed a slight deviation from 0.5. English and Korean literary texts [45, 46] are known to have long-range dependencies, with Hurst exponents ranging from 0.64 to 0.73. In comparison, the long-range dependency of real-time electricity market signals is minimal.
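For readers unfamiliar with DFA, the following Python sketch shows one standard way to estimate the DFA scaling exponent with linear detrending. It is an illustrative implementation only: the paper does not state which estimator or window sizes were used for Table 9, so the function name `dfa_exponent`, the choice of non-overlapping windows, and the example scales are all assumptions. An exponent near 0.5 corresponds to an uncorrelated series.

```python
import numpy as np

def dfa_exponent(x, scales):
    """Estimate the DFA scaling exponent of a 1-D series.

    Steps: integrate the mean-removed series, split the profile into
    non-overlapping windows of each size n, remove a least-squares line
    per window, and regress log F(n) on log n.
    """
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())
    fluctuations = []
    for n in scales:
        n_win = profile.size // n
        segments = profile[: n_win * n].reshape(n_win, n).T  # shape (n, n_win)
        t = np.arange(n)
        coeffs = np.polyfit(t, segments, 1)                  # shape (2, n_win)
        trend = np.outer(t, coeffs[0]) + coeffs[1]
        residual = segments - trend
        fluctuations.append(np.sqrt(np.mean(residual ** 2)))
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return float(slope)

# Usage on synthetic data (not market data): white noise versus a random walk.
rng = np.random.default_rng(0)
white = rng.standard_normal(8192)
scales = [16, 32, 64, 128, 256]
alpha_white = dfa_exponent(white, scales)            # theory: about 0.5
alpha_walk = dfa_exponent(np.cumsum(white), scales)  # theory: about 1.5
```

The synthetic comparison gives a sense of scale: strongly persistent series produce exponents well above the values near 0.5 expected for uncorrelated noise.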

While further studies are necessary, the use of LLM may not be suitable for electricity market signals where long-range dependencies are not evident. Indeed, real-time LMPs are computed either on an interval-by-interval basis or as part of short sliding window economic dispatch. Any temporal coupling is a result of temporal dependencies of demand and supplies, neither shown to have long-range dependencies. An unproven hypothesis is that the model complexity of LLM may offset any benefit it may bring to price forecasting.

5 Conclusion

This paper presents WIAE-GPF, a generative AI approach to probabilistic forecasting of nonparametric time series based on the innovation representation pioneered by Wiener, Kallianpur, and Rosenblatt six decades earlier. Three take-away conclusions stand out. First, the innovation representation ensures that WIAE-GPF produces the correct conditional probability distributions under ideal training conditions. To our knowledge, WIAE-GPF is the first nonparametric GPF technique with such a theoretical guarantee. Second, WIAE-GPF demonstrated superior performance against major machine-learning-based probabilistic forecasting techniques in our numerical studies using actual market data, including some of the advanced approaches involving transformer architectures, attention mechanisms, and large language models. In addition, the local stationarity hypothesis assumed by the weak innovation representation appeared to hold well. Third, with Bayesian sufficiency established in this paper, Rosenblatt’s weak innovation representation of time series can be considered a canonical representation and a powerful tool for stochastic decision making. Its applications in anomaly detection in power systems can be found in [47, 48, 49].

Finally, the black-box style of deep learning is often criticized for its lack of interpretability; some consider such methods black magic that produces miraculous results. It is worth mentioning that WIAE-GPF has a highly intuitive and interpretable architecture parallel to that of classic Kalman filtering. In particular, Kalman filtering extracts the innovation representation as part of the measurement update, followed by the time-updated prediction according to the state-space model. WIAE-GPF extracts innovations by the weak innovation encoder and produces time-updated predictions through the weak innovation decoder. In this context, WIAE-GPF is a generalization of Kalman filtering to nonparametric and non-Gaussian settings.
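The Kalman filtering side of this parallel can be illustrated with a scalar linear-Gaussian example. The Python sketch below is not part of WIAE-GPF; it only shows, for a hypothetical model $x_{t+1}=ax_{t}+w_{t}$, $y_{t}=x_{t}+v_{t}$ with assumed parameters, how the measurement update extracts an innovation sequence that is approximately uncorrelated even though the observations themselves are strongly correlated. All function names and parameter values are assumptions chosen for illustration.

```python
import numpy as np

def simulate_observations(a, q, r, T, rng):
    """Simulate y_t = x_t + v_t with x_{t+1} = a x_t + w_t (stationary start)."""
    y = np.empty(T)
    state = rng.normal(0.0, np.sqrt(q / (1.0 - a * a)))
    for t in range(T):
        y[t] = state + rng.normal(0.0, np.sqrt(r))
        state = a * state + rng.normal(0.0, np.sqrt(q))
    return y

def kalman_innovations(y, a, q, r):
    """Return the normalized innovations of a scalar Kalman filter."""
    x_pred, p_pred = 0.0, q / (1.0 - a * a)   # stationary prior
    out = np.empty(len(y))
    for t, obs in enumerate(y):
        s = p_pred + r                        # innovation variance
        e = obs - x_pred                      # innovation (measurement update)
        out[t] = e / np.sqrt(s)
        k = p_pred / s                        # Kalman gain
        x_filt = x_pred + k * e
        p_filt = (1.0 - k) * p_pred
        x_pred = a * x_filt                   # time update (prediction)
        p_pred = a * a * p_filt + q
    return out

def lag1_autocorr(z):
    """Sample lag-1 autocorrelation."""
    return float(np.corrcoef(z[:-1], z[1:])[0, 1])

rng = np.random.default_rng(0)
obs = simulate_observations(a=0.9, q=1.0, r=1.0, T=5000, rng=rng)
innov = kalman_innovations(obs, a=0.9, q=1.0, r=1.0)
```

In this Gaussian setting the normalized innovations are i.i.d. standard normal under a correctly specified model, which is the property that the weak innovation encoder aims to reproduce (with a different reference distribution) without assuming a parametric model.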

Comments on the limitation and future work are in order. WIAE-GPF is derived based on the innovation representation of stationary processes. Extensions to certain classes of nonstationary processes would be the natural next step. Note that an innovation representation exists for nonstationary Gaussian processes with a time-varying state space model. Extension of WIAE-GPF to nonstationary time series under regime-switching models is also a natural next step, given the evidence of effective applications of regime-switching techniques in price forecasting. See [20, 19] and references therein.

Appendix A Proof of Theorem 2

Let $\left(\bar{{\bf V}}_{t}^{(m)}\right)$ and $\left(\bar{{\bm{X}}}_{t}^{(m)}\right)$ denote the latent process and the reconstruction sequence, respectively, under weights $\bar{\theta}_{m}$ and $\bar{\eta}_{m}$ :

	$\displaystyle\bar{{\bf V}}_{t}^{(m)}=G_{\bar{\theta}}^{(m)}({\bm{X}}_{t},{\bm{X}}_{t-1},\cdots,{\bm{X}}_{t-m+1}),$
	$\displaystyle\bar{{\bm{X}}}_{t}^{(m)}=H_{\bar{\eta}}^{(m)}(\bar{{\bf V}}_{t}^{(m)},\bar{{\bf V}}_{t-1}^{(m)},\cdots,\bar{{\bf V}}_{t-m+1}^{(m)}).$

We define the loss of a WIAE pair $(G_{\theta},H_{\eta})$ achieved under a pair of $m$ -dimensional discriminators as

L^{(m)}(\theta,\eta):=\max_{\gamma,\omega}\Big(\mathbb{E}[D_{\gamma}^{(m)}\left({\bf U}_{t:t-m+1}\right)]-\mathbb{E}[D_{\gamma}^{(m)}(\hat{{\bf V}}_{t:t-m+1})]+\lambda\big(\mathbb{E}[D_{\omega}^{(m)}({\bm{X}}_{t-m+2:t+T})]-\mathbb{E}[D_{\omega}^{(m)}(({\bm{X}}_{t-m+2:t},\hat{{\bm{X}}}_{t+1:t+T}))]\big)\Big).

We first show that $L^{(m)}(\theta_{m}^{*},\eta_{m}^{*})\rightarrow 0$ as $m\rightarrow\infty$ , where $(\theta_{m}^{*},\eta_{m}^{*})$ denotes the optimal weights of $(G_{\theta}^{(m)},H_{\eta}^{(m)})$ obtained by minimizing (3).

Following the line of [50], we define the distance between two random processes $({\bm{X}}_{t})$ and $({\bf Y}_{t})$ by the expected $\ell_{\infty}$ norm:

d\left(({\bm{X}}_{t}),({\bf Y}_{t})\right):=\mathbb{E}\left[\sup_{t}\lvert{\bm{X}}_{t}-{\bf Y}_{t}\rvert\right].

The uniform convergence assumed in assumption A2 is also defined on metric spaces with the distance measure $d(\cdot,\cdot)$ . Hence, by assumption A2, $G_{\bar{\theta}}^{(m)}\rightarrow G$ uniformly, which implies that, for every $\epsilon>0$ , there exists an $M_{1}$ such that, for all $m>M_{1}$ , $d\left((\bar{{\bf V}}_{t}^{(m)}),({\bf V}_{t})\right)<\epsilon.$ Thus, for every bounded and continuous $F:\ell^{\infty}(T)\to\mathbb{R}$ ,

d\left(F(({\bf V}_{t})),F((\bar{{\bf V}}_{t}^{(m)}))\right)<\delta(\epsilon).

In other words,

\lim_{m\rightarrow\infty}\mathbb{E}[F((\bar{{\bf V}}^{(m)}_{t}))]=\mathbb{E}[F(({\bf V}_{t}))],

which fulfills the definition of weak convergence. Therefore,

\bar{{\bf V}}_{t:t-m+1}^{(m)}\stackrel{d}{\rightarrow}{\bf V}_{t:t-m+1},\tag{10}

since convergence of the expectations of all bounded continuous functionals implies convergence in distribution.

Similarly, by the uniform convergence of $H_{\bar{\eta}}^{(m)}$ to $H$ , there exists an $M_{2}$ such that, for all $m>M_{2}$ ,

d\left((\hat{{\bm{X}}}_{t}),(H_{\bar{\eta}}^{(m)}(({\bf V}_{t})))\right)<\epsilon,

where $(H_{\bar{\eta}}^{(m)}(({\bf V}_{t})))$ represents the random sequence generated by passing $({\bf V}_{t})$ through $H_{\bar{\eta}}^{(m)}$ . Thus, for every bounded and continuous $F:\ell^{\infty}(T)\to\mathbb{R}$ ,

d\left(F((\hat{{\bm{X}}}_{t})),F((H_{\bar{\eta}}^{(m)}(({\bf V}_{t}))))\right)<\delta(\epsilon).

Hence we have $(H_{\bar{\eta}}^{(m)}(({\bf V}_{t}))))$ converges in distribution to $(\hat{{\bm{X}}}_{t})$ . Since $H$ is continuous and $H_{\bar{\eta}}^{(m)}$ converges uniformly to $H$ , $H_{\bar{\eta}}^{(m)}$ is also continuous. Thus, by continuous map** theorem,

\bar{{\bf V}}_{t-m+1:t}^{(m)}\stackrel{{\scriptstyle\mbox{\tiny d}}}{{\rightarrow}}{\bf V}_{t-m+1:t}\;\Rightarrow\;H_{\bar{\eta}}^{(m)}(\bar{{\bf V}}_{t-m+1:t}^{(m)})\stackrel{{\scriptstyle\mbox{\tiny d}}}{{\rightarrow}}H_{\bar{\eta}}^{(m)}\left({\bf V}_{t-m+1:t}\right),

that is, $\bar{{\bm{X}}}_{t}^{(m)}\stackrel{{\scriptstyle\mbox{\tiny d}}}{{\rightarrow}}H_{\bar{\eta}}^{(m)}({\bf V}_{t},\cdots,{\bf V}_{t-m+1})$ . Therefore,

\displaystyle({\bm{X}}_{t-m+2:t},\bar{{\bm{X}}}_{t+T}^{(m)})\stackrel{{\scriptstyle\mbox{\tiny d}}}{{\rightarrow}}({\bm{X}}_{t-m+2:t},\hat{{\bm{X}}}_{t+T})\stackrel{{\scriptstyle\mbox{\tiny d}}}{{=}}({\bm{X}}_{t-m+2:t},{\bm{X}}_{t+T}).

(11)

By (10) and (11), $L^{(m)}(\bar{\theta}_{m},\bar{\eta}_{m})\rightarrow 0$ . Since $\theta^{*}_{m}$ and $\eta^{*}_{m}$ are the optimal parameters obtained by minimizing (3) evaluated by $m$ -dimensional discriminators $(D_{\omega}^{(m)},D_{\gamma}^{(m)})$ ,

\displaystyle L^{(m)}(\theta_{m}^{*},\eta_{m}^{*}):=\min_{\theta,\eta}L^{(m)}(\theta,\eta)\leq L^{(m)}(\bar{\theta}_{m},\bar{\eta}_{m})\rightarrow 0.

Because $L^{(m)}(\theta_{m}^{*},\eta_{m}^{*})\rightarrow 0$ as $m\rightarrow\infty$ , $\bm{V}_{t:t-m+1}^{(m)}\stackrel{{\scriptstyle\mbox{\tiny d}}}{{\rightarrow}}\bm{V}_{t:t-m+1}$ and $({\bm{X}}_{t-m+2:t},\hat{\bm{X}}_{t+T}^{(m)})\stackrel{{\scriptstyle\mbox{\tiny d}}}{{\rightarrow}}({\bm{X}}_{t-m+2:t},\bm{X}_{t+T})$ follow directly from the equivalence of convergence in Wasserstein distance and convergence in distribution [51]. Since the discriminator dimensionality also goes to $\infty$ , we have $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+T}^{(m)})\stackrel{{\scriptstyle\mbox{\tiny d}}}{{\rightarrow}}({\bm{X}}_{0:t},{\bm{X}}_{t+T})$ . Further, the convergence in distribution of $\hat{{\bm{X}}}_{t+T}^{(m)}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}$ to ${\bm{X}}_{t+T}|{\bm{X}}_{0:t}={\bm{x}}_{0:t}$ follows from a simple application of the Bayes rule. $\square$

Appendix B Definition of Metrics for Time Series Forecasting

Given the original time series $({\bm{x}}_{t})$ , the forecasts $(\tilde{{\bm{x}}}_{t})$ , the dataset size $N$ , and the prediction step $T$ , the point estimation metrics are computed as

	$\displaystyle\mbox{NMSE}=\frac{\frac{1}{N-T}\sum_{t=T+1}^{N}({\bm{x}}_{t}-\tilde{{\bm{x}}}_{t-T})^{2}}{\frac{1}{N-T}\sum_{t=T+1}^{N}{\bm{x}}_{t}^{2}},$
	$\displaystyle\mbox{NMAE}=\frac{\frac{1}{N-T}\sum_{t=T+1}^{N}\|{\bm{x}}_{t}-\tilde{{\bm{x}}}_{t-T}\|}{\frac{1}{N-T}\sum_{t=T+1}^{N}\|{\bm{x}}_{t}\|},$
	$\displaystyle\mbox{MASE}=\frac{\frac{1}{N-T}\sum_{t=T+1}^{N}\|{\bm{x}}_{t}-\tilde{{\bm{x}}}_{t-T}\|}{\frac{1}{N-T}\sum_{t=T+1}^{N}\|{\bm{x}}_{t}-{\bm{x}}_{t-T}\|},$
	$\displaystyle\mbox{sMAPE}=\frac{1}{N-T}\sum_{t=T+1}^{N}\frac{\|{\bm{x}}_{t}-\tilde{{\bm{x}}}_{t-T}\|}{(\|{\bm{x}}_{t}\|+\|\tilde{{\bm{x}}}_{t-T}\|)/2}.$

Multiple metrics are adopted to evaluate forecasting performance comprehensively. NMSE and NMAE evaluate the overall performance, and MASE measures performance relative to the naive forecaster: methods with MASE smaller than $1$ outperform the naive forecaster. sMAPE is the symmetric counterpart of the mean absolute percentage error (MAPE) and is bounded both above and below. Because the actual values in electricity datasets can be very close to $0$ , which renders MAPE ineffective, we regard sMAPE as the better metric.
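As an illustration, the four point metrics can be sketched in a few lines of Python for a scalar series. This is a minimal sketch rather than the evaluation code used in the paper: the function name `point_metrics` and the array layout (where `x_fc[t]` holds the forecast of `x[t]` issued `T` steps earlier) are assumptions made for the example, and every sum is taken over $t=T+1,\dots,N$ .

```python
from typing import Dict, Sequence


def point_metrics(x: Sequence[float], x_fc: Sequence[float], T: int) -> Dict[str, float]:
    """Point metrics for a T-step-ahead forecaster on a scalar series.

    Assumed layout (hypothetical, for illustration only): x[t] is the
    observation at index t and x_fc[t] is the forecast of x[t] issued T
    steps earlier. The first T entries of x_fc are ignored.
    """
    if len(x) != len(x_fc) or not 0 < T < len(x):
        raise ValueError("need equal-length series and 0 < T < len(x)")
    idx = range(T, len(x))  # 0-based counterpart of t = T+1, ..., N
    n = len(idx)
    abs_err = [abs(x[t] - x_fc[t]) for t in idx]
    mse = sum(e * e for e in abs_err) / n
    mae = sum(abs_err) / n
    # Error of the naive forecaster in the MASE denominator: x[t - T] predicts x[t].
    naive_mae = sum(abs(x[t] - x[t - T]) for t in idx) / n
    smape_terms = [e / ((abs(x[t]) + abs(x_fc[t])) / 2) for e, t in zip(abs_err, idx)]
    return {
        "NMSE": mse / (sum(x[t] ** 2 for t in idx) / n),
        "NMAE": mae / (sum(abs(x[t]) for t in idx) / n),
        "MASE": mae / naive_mae,
        "sMAPE": sum(smape_terms) / n,
    }
```

The sketch does not guard against zero denominators (for example, a time step where both the observation and the forecast are exactly zero in sMAPE); a practical implementation would need a convention for those cases.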

For probabilistic methods, we further evaluate the continuous ranked probability score (CRPS), computed as

\displaystyle\mbox{CRPS}=\frac{1}{N-T}\sum_{t=T+1}^{N}\left(\int_{\mathbb{R}}\left(\tilde{F}({\bm{x}}|{\bm{x}}_{0:t-T})-\mathbb{I}_{{\bm{x}}_{t}\leq{\bm{x}}}\right)^{2}d{\bm{x}}\right),

where $\mathbb{I}$ is the indicator function and $\tilde{F}({\bm{x}}|{\bm{x}}_{0:t-T})$ is the empirical cumulative distribution function (c.d.f.) of $\tilde{{\bm{X}}}_{t-T}$ conditioned on ${\bm{X}}_{0:t-T}={\bm{x}}_{0:t-T}$ , as predicted by the probabilistic forecasting method. CRPS thus compares the forecasted empirical conditional c.d.f. with the indicator c.d.f. $\mathbb{I}_{{\bm{x}}_{t}\leq{\bm{x}}}$ of the true value ${\bm{x}}_{t}$ . It can be viewed as a generalization of MAE to probabilistic methods.
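When the forecaster outputs samples rather than a closed-form c.d.f., the inner integral can be evaluated without numerical integration. The sketch below is one possible implementation and not necessarily the one used in the paper. It relies on the standard identity $\mbox{CRPS}(F,y)=\mathbb{E}|X-y|-\frac{1}{2}\mathbb{E}|X-X'|$ , with $X,X'$ independent draws from $F$ , applied to the empirical c.d.f. of the samples. The names `crps_from_samples` and `mean_crps` are hypothetical.

```python
from typing import Sequence


def crps_from_samples(samples: Sequence[float], y: float) -> float:
    """CRPS of the empirical c.d.f. of `samples` against the realized value y.

    Uses CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, which for an empirical
    c.d.f. coincides with the integral of (F(x) - 1{y <= x})^2 over the reals.
    """
    m = len(samples)
    if m == 0:
        raise ValueError("need at least one sample")
    term_obs = sum(abs(s - y) for s in samples) / m
    term_spread = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term_obs - term_spread


def mean_crps(sample_sets: Sequence[Sequence[float]], targets: Sequence[float]) -> float:
    """Average the per-instance CRPS over all forecast instances."""
    if len(sample_sets) != len(targets) or not targets:
        raise ValueError("need one non-empty sample set per target")
    scores = [crps_from_samples(s, y) for s, y in zip(sample_sets, targets)]
    return sum(scores) / len(scores)
```

With a single sample the score reduces to the absolute error, which is the sense in which CRPS generalizes MAE. The pairwise term costs $O(M^{2})$ in the number of samples $M$ ; sorting-based formulas reduce this cost but are omitted here for clarity.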

The coverage probability (CP) of a confidence-interval predictor is the (estimated) probability that the ground truth falls within the predicted interval. For a $T$ -step prediction of $\beta\%$ -intervals, we denote the upper and lower bounds by $\hat{U}_{t|t-T,\beta}$ and $\hat{L}_{t|t-T,\beta}$ , respectively. CP is computed as

\displaystyle\mbox{CP}(\beta\%)=\frac{1}{N-T}\sum_{t=T+1}^{N}\mathbb{I}_{{\bm{x}}_{t}\in[\hat{L}_{t|t-T,\beta},\hat{U}_{t|t-T,\beta}]}.

The closer the CP is to its nominal value $\beta\%$ , the more accurate the prediction. Thus, the coverage probability error (CPE) is often adopted for evaluation. CPE measures the deviation of CP from its nominal value $\beta\%$ :

\mbox{CPE}(\beta\%)=\mbox{CP}(\beta\%)-\beta\%.

A CPE closer to zero indicates more accurate prediction interval estimation.
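A minimal sketch of the CP and CPE computation follows. It assumes that the observations and the interval bounds have already been aligned, so that entry `i` of each list refers to the same target time; the function name `coverage` is hypothetical.

```python
from typing import Sequence, Tuple


def coverage(x: Sequence[float], lower: Sequence[float],
             upper: Sequence[float], beta: float) -> Tuple[float, float]:
    """Return (CP, CPE) as fractions for nominal coverage beta given in percent.

    Assumes x[i], lower[i], upper[i] all refer to the same target time.
    """
    if not (len(x) == len(lower) == len(upper)) or not x:
        raise ValueError("need equal-length, non-empty inputs")
    hits = sum(1 for xi, lo, up in zip(x, lower, upper) if lo <= xi <= up)
    cp = hits / len(x)
    return cp, cp - beta / 100.0
```

A negative CPE means the intervals cover the ground truth less often than the nominal level, and a positive CPE means they cover it more often.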

Although CP and CPE are widely adopted for their simplicity, they only estimate the unconditional coverage and therefore do not measure the accuracy of the coverage based on the forecasted conditional probability distribution. This limitation is discussed in [52].

In particular, while a good forecaster produces small CPE and a forecaster with high CPE must be a poor forecaster, a forecaster producing small CPE may not be a good forecaster. To this end, the normalized coverage width (NCW) can be used as a secondary measure. NCW is defined as

\displaystyle\mbox{NCW}(\beta\%)=\frac{1}{N-T}\sum_{t=T+1}^{N}\frac{\hat{U}_{t|t-T,\beta}-\hat{L}_{t|t-T,\beta}}{\hat{U}_{\beta}-\hat{L}_{\beta}},

where $\hat{U}_{\beta}$ and $\hat{L}_{\beta}$ are the bounds of the prediction interval estimated from the empirical quantiles of the testing data. For instance, when predicting a $90\%$ interval, $\hat{U}_{90}$ is the empirical $0.95$ -quantile of the testing set, whereas $\hat{L}_{90}$ is the empirical $0.05$ -quantile. NCW is thus the average width of the predicted intervals normalized by the width of the interval estimated from the empirical marginal distribution of the testing set. One would expect that conditioning on observations yields a more concentrated prediction interval than the interval based on the unconditional distribution. Hence, a method with NCW smaller than $1$ produces tighter prediction intervals than the unconditional estimate. At a similar level of CP, the method with the smaller NCW estimates prediction intervals more accurately.
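The normalization can be sketched as follows. The text does not specify which empirical-quantile convention is used, so the linear-interpolation rule in the hypothetical helper `empirical_quantile` is an assumption, as is the function name `ncw`.

```python
from typing import Sequence


def empirical_quantile(values: Sequence[float], q: float) -> float:
    """Quantile by linear interpolation between order statistics.

    This is one common convention, assumed here for illustration.
    """
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def ncw(x_test: Sequence[float], lower: Sequence[float],
        upper: Sequence[float], beta: float) -> float:
    """Average predicted interval width divided by the unconditional width."""
    if len(lower) != len(upper) or not lower:
        raise ValueError("need equal-length, non-empty interval bounds")
    tail = (1.0 - beta / 100.0) / 2.0  # e.g. 0.05 for a 90% interval
    ref_width = empirical_quantile(x_test, 1.0 - tail) - empirical_quantile(x_test, tail)
    widths = [up - lo for lo, up in zip(lower, upper)]
    return (sum(widths) / len(widths)) / ref_width
```

For example, if the testing set spans the integers 0 to 100 and every predicted $90\%$ interval has width 45, the reference width is about 90 and the NCW is about 0.5.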

References

[1] A. Green, The great acceleration: Cio perspectives on generative ai, Tech. rep., MIT Technology Review Insights (2023).
URL https://www.databricks.com/sites/default/files/2023-07/ebook_mit-cio-generative-ai-report.pdf
[2] W. Härdle, H. Lütkepohl, R. Chen, A review of nonparametric time series analysis, International Statistical Review / Revue Internationale de Statistique 65 (1) (1997) 49–72. doi:10.2307/1403432.
[3] J. Nowotarski, R. Weron, Computing electricity spot price prediction intervals using quantile regression and forecast averaging, Computational Statistics 30 (3) (2015) 791–803. doi:10.1007/s00180-014-0523-0.
[4] R. Weron, A. Misiorek, Forecasting spot electricity prices: A comparison of parametric and semiparametric time series models, International Journal of Forecasting 24 (4) (2008) 744–763. doi:10.1016/j.ijforecast.2008.08.004.
[5] T. Hong, P. Pinson, Y. Wang, R. Weron, D. Yang, H. Zareipour, Energy forecasting: A review and outlook, IEEE Open Access Journal of Power and Energy 7 (2020) 376–388. doi:10.1109/OAJPE.2020.3029979.
[6] M. Zhou, Z. Yan, Y. X. Ni, G. Li, Y. Nie, Electricity price forecasting with confidence-interval estimation through an extended ARIMA approach, IEE Proceedings - Generation, Transmission and Distribution 153 (2) (2006) 187–195.
[7] J. P. González, A. M. S. Muñoz San Roque, E. A. Pérez, Forecasting functional time series with a new Hilbertian ARMAX model: Application to electricity price forecasting, IEEE Transactions on Power Systems 33 (1) (2018) 545–556. doi:10.1109/TPWRS.2017.2700287.
[8] B. Uniejewski, R. Weron, Regularized quantile regression averaging for probabilistic electricity price forecasting, Energy Economics 95 (2021) 105121. doi:10.1016/j.eneco.2021.105121.
[9] L. M. Lima, P. Damien, D. W. Bunn, Bayesian predictive distributions for imbalance prices with time-varying factor impacts, IEEE Transactions on Power Systems 38 (1) (2023) 349–357. doi:10.1109/TPWRS.2022.3165149.
[10] S. Chai, Z. Xu, Y. Jia, Conditional density forecast of electricity price based on ensemble elm and logistic emos, IEEE Transactions on Smart Grid 10 (3) (2019) 3031–3043. doi:10.1109/TSG.2018.2817284.
[11] D. Salinas, M. Bohlke-Schneider, L. Callot, R. Medico, J. Gasthaus, High-dimensional multivariate forecasting with low-rank gaussian copula processes, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
[12] G. Dudek, Multilayer perceptron for GEFCom2014 probabilistic electricity price forecasting, International Journal of Forecasting 32 (3) (2016) 1057–1060. doi:10.1016/j.ijforecast.2015.11.009.
[13] D. Lee, H. Shin, R. Baldick, Bivariate probabilistic wind power and real-time price forecasting and their applications to wind power bidding strategy development, IEEE Transactions on Power Systems 33 (6) (2018) 6087–6097. doi:10.1109/TPWRS.2018.2830785.
[14] J. Nowotarski, R. Weron, Recent advances in electricity price forecasting: A review of probabilistic forecasting, Renewable and Sustainable Energy Reviews 81 (2018) 1548–1568. doi:10.1016/j.rser.2017.05.234.
[15] D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Fifth Edition, Chapman and Hall/CRC., 2011.
[16] B. Uniejewski, R. Weron, Regularized quantile regression averaging for probabilistic electricity price forecasting, Energy Economics 95 (2021) 105121. doi:10.1016/j.eneco.2021.105121.
URL https://www.sciencedirect.com/science/article/pii/S0140988321000268
[17] C. Zhang, Y. Fu, Probabilistic electricity price forecast with optimal prediction interval, IEEE Transactions on Power Systems (2023) 1–10. doi:10.1109/TPWRS.2023.3235193.
[18] J.-F. Toubeau, T. Morstyn, J. Bottieau, K. Zheng, D. Apostolopoulou, Z. De Grève, Y. Wang, F. Vallée, Capturing spatio-temporal dependencies in the probabilistic forecasting of distribution locational marginal prices, IEEE Transactions on Smart Grid 12 (3) (2021) 2663–2674. doi:10.1109/TSG.2020.3047863.
[19] J. Bottieau, Y. Wang, Z. De Grève, F. Vallée, J.-F. Toubeau, Interpretable transformer model for capturing regime switching effects of real-time electricity prices, IEEE Transactions on Power Systems 38 (3) (2023) 2162–2176. doi:10.1109/TPWRS.2022.3195970.
[20] V. N. Ganesh, D. Bunn, Forecasting imbalance price densities with statistical methods and neural networks, IEEE Transactions on Energy Markets, Policy and Regulation 2 (1) (2024) 30–39. doi:10.1109/TEMPR.2023.3293693.
[21] N. Nguyen, B. Quanz, Temporal Latent Auto-Encoder: A Method for Probabilistic Multivariate Time Series Forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (10) (2021) 9117–9125. doi:10.1609/aaai.v35i10.17101.
[22] Z. Zheng, L. Wang, L. Yang, Z. Zhang, Generative Probabilistic Wind Speed Forecasting: A Variational Recurrent Autoencoder Based Method, IEEE Transactions on Power Systems 37 (2) (2022) 1386–1398. doi:10.1109/TPWRS.2021.3105101.
[23] L. Li, J. Zhang, J. Yan, Y. **, Y. Zhang, Y. Duan, G. Tian, Synergetic Learning of Heterogeneous Temporal Sequences for Multi-Horizon Probabilistic Forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (10) (2021) 8420–8428. doi:10.1609/aaai.v35i10.17023.
[24] M. Khodayar, S. Mohammadi, M. E. Khodayar, J. Wang, G. Liu, Convolutional Graph Autoencoder: A Generative Deep Neural Network for Probabilistic Spatio-Temporal Solar Irradiance Forecasting, IEEE Transactions on Sustainable Energy 11 (2) (2020) 571–583. doi:10.1109/TSTE.2019.2897688.
[25] K. Rasul, A.-S. Sheikh, I. Schuster, U. M. Bergmann, R. Vollgraf, Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows, 2022.
[26] Y. Li, X. Lu, Y. Wang, D. Dou, Generative time series forecasting with diffusion, denoise, and disentanglement, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, Vol. 35, Curran Associates, Inc., 2022, pp. 23009–23022.
[27] A. Koochali, P. Schichtel, A. Dengel, S. Ahmed, Probabilistic Forecasting of Sensory Data With Generative Adversarial Networks – ForGAN, IEEE Access 7 (2019) 63868–63880. doi:10.1109/ACCESS.2019.2915544.
[28] K. Yeo, Z. Li, W. Gifford, Generative Adversarial Network for Probabilistic Forecast of Random Dynamical Systems, SIAM Journal on Scientific Computing 44 (4) (2022) A2150–A2175. doi:10.1137/21M1457448.
[29] Z. Zhang, M. Wu, Predicting real-time locational marginal prices: A gan-based approach, IEEE Transactions on Power Systems 37 (2) (2022) 1286–1296. doi:10.1109/TPWRS.2021.3106263.
[30] Y. Li, Y. Ding, Y. Liu, T. Yang, P. Wang, J. Wang, W. Yao, Dense skip attention based deep learning for day-ahead electricity price forecasting, IEEE Transactions on Power Systems 38 (5) (2023) 4308–4327. doi:10.1109/TPWRS.2022.3217579.
[31] H. Xu, F. Hu, X. Liang, M. A. Gunmi, Attention mechanism multi-size depthwise convolutional long short-term memory neural network for forecasting real-time electricity prices, IEEE Transactions on Power Systems (2024) 1–12. doi:10.1109/TPWRS.2024.3353759.
[32] S. Majumder, L. Dong, F. Doudi, Y. Cai, C. Tian, D. Kalathi, K. Ding, A. A. Thatte, N. Li, L. Xie, Exploring the capabilities and limitations of large language models in the electric energy sector, arXiv:2403.09125 (2024). arXiv:2403.09125.
[33] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (12) (2021) 11106–11115. doi:10.1609/aaai.v35i12.17325.
[34] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dustdar, Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting, 2022.
[35] N. Wiener, Nonlinear Problems in Random Theory, Technology Press of Massachusetts Institute of Technology, Cambridge, MA, 1958.
[36] M. Rosenblatt, Stationary Processes as Shifts of Functions of Independent Random Variables, Journal of Mathematics and Mechanics 8 (5) (1959) 665–681.
[37] X. Wang, L. Tong, Q. Zhao, Generative probabilistic time series forecasting and applications in grid operations, to appear in the Proceedings of Conference on Information Sciences and Systems (2024).
URL https://arxiv.org/abs/2402.13870
[38] X. Wang, L. Tong, Innovations Autoencoder and its Application in One-class Anomalous Sequence Detection, Journal of Machine Learning Research 23 (49) (2022) 1–27.
[39] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, arXiv:1701.07875 (Jan. 2017).
[40] P. J. Bickel, K. A. Doksum, Mathematical statistics: basic ideas and selected topics. (2nd ed.), Vol. 1, Pearson Prentice Hall, Upper Saddle River, N.J., 2007.
[41] M. White, R. Pike, C. Brown, R. Coutu, B. Ewing, S. Johnson, C. Mendrala, White paper: Inter-regional interchange scheduling analysis and options, Tech. rep., ISO New England and New York ISO (January 2011).
[42] T. Gneiting, M. Katzfuss, Probabilistic forecasting, Annual Review of Statistics and Its Application 1 (1) (2014) 125–151. doi:10.1146/annurev-statistics-062713-085831.
[43] E. Tómasson, M. R. Hesamzadeh, F. A. Wolak, Optimal offer-bid strategy of an energy storage portfolio: A linear quasi-relaxation approach, Applied Energy 260 (2020) 114251. doi:10.1016/j.apenergy.2019.114251.
[44] NERC, Balancing and frequency control, Tech. rep., NERC Resource Subcommittee, Princeton, NJ (January 2011).
URL https://www.nerc.com/comm/OC/BAL0031_Supporting_Documents_2017_DL/NERC%20Balancing%20and%20Frequency%20Control%20040520111.pdf
[45] M. A. Montemurro, P. A. Pury, Long-range fractal correlations in literary corpora, Fractals 10 (4) (2002) 451–461.
[46] J. Bhan, S. Kim, J. Kim, Y. Kwon, S. il Yang, K. Lee, Long-range correlations in Korean literary corpora, Chaos, Solitons & Fractals 29 (1) (2006) 69–81. doi:10.1016/j.chaos.2005.08.214.
[47] X. Wang, L. Tong, Innovations autoencoder and its application in one-class anomalous sequence detection, J. Mach. Learn. Res. 23 (1) (Jan 2022).
[48] K. R. Mestav, X. Wang, L. Tong, A deep learning approach to anomaly sequence detection for high-resolution monitoring of power systems, IEEE Transactions on Power Systems 38 (1) (2023) 4–13. doi:10.1109/TPWRS.2022.3168529.
[49] L. Tong, X. Wang, Q. Zhao, Grid monitoring and protection with continuous point-on-wave measurements and generative ai, arXiv:2403.06942 (2024). arXiv:2403.06942.
URL https://arxiv.org/abs/2403.06942
[50] J. Hoffmann-Jørgensen, Stochastic Processes on Polish Spaces, Aarhus Universitet, Matematisk Institut., Aarhus, Denmark, 1991.
[51] C. Villani, The Wasserstein distances, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 57–75. doi:10.1007/978-3-540-71050-9_6.
[52] P. F. Christoffersen, Evaluating Interval Forecasts, International Economic Review 39 (4) (1998) 841–862. doi:10.2307/2527341.

