Forecasting Electricity Market Signals via Generative AI

This work was supported in part by the National Science Foundation under Award EECS 2218110.

Abstract

This paper presents a generative artificial intelligence approach to probabilistic forecasting of electricity market signals, such as real-time locational marginal prices and area control error signals. Inspired by the Wiener-Kallianpur innovation representation of nonparametric time series, we propose a weak innovation autoencoder architecture and a novel deep learning algorithm that extracts the canonical independent and identically distributed innovation sequence of the time series, from which samples of future time series are generated. The validity of the proposed approach is established by proving that, under ideal training conditions, the generated samples have the same conditional probability distribution as that of the ground truth. Three applications involving highly dynamic and volatile time series in real-time market operations are considered: (i) locational marginal price forecasting for self-scheduled resources such as battery storage participants, (ii) interregional price spread forecasting for virtual bidders in interchange markets, and (iii) area control error forecasting for frequency regulation. Numerical studies based on market data from multiple independent system operators demonstrate the superior performance of the proposed generative forecaster over leading classical and modern machine learning techniques under both probabilistic and point forecasting metrics.

keywords:
Probabilistic forecasting, electricity price forecasting, representation learning, generative artificial intelligence, energy and ancillary market forecasting, and risk-sensitive market operations.
journal: International Journal of Forecasting
affiliation: School of Electrical and Computer Engineering, Cornell University, Ithaca, New York 14853, USA

1 Introduction

A defining feature of generative artificial intelligence (AI) is its ability to produce artificial samples that resemble reality. In particular, generative AI learns the underlying structure of a phenomenon from examples and then produces an arbitrarily large number of artificial samples with the same properties as those examples. Generative AI with sophisticated neural network structures and machine learning methods has achieved remarkable performance in real-world applications unmatched by conventional techniques [1].

The classical subject of forecasting has a natural connection with generative AI through probabilistic forecasting. Probabilistic forecasting aims to obtain the conditional probability distribution of the future time series given past observations. Once such a distribution is obtained, Monte-Carlo samples of the future time series can be generated. However, forecasting a conditional probability distribution faces daunting computational and sample complexities. First, nonparametric distribution forecasting is an infinite-dimensional functional estimation problem, typically requiring a finite-dimensional reduction through a histogram with a finite number of bins or quantiles with finite levels. Second, for continuously distributed random time series, there is only one realization of the future associated with the particular history on which the conditional probability is defined, which makes learning the conditional distribution from data fundamentally difficult.

This paper develops a generative probabilistic forecasting (GPF) solution based on the generative AI principle that directly generates future time series samples based on current and past observations, bypassing the modeling, computational, and sample complexity challenges of forecasting the conditional probability distribution. Currently, such nonparametric GPF techniques do not exist. At the theoretical level, it is unknown what the GPF implementation architecture would be to guarantee the generation of samples with the correct conditional distribution.

1.1 Literature Review

The literature on parametric and nonparametric probabilistic forecasting of electricity market signals is vast. Our review here will focus on short-term (real-time) forecasting techniques for wholesale electricity prices and dispatch quantities such as area imbalances. To this end, we rely on earlier surveys [2, 3, 4] and pay more attention to contemporary machine learning-based techniques over the last decade or so.

Because wholesale electricity prices and dispatch imbalances are endogenously determined by optimization-based market clearing, they tend to be highly volatile due to their sensitivity to binding constraints. Overall, real-time prices and dispatch quantities behave quite differently from exogenous (physical) processes such as wind, solar, and demand time series. For this reason, our review excludes the extensive literature on energy forecasting, even though many techniques also apply to price forecasting. See [5] and references therein for energy forecasting.

Mainstream probabilistic forecasting techniques are from parametric or nonparametric families. Parametric forecasting predicts a parameterized conditional probability distribution of future time series variables, reducing an infinite-dimensional inference problem to one of finite-dimensional estimation. Popular approaches include autoregressive and moving average models [6, 7, 8, 9, 10, 11], Gaussian models [12], Student’s $t$-distribution [13], and others [14]. A classic benchmark with strong performance is SNARX [4], which we include in our comparison. While restricting probability distributions to be parametric results in tractable computations, it comes with the loss of accuracy from model mismatch.

Nonparametric forecasting has a long history. See [2] for a review up to 1997. These techniques estimate the underlying probability distribution or its derived properties (e.g., quantiles) without assuming a parametric form. Classical techniques [15] face formidable sample and computational complexities, further magnified when the underlying time series has arbitrary temporal dependencies. Quantile regression is one of the most popular techniques for forecasting electricity prices. By estimating multiple quantiles, one obtains a histogram approximation of the underlying probability density function. One well-recognized example using quantile regression for day-ahead LMP forecasting is [16]. Over the last decade, modern machine learning techniques have been widely adopted to compute the conditional quantiles. Examples include Extreme Learning Machine (ELM) [17], Recurrent Neural Network (RNN) [18], and attention mechanism [19]. See [20] for a comparison study among multiple quantile regression methods. These techniques, however, are limited by the resolution of the quantiles and cannot produce real-valued generative samples corresponding to the true distribution.

Most machine learning-based GPF methods involve a deep neural network (possibly with sophisticated LSTM or transformer architectures) that maps past observations and a set of exogenous random variables to samples of future time series. Distinguishing these techniques are methods used to train the neural network and ways of generating exogenous random variables. Examples include variational autoencoder (VAE) based methods [21, 22, 23, 24] and the more recent normalizing flow and denoising diffusion techniques [25, 26]. Generative adversarial network (GAN) learning has also been used [27, 28]. Missing in these methods is some form of theoretical guarantee that the generative samples should follow the desired conditional probability distribution.

Unlike point forecasting problems where the ground-truth samples are available, evaluating GPF is particularly challenging because of the lack of ground-truth distribution. Aside from standard evaluation criteria (see B), one way is to evaluate GPF by the performance of GPF-derived point forecasting techniques, since GPF can produce almost any type of point forecasts. The argument is that if a GPF method performs well, then all the GPF-derived point forecasting methods should perform well under point forecasting metrics. Thus, it is relevant to review some of the well-tested point forecasting techniques with which we compare the point forecasting techniques derived by the proposed GPF method.
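To make the idea of GPF-derived point forecasts concrete, the following minimal sketch shows how several point and interval forecasts can be read off a single set of generated samples. The function name `point_forecasts`, the nearest-rank quantile rule, and the synthetic lognormal samples standing in for generator output are illustrative assumptions, not part of the proposed method.

```python
import random
import statistics


def point_forecasts(samples, coverage=0.9):
    """Derive point and interval forecasts from generated samples of one future value."""
    ordered = sorted(samples)
    n = len(ordered)

    def quantile(p):
        # Nearest-rank empirical quantile on the sorted samples.
        idx = min(n - 1, max(0, int(p * n)))
        return ordered[idx]

    tail = (1.0 - coverage) / 2.0
    return {
        "mean": statistics.fmean(ordered),     # minimizes expected squared error
        "median": statistics.median(ordered),  # minimizes expected absolute error
        "interval": (quantile(tail), quantile(1.0 - tail)),
    }


# Synthetic stand-in for generator output: a right-skewed, spiky distribution.
rng = random.Random(0)
samples = [30.0 + rng.lognormvariate(0.0, 1.0) for _ in range(5000)]
summary = point_forecasts(samples)
```

For a right-skewed sample set such as this one, the mean and median forecasts differ noticeably, which is one reason a single generator can be scored under several point forecasting metrics.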

One of the earliest GAN-based point forecasting techniques for real-time LMP is [29]. Very recently, the success of large language models (LLMs) in natural language processing motivated their application in time series forecasting [30, 31], including the direct application of ChatGPT to price forecasting [32]. While there has been timely discussion of LLMs’ capabilities and limitations in power system applications [32], there is a lack of understanding of the rationale for using highly complex LLMs for LMP forecasting and the reasons for their improvement over more conventional machine learning techniques. We include transformer-based point forecasting techniques as benchmarks for comparison in Sec. 4, given their demonstrated superior performance for natural languages and other time series. Informer [33] and Pyraformer [34] are point forecasting techniques that adopted the popular transformer model to capture the temporal dependencies. For their award-winning contributions, we include [33] and [34] in our comparative study in this paper.

1.2 Summary of Contributions

We propose Weak Innovation Autoencoder-based GPF (WIAE-GPF), a novel approach inspired by the classic Wiener-Kallianpur innovation representation of nonparametric time series [35] and a relaxation by Rosenblatt [36]. A key contribution of this work is to establish formally that the GPF architecture shown in Fig. 1 is “provably correct.” By provably correct, we mean that with an optimally trained WIAE autoencoder $(G_{\theta^{*}},H_{\eta^{*}})$, the WIAE-GPF output $\tilde{{\bm{X}}}_{t}$ at time $t$ follows the probability distribution of the future variable ${\bm{X}}_{t+T}$ given $({\bm{X}}_{0}={\bm{x}}_{0},\cdots,{\bm{X}}_{t}={\bm{x}}_{t})$, the observed time series up to time $t$.

Figure 1: Forecasting pipeline for WIAE-GPF.

Note that the generator output $\tilde{{\bm{X}}}_{t}$ is a function of the observed time series ${\bm{x}}_{0:t}$ and an independently generated exogenous random vector $\tilde{\mathcal{V}}_{t}=({\bf V}_{1},\cdots,{\bf V}_{T})$ with independent and identically distributed (IID) uniform components, making $\tilde{{\bm{X}}}_{t}$ a function of the random vector $\tilde{\mathcal{V}}_{t}$. By generating $\tilde{\mathcal{V}}_{t}$ from $T$ IID samples of the uniform distribution, WIAE-GPF produces realizations of $\tilde{{\bm{X}}}_{t}$ following the same conditional distribution as ${\bm{X}}_{t+T}$. The formal definition of WIAE and its learning algorithm are presented in Sec. 2. The WIAE-GPF architecture and its validity are shown in Sec. 3.
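As a hedged illustration of this sampling mechanism, consider a toy Gaussian AR(1) model, for which a decoder driven by uniform innovations is available in closed form. The model, its parameters `A` and `SIGMA`, and the function names below are assumptions made for this sketch only; the WIAE decoder itself is a trained neural network, and in this toy model the conditional distribution depends on the history only through the latest observation.

```python
import random
from statistics import NormalDist, fmean, pvariance

A, SIGMA = 0.8, 1.0        # hypothetical AR(1) coefficient and noise scale
STD_NORMAL = NormalDist()  # standard normal, used for its inverse CDF


def decoder_step(x_prev, v):
    """Map one uniform innovation v in (0, 1) to the next value of the series."""
    return A * x_prev + SIGMA * STD_NORMAL.inv_cdf(v)


def generate_forecast(x_t, horizon, rng):
    """Draw one sample of X_{t+horizon} given X_t = x_t from `horizon` IID uniforms."""
    x = x_t
    for _ in range(horizon):
        v = rng.random()
        while v == 0.0:  # inv_cdf needs 0 < v < 1; random() may return 0.0
            v = rng.random()
        x = decoder_step(x, v)
    return x


rng = random.Random(1)
x_t, T = 2.0, 3
samples = [generate_forecast(x_t, T, rng) for _ in range(20000)]
est_mean, est_var = fmean(samples), pvariance(samples)

# Closed-form conditional moments of the toy model, for comparison.
true_mean = A**T * x_t
true_var = SIGMA**2 * (1 - A ** (2 * T)) / (1 - A**2)
```

In this sketch the empirical mean and variance of the generated samples can be compared against `true_mean` and `true_var`, which is only possible because the toy model has a known conditional distribution.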

Because practical implementations of WIAE are necessarily finite-dimensional, we establish a structural convergence property: the conditional distribution of the WIAE-GPF output converges to the conditional distribution of the time series. See Sec. 3.2 for details.

There have been some but limited applications of generative AI techniques in power system operations despite their accelerated advances in representing and learning time series models. Missing in particular are the validation and comparative studies using real-world market data. We fill this gap by comparing the WIAE-GPF with leading traditional and machine-learning algorithms for three applications: real-time LMP forecasting for energy markets, interregional LMP spread forecasting for interchange markets, and area control error (ACE) forecasting for regulation markets. Such comparisons are essential because these real-time market signals have characteristics not present in media signals such as video and natural language time series, where machine learning techniques have demonstrated success. Both LMP and ACE are highly dynamic time series with frequent spikes. Our comparison study offers a compelling case for WIAE-GPF across multiple performance measures for point and probabilistic forecasters.

The idea of WIAE-GPF was first presented in a preliminary version of this work [37], from which the current paper makes substantial new contributions. (Based on a Turnitin comparison, this paper exhibits less than 15% overall similarity and less than 4% similarity to the preliminary version.) In particular, the Bayesian sufficiency theorem in Sec. 3.1 is significantly stronger than that in [37]. Also new are a theorem (Theorem 1 in Sec. 3.1) that establishes formally the validity of WIAE-GPF and a theorem on the structural convergence (Theorem 2 in Sec. 3.2). We consider three specific real-time market applications, two of which were not considered in [37]. The numerical results for all three applications, as well as the analysis and discussions, are all new.

1.3 Organization and Notations

This paper is organized as follows. Sec. 2 defines a nonparametric time series model, its innovation representations, and the learning algorithm of WIAE. Sec. 3 develops WIAE-GPF, the proposed GPF algorithm. Sec. 4 presents the application of WIAE-GPF in three market operations and the comparison studies of major forecasting benchmarks.

The notations used in this paper are standard. Random variables are in capital letters and their realizations in lowercase. Boldface letters are typically used for vectors and matrices. We use $({\bm{X}}_{t})$ to denote a multivariate random time series, where column vector ${\bm{X}}_{t}=(X_{1t},\cdots,X_{dt})$ is the time series at time $t$, and $(X_{it})$ the $i$th sub-time series of $({\bm{X}}_{t})$. In this paper, ${\bm{X}}_{t_{1}:t_{2}}$ denotes the segment of $({\bm{X}}_{t})$ from $t_{1}$ to $t_{2}$, i.e., ${\bm{X}}_{t_{1}:t_{2}}=({\bm{X}}_{t_{1}},\cdots,{\bm{X}}_{t_{2}})$. For two random vectors ${\bm{X}}$ and ${\bf Y}$, ${\bm{X}}\stackrel{\text{a.s.}}{=}{\bf Y}$ means the two random vectors are equal almost surely, and ${\bm{X}}\stackrel{\text{d}}{=}{\bf Y}$ means the two are equal in distribution. An IID random sequence with marginal cumulative distribution function $F$ is denoted by ${\bm{X}}_{t}\stackrel{\text{IID}}{\sim}F$. Table 1 shows the major designated symbols used in the paper.

Table 1: Abbreviations and mathematical notations used in this paper.
GPF: Generative Probabilistic Forecasting.
WIAE: Weak Innovation AutoEncoder.
ACE: Area Control Error.
LLM: Large Language Model.
NMSE: Normalized Mean Square Error.
NMAE: Normalized Mean Absolute Error.
MASE: Mean Absolute Scaled Error.
sMAPE: Symmetric Mean Absolute Percentage Error.
CRPS: Continuous Ranked Probability Score.
CP: Coverage Probability.
CPE: Coverage Probability Error.
NCW: Normalized Coverage Width.
$({\bm{X}}_{t})$: The random process of predictive interest.
$({\bf V}_{t})$: The innovation sequence.
$({\bf U}_{t})$: An IID sequence of uniform distribution.
$(\hat{{\bm{X}}}_{t})$: The reconstruction sequence output by the WIAE decoder.
$\left(\hat{{\bf V}}_{t}^{(m)}\right)$: The weak innovation sequence estimated by an $m$-dimensional WIAE.
$\left(\hat{{\bm{X}}}_{t}^{(m)}\right)$: The reconstruction sequence estimated by an $m$-dimensional WIAE.
$({\bm{x}}_{t})$: A sequence of real numbers indicating the past realizations of $({\bm{X}}_{t})$.
$(\bm{\nu}_{t})$: A sequence of real numbers indicating the past realizations of $({\bf V}_{t})$.
$G$: WIAE encoder function.
$H$: WIAE decoder function.
$G_{\theta}$: A neural network approximation of $G$ parameterized by $\theta$.
$H_{\eta}$: A neural network approximation of $H$ parameterized by $\eta$.
$D_{\gamma}$: Innovation discriminator that measures the distance between $({\bf V}_{t})$ and $({\bf U}_{t})$.
$D_{\omega}$: Reconstruction discriminator that measures the distance between $({\bm{X}}_{0:t+T})$ and $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$.
${\cal U}[0,1]^{d}$: The continuous $d$-dimensional uniform distribution on $[0,1]^{d}$.

2 Innovation Representation Learning

2.1 Strong and Weak Innovation Representations

In 1958, Wiener and Kallianpur proposed an innovation representation of scalar time series [35]. In the parlance of modern machine learning, an innovation representation is a causal autoencoder shown in Fig. 2 with the latent process $({\bf V}_{t})$ being an IID-uniform innovation sequence. In particular, ${\bf V}_{t}$ represents the new information (innovation) contained in ${\bm{X}}_{t}$ independent of the past ${\bm{X}}_{0:t-1}=({\bm{X}}_{t-1},{\bm{X}}_{t-2},\cdots)$. Mathematically, the innovation representation of the time series is defined by causal mappings $(G,H)$ and $({\bf V}_{t})$:

\begin{align}
{\bf V}_{t} &= G({\bm{X}}_{t},{\bm{X}}_{t-1},\cdots), \tag{1.1}\\
({\bf V}_{t}) &\stackrel{\text{IID}}{\sim} {\cal U}[0,1]^{d}, \tag{1.2}\\
\hat{{\bm{X}}}_{t} &= H({\bf V}_{t},{\bf V}_{t-1},\cdots). \tag{1.3}
\end{align}
Figure 2: An autoencoder interpretation of innovation representations.

The Wiener-Kallianpur innovation autoencoder further requires that the decoder output $(\hat{{\bm{X}}}_{t})$ reconstructs the input $({\bm{X}}_{t})$ almost surely, i.e., $(\hat{{\bm{X}}}_{t})\stackrel{\text{a.s.}}{=}({\bm{X}}_{t})$, which makes it a strong innovation autoencoder. The perfect causal reconstruction implies that the innovation sequence $({\bf V}_{t})$ is a sufficient statistic for all decision-making based on $({\bm{X}}_{t})$. Therefore, using the IID-uniform $({\bf V}_{t})$ for decision-making incurs no performance loss.
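For intuition, a strong innovation representation can be written in closed form for a Gaussian AR(1) series: the encoder passes the one-step prediction residual through the standard normal CDF, and the decoder inverts that map recursively. The sketch below assumes this toy model (coefficient `A`, noise scale `SIGMA`, and a known initial value `x_init`, all hypothetical) and checks numerically that encoding followed by decoding reproduces the input.

```python
import random
from statistics import NormalDist

A, SIGMA = 0.8, 1.0        # hypothetical AR(1) coefficient and noise scale
STD_NORMAL = NormalDist()


def encode(xs, x_init=0.0):
    """Causal encoder G: v_t = Phi((x_t - A * x_{t-1}) / SIGMA)."""
    vs, prev = [], x_init
    for x in xs:
        vs.append(STD_NORMAL.cdf((x - A * prev) / SIGMA))
        prev = x
    return vs


def decode(vs, x_init=0.0):
    """Causal decoder H: rebuild x_t recursively from v_t, v_{t-1}, ..."""
    xs, prev = [], x_init
    for v in vs:
        prev = A * prev + SIGMA * STD_NORMAL.inv_cdf(v)
        xs.append(prev)
    return xs


# Simulate the toy series, then encode and decode it.
rng = random.Random(0)
xs, prev = [], 0.0
for _ in range(2000):
    prev = A * prev + SIGMA * rng.gauss(0.0, 1.0)
    xs.append(prev)

vs = encode(xs)        # latent sequence, IID uniform on (0, 1) for this model
xs_hat = decode(vs)    # reconstruction of the input
max_error = max(abs(x - xh) for x, xh in zip(xs, xs_hat))
```

Here the reconstruction error is limited only by floating-point precision, which is the sample-path equality that the strong representation demands.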

However, Rosenblatt showed that the Wiener-Kallianpur (strong) innovation representation does not exist for broad classes of random processes, including some of the widely used finite-state Markov chains [36]. Rosenblatt suggested a weak innovation representation, requiring that the autoencoder output $(\hat{{\bm{X}}}_{t})$ matches its input $({\bm{X}}_{t})$ only in distribution:

\begin{equation}
({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})\stackrel{\text{d}}{=}({\bm{X}}_{0:t+T}),\quad\forall t. \tag{2}
\end{equation}

Herein, we call the autoencoder $(G,H)$ for the weak innovation representation the Weak Innovation Autoencoder (WIAE).

2.2 Innovation Representation Learning

Beyond the Gaussian and additive Gaussian models, there is no known algorithm to obtain WIAE, especially when the underlying time series is nonparametric with an unknown probability structure. In [38], the authors proposed a GAN-based learning of the strong innovation representation by jointly minimizing the Wasserstein distance of the latent process from the uniform IID process and the mean squared error ($l_{2}$ distance) of the autoencoder output as the estimate of the input. However, strong innovation representation applies only to a restricted class of time series, and the joint optimization of the autoencoder with mixed Wasserstein and $l_{2}$ distance measures can be challenging. Finally, learning a scalar innovation representation limits the ability to incorporate multiple time series observations. The WIAE learning proposed below overcomes these shortcomings.

2.3 WIAE Learning

We present a deep learning approach to learn a WIAE for the weak innovation representation defined in (2). Shown in Fig. 3 is the schematic that highlights key components of the WIAE learning.

Figure 3: Training schematics of WIAE. Dashed lines indicate the flow of training information.

The encoder $G_{\theta}$ and decoder $H_{\eta}$ are causal convolutional neural networks parameterized by coefficients $\theta$ and $\eta$, respectively. The weak innovation representation, at its core, matches the input-output distributions and constrains the latent process $({\bf V}_{t})$ to be IID-uniform. To this end, we introduce two neural network discriminators, the innovation discriminator $D_{\gamma}$ and the reconstruction discriminator $D_{\omega}$ with parameters $\gamma$ and $\omega$, respectively, to enforce (1.2) and (2). In particular, the innovation discriminator $D_{\gamma}$ compares the distributions of $(\hat{{\bf V}}_{t})$ and $({\bf V}_{t})$, and the reconstruction discriminator $D_{\omega}$ compares the joint distributions of ${\bm{X}}_{0:t+T}$ and $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$ with sufficiently large $T$. These discriminators produce error signals to update the neural network parameters $(\theta,\eta,\gamma,\omega)$. In this work, we adopt the Wasserstein discriminator proposed in [39] to compute the Wasserstein distance between distributions.
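To give a feel for the quantity a Wasserstein discriminator estimates, the sketch below computes the Wasserstein-1 distance in closed form for one-dimensional empirical distributions of equal size, where it reduces to the mean absolute difference of the sorted samples. This closed form is a stand-in chosen for illustration, not the neural discriminator used in the paper; the two candidate latent sequences are hypothetical, and a marginal one-dimensional check like this says nothing about independence across time.

```python
import random
from statistics import NormalDist, fmean


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D empirical distributions."""
    if len(a) != len(b):
        raise ValueError("sample sets must have the same size")
    # In one dimension the optimal coupling matches sorted samples in order.
    return fmean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


rng = random.Random(0)
n = 5000
uniform_ref = [rng.random() for _ in range(n)]  # reference U[0, 1] samples

# Two hypothetical encoders applied to standard Gaussian noise z.
z = [rng.gauss(0.0, 1.0) for _ in range(n)]
phi = NormalDist().cdf
good_latent = [phi(x) for x in z]        # uniform on (0, 1) in distribution
bad_latent = [phi(2.0 * x) for x in z]   # mis-scaled: mass piles up near 0 and 1

d_good = wasserstein_1d(uniform_ref, good_latent)
d_bad = wasserstein_1d(uniform_ref, bad_latent)
```

A small `d_good` and a clearly larger `d_bad` mirror the kind of error signal that pushes an encoder toward a uniform latent distribution.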

Because the two discriminators are both based on the Wasserstein-distance measure, the parameters $(\theta,\eta,\omega,\gamma)$ of the four networks can be obtained via a single optimization:

\begin{multline}
L:=\min_{\theta,\eta}\max_{\gamma,\omega}\Big(\mathbb{E}[D_{\gamma}(({\bf U}_{t}))]-\mathbb{E}[D_{\gamma}((\hat{{\bf V}}_{t}))]\\
+\lambda\big(\mathbb{E}[D_{\omega}({\bm{X}}_{0:t+T})]-\mathbb{E}[D_{\omega}(({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T}))]\big)\Big), \tag{3}
\end{multline}

where $\lambda$ is a real number that scales the two Wasserstein distances. The two parts of the inner maximization of the loss function (3) regularize $(G_{\theta},H_{\eta})$ according to (1.2) and (2). Minimizing the inner-loop objective with respect to $\theta$ and $\eta$ is equivalent to enforcing that $({\bf V}_{t})$ is IID uniform and that $({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})$ has the same distribution as ${\bm{X}}_{0:t+T}$. The training of the four neural networks is standard; we used the off-the-shelf Adam optimizer.
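The link between the discriminator terms in (3) and Wasserstein distances can be spelled out through the standard Kantorovich-Rubinstein duality. The reading below is an idealized sketch: it assumes $\lambda>0$, discriminators restricted to 1-Lipschitz functions, and discriminator classes rich enough to attain the supremum, none of which is guaranteed exactly in a practical implementation. The symbols $P$, $Q$, and $W_1$ are introduced here only for this illustration.

```latex
% Kantorovich-Rubinstein duality for the Wasserstein-1 distance
% (D ranges over 1-Lipschitz functions):
\begin{equation*}
W_1(P,Q) \;=\; \sup_{\|D\|_{\mathrm{Lip}}\le 1}
  \Big( \mathbb{E}_{Y\sim P}[D(Y)] - \mathbb{E}_{Y\sim Q}[D(Y)] \Big).
\end{equation*}
% Idealized reading of (3), assuming \lambda > 0 and discriminators attaining the supremum:
\begin{equation*}
L \;=\; \min_{\theta,\eta}\Big(
  W_1\big(P_{({\bf U}_t)},\,P_{(\hat{{\bf V}}_t)}\big)
  + \lambda\, W_1\big(P_{{\bm{X}}_{0:t+T}},\,P_{({\bm{X}}_{0:t},\hat{{\bm{X}}}_{t+1:t+T})}\big)
\Big) \;\ge\; 0 .
\end{equation*}
```

Under these assumptions the minimum value zero is reached exactly when both distances vanish, that is, when the estimated latent sequence is IID uniform and the distribution matching condition (2) holds.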

In a practical implementation of WIAE, finite (input) dimensional neural networks are used. The training of a finite-dimensional WIAE via (3) must also be implemented by finite segments of the random processes involved. In Sec. 3.2, we consider the implications of such practical restrictions.

3 WIAE-GPF and its Properties

In this section, we introduce WIAE-GPF, a generative probabilistic forecasting technique based on the weak innovation representation. Specifically, given past observations $\bm{x}_{0:t}$, WIAE-GPF produces (arbitrarily many) samples of $\tilde{\bm{X}}_t$ that have the conditional distribution of $\bm{X}_{t+T}$. We present next the structure of WIAE-GPF, the Bayesian sufficiency of WIAE, and a structural convergence result for finite-dimensional implementations of WIAE.

3.1 Structure of WIAE-GPF

The structure of the proposed WIAE-GPF forecaster is shown in Fig. 1. At time $t$, given the realization $\bm{X}_{0:t}=\bm{x}_{0:t}$ and the autoencoder $(G_{\theta^*},H_{\eta^*})$ trained by (3), the WIAE encoder $G_{\theta^*}$ generates the innovation sequence $\bm{\nu}_{0:t}$ from $\bm{x}_{0:t}$.
The WIAE decoder $H_{\eta^*}$ maps $\bm{\nu}_{0:t}$ and independently generated IID-uniform pseudo innovations $\tilde{\mathcal{V}}_t \stackrel{\text{IID}}{\sim} \mathcal{U}[0,1]^T$ to produce a sample $\tilde{\bm{X}}_t=\tilde{\bm{x}}_t$.

Note that when forecasting $\bm{X}_{t+T}$, we do not have realizations of $\bm{X}_{t+1:t+T}$. The salient feature of WIAE-GPF is to replace samples from the unknown and arbitrarily distributed $\bm{X}_{t+1:t+T}$ by realizations of pseudo innovations $\tilde{\mathcal{V}}_t$ known to be IID-uniform. Thus, once the autoencoder is trained, generating random samples with the conditional distribution of $\bm{X}_{t+T}$ is straightforward.
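To make the sampling step concrete, the following is a minimal sketch on a toy Gaussian AR(1) process, $X_t = aX_{t-1} + W_t$ with standard normal $W_t$, for which an innovation autoencoder is available in closed form. This closed-form pair is an assumption that stands in for the trained $(G_{\theta^*},H_{\eta^*})$; the names `encode`, `decode_step`, `gpf_samples` and the coefficient `A = 0.8` are hypothetical. Because the toy process is Markov, the decoder here needs only the latest value rather than the full innovation history.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()  # standard normal distribution
A = 0.8                    # hypothetical AR(1) coefficient (assumption)
EPS = 1e-12                # keeps uniform draws strictly inside (0, 1)


def encode(x_prev, x_curr):
    """Toy encoder: uniform innovation of x_curr given x_prev."""
    return STD_NORMAL.cdf(x_curr - A * x_prev)


def decode_step(x_prev, v):
    """Toy decoder: next value from the previous value and a uniform innovation."""
    v = min(max(v, EPS), 1.0 - EPS)
    return A * x_prev + STD_NORMAL.inv_cdf(v)


def gpf_samples(x_t, horizon, num_samples, rng):
    """Samples of X_{t+T} given X_t = x_t, driven by uniform pseudo innovations."""
    samples = []
    for _ in range(num_samples):
        x = x_t
        for _ in range(horizon):
            x = decode_step(x, rng.random())  # pseudo innovation ~ U[0, 1)
        samples.append(x)
    return samples


rng = random.Random(0)
forecast_samples = gpf_samples(x_t=2.0, horizon=3, num_samples=20000, rng=rng)
sample_mean = fmean(forecast_samples)  # should be near A**3 * 2.0 for this toy model
```

For this toy model the conditional mean of $X_{t+T}$ given $X_t = x_t$ is $a^T x_t$, which gives a simple sanity check on the generated samples.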

We now establish the validity of WIAE-GPF by showing that the WIAE-GPF output $\tilde{\bm{X}}_t$ has the same conditional distribution as $\bm{X}_{t+T}$ given $\bm{X}_{0:t}=\bm{x}_{0:t}$. This is not obvious because the input of $H_{\eta^*}$ includes an exogenous random vector $\tilde{\mathcal{V}}_t$ and the weak innovation $(\mathbf{V}_{0:t})$ that may not be a sufficient statistic.

We first show that the weak innovation sequence is Bayesian sufficient, which implies that any stochastic decision involving the future time series $\bm{X}_{t+T}$ can be made without loss based on the innovations $\mathbf{V}_{0:t}$. (A statistic $T(X)$ is Bayesian sufficient for the estimation of a random variable $Y$ if the posterior distribution of $Y$ given $X$ is the same as the one given $T(X)$ [40].) The same result was first presented in [37] under the more restrictive setting of $H_{\eta^*}$ being injective.

Lemma 1 (Bayesian Sufficiency of Multivariate Weak Innovations)

Let $(\bm{X}_t)$ be a stationary time series for which the weak innovation representation exists. Let $(\mathbf{V}_t)$ be the weak innovation representation of $(\bm{X}_t)$. Then, for all $\bm{x}$ and $\bm{X}_{0:t}=\bm{x}_{0:t}$,

$$\Pr[\bm{X}_{t+T}\leq\bm{x}\mid\bm{X}_{0:t}=\bm{x}_{0:t}] = \Pr[\hat{\bm{X}}_{t+T}\leq\bm{x}\mid\mathbf{V}_{0:t}=\bm{\nu}_{0:t}]. \tag{4}$$

Proof: By the definition of weak innovation representation,

$$\begin{aligned}
\Pr[\bm{X}_{t+T}\leq\bm{x}\mid\bm{X}_{0:t}=\bm{x}_{0:t}] &= \Pr[\hat{\bm{X}}_{t+T}\leq\bm{x}\mid\bm{X}_{0:t}=\bm{x}_{0:t}] \\
&\stackrel{(a)}{=} \Pr[\hat{\bm{X}}_{t+T}\leq\bm{x}\mid G_{\theta^*}(\bm{X}_{0:t})=G_{\theta^*}(\bm{x}_{0:t})] \\
&= \Pr[\hat{\bm{X}}_{t+T}\leq\bm{x}\mid\mathbf{V}_{0:t}=\bm{\nu}_{0:t}],
\end{aligned} \tag{5}$$

where $(a)$ follows from the Markovian structure of the autoencoder, i.e., $\bm{X}_{0:t}\stackrel{G_{\theta^*}}{\rightarrow}\hat{\mathbf{V}}_{0:t}\stackrel{H_{\eta^*}}{\rightarrow}\hat{\bm{X}}_{0:t}$. By the definition of Bayesian sufficiency [40], $\mathbf{V}_{0:t}=G_{\theta^*}(\bm{X}_{0:t})$ is a sufficient statistic for $\bm{X}_{t+T}$ for all $T>0$. $\square$

The validity of WIAE-GPF is shown next.

Theorem 1 (Validity of WIAE-GPF)

For all $T>0$, the conditional distribution of the WIAE-GPF output $\tilde{\bm{X}}_t$ given $\bm{X}_{0:t}=\bm{x}_{0:t}$ is identical to that of $\bm{X}_{t+T}$ (given $\bm{X}_{0:t}=\bm{x}_{0:t}$), i.e.,

$$\Pr[\bm{X}_{t+T}\leq\bm{x}\mid\bm{X}_{0:t}=\bm{x}_{0:t}] = \Pr[\tilde{\bm{X}}_t\leq\bm{x}\mid\bm{X}_{0:t}=\bm{x}_{0:t}]. \tag{6}$$

Proof: By Lemma 1,

$$\Pr[\bm{X}_{t+T}\leq\bm{x}\mid\bm{X}_{0:t}=\bm{x}_{0:t}] = \Pr[\hat{\bm{X}}_{t+T}\leq\bm{x}\mid\mathbf{V}_{0:t}=\bm{\nu}_{0:t}], \tag{7}$$

where $\mathbf{V}_{0:t}=G_{\theta^*}(\bm{X}_{0:t})$ and $\bm{\nu}_{0:t}=G_{\theta^*}(\bm{x}_{0:t})$. Now consider

$$\begin{cases}
\hat{\bm{X}}_{t+T} = H_{\eta^*}(\mathbf{V}_{0:t},\mathbf{V}_{t+1:t+T}) \\
\tilde{\bm{X}}_t = H_{\eta^*}(\mathbf{V}_{0:t},\tilde{\mathcal{V}}_t),
\end{cases}$$

where, by definition, $(\mathbf{V}_{0:t},\mathbf{V}_{t+1:t+T},\tilde{\mathcal{V}}_t)$ are jointly independent IID-uniform sequences, and $\tilde{\mathcal{V}}_t\stackrel{d}{=}\mathbf{V}_{t+1:t+T}$. Therefore,

$$\Pr[\hat{\bm{X}}_{t+T}\leq\bm{x}\mid\mathbf{V}_{0:t}=\bm{\nu}_{0:t}] = \Pr[\tilde{\bm{X}}_t\leq\bm{x}\mid\mathbf{V}_{0:t}=\bm{\nu}_{0:t}]. \tag{8}$$

Combining (7) and (8), we have (6). $\square$

3.2 Structural Convergence of WIAE-GPF

This section focuses on the practical issue of finite-dimensional implementations of WIAE and discriminators in Fig. 3. No machine learning technique guarantees that a finite-dimensional implementation can extract weak innovations, even if the number of historical samples available is unbounded. Here we present a structural convergence result showing that, under ideal training conditions with unbounded training samples and global convergence of training, the innovations generated from a finite-dimensional WIAE converge in distribution to the true weak innovations.

The structural convergence analysis assumes that the training samples are unbounded and that the training algorithm converges to the global minimum. Let $G_{\theta^*}^{(m)}$ be the optimally trained finite (input) dimensional CNN encoder that takes $m$ consecutive time-shifted observations $\bm{X}_{t-m+1:t}$ and produces the latent process $\big(\hat{\mathbf{V}}^{(m)}_t\big)$. Likewise, let $H^{(m)}_{\eta^*}$ be the optimally trained $m$-dimensional CNN decoder that produces the WIAE output sequence $\big(\hat{\bm{X}}^{(m)}_t\big)$. The finite-dimensional discriminators that take $n$ consecutive inputs, denoted by $(D_{\omega}^{(n)},D_{\gamma}^{(n)})$, are defined similarly. In this paper, we choose $n=m$.

To analyze the asymptotic property of finite (input) dimensional WIAE-GPF, we make the following assumptions:

  A1. Existence: The random process $(\bm{X}_t)$ has a weak innovation representation defined in (1.1)-(1.3) and (2), and there exists a causal WIAE with continuous $G$ and $H$.

  A2. Feasibility: There exists a sequence of finite-input-dimension autoencoder functions $(G_{\bar{\theta}}^{(m)},H_{\bar{\eta}}^{(m)})$ that converges uniformly to $(G,H)$ under the mean-squared distance metric.

  A3. Training: The training sample sizes are infinite. The training algorithm for every finite-dimensional WIAE using finite-dimensional training samples converges almost surely to the global optimum.

Theorem 2

Under (A1-A3),

$$(\bm{X}_{0:t},\hat{\bm{X}}_{t+T}^{(m)}) \stackrel{d}{\rightarrow} (\bm{X}_{0:t},\bm{X}_{t+T}),\quad \forall t \tag{9}$$

as $m$ goes to infinity.

Proof: See Appendix A.

3.3 From GPF to Point and Quantile Forecasting

GPF produces samples of the conditional probability distribution, from which point and quantile forecasts can be easily computed. Here we outline how to compute several popular point and quantile forecasts. To this end, let $\{\tilde{\bm{x}}_t^{(k)}, k=1,\cdots,K\}$ be the set of GPF-generated samples from the probability distribution of the time series at time $t+T$ conditioned on past observations up to time $t$. To simplify the mathematical expressions, we assume that $\{\tilde{\bm{x}}_t^{(k)}\}$ is sorted in ascending order.

  1. Minimum Mean Squared Error (MMSE) Forecast: The MMSE forecast is the mean of the conditional distribution. The MMSE forecast $\hat{\bm{x}}^{\text{MMSE}}_t$ by a GPF is given by the conditional sample mean

     $$\hat{\bm{x}}^{\text{MMSE}}_t = \frac{1}{K}\sum_{k=1}^{K}\tilde{\bm{x}}_t^{(k)}.$$

  2. Minimum Mean Absolute Error (MMAE) Forecast: The MMAE forecast is the median of the conditional distribution. The MMAE forecast $\hat{\bm{x}}^{\text{MMAE}}_t$ by a GPF is given by the conditional sample median

     $$\hat{\bm{x}}^{\text{MMAE}}_t = \begin{cases}
     \tilde{\bm{x}}_t^{((K+1)/2)}, & \text{if $K$ is odd} \\
     0.5\left(\tilde{\bm{x}}_t^{(K/2)} + \tilde{\bm{x}}_t^{(K/2+1)}\right), & \text{if $K$ is even}.
     \end{cases}$$

  3. Quantile Forecast: The GPF forecast of the $q$-quantile $\hat{\bm{x}}^{q\text{-quantile}}_t$ is given by

     $$\hat{\bm{x}}^{q\text{-quantile}}_t = \begin{cases}
     \tilde{\bm{x}}_t^{(qK)}, & \text{if $qK$ is an integer} \\
     0.5\left(\tilde{\bm{x}}_t^{([qK])} + \tilde{\bm{x}}_t^{([qK]+1)}\right), & \text{otherwise},
     \end{cases}$$

     where $[a]$ indicates the greatest integer not exceeding $a$.
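The three rules listed above can be sketched in a few lines for a scalar time series. The function names are hypothetical, and the treatment of $q$ values for which $[qK]$ falls outside $1,\ldots,K-1$ (raising an error) is an assumption, since the text does not specify it. For a vector-valued series, one plausible reading is to apply the same rules to each component.

```python
import math
from statistics import fmean


def mmse_forecast(samples):
    """Conditional sample mean."""
    return fmean(samples)


def mmae_forecast(samples):
    """Conditional sample median, using 1-based order statistics as in the text."""
    s = sorted(samples)
    k = len(s)
    if k % 2 == 1:
        return s[(k + 1) // 2 - 1]
    return 0.5 * (s[k // 2 - 1] + s[k // 2])


def quantile_forecast(samples, q):
    """q-quantile rule from the text; [a] is the floor of a."""
    s = sorted(samples)
    k = len(s)
    qk = q * k
    nearest = round(qk)
    if math.isclose(qk, nearest, abs_tol=1e-9) and 1 <= nearest <= k:
        return s[nearest - 1]  # qK is an integer
    lower = math.floor(qk)
    if lower < 1 or lower >= k:
        raise ValueError("q is too close to 0 or 1 for this sample size")
    return 0.5 * (s[lower - 1] + s[lower])


# Hypothetical GPF draws for one forecast time (unsorted on purpose).
draws = [7.0, 1.0, 4.0, 10.0, 3.0, 6.0, 9.0, 2.0, 8.0, 5.0]
point_mean = mmse_forecast(draws)
point_median = mmae_forecast(draws)
q90 = quantile_forecast(draws, 0.9)
```

The functions sort their input, so the draws need not be pre-sorted as assumed in the formulas.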

4 WIAE-GPF for Market Operations

We now apply WIAE-GPF to forecasting market signals such as locational marginal prices and market imbalances. At the outset, we recognize that the underlying random processes are not known to be stationary, whereas WIAE-GPF is derived based on a representation of stationary processes. Here, we rely on the hypothesis that these processes are approximately stationary locally within the forecasting horizon. Our evaluations based on real market data, presented here, lend some support to this hypothesis. Brief discussions on the limitations and possible extensions can be found in Sec. 5.

We conducted extensive experiments to compare leading GPF and point forecasting techniques based on a suite of performance metrics. This section summarizes our findings for three market applications where GPF is particularly valuable to system operators and market participants: (a) LMP forecasting for optimal bidding in NYISO’s 5-minute real-time energy market, (b) interregional LMP spread forecasting for the Coordinated Transaction Scheduling (CTS) [41] market between NYISO and PJM, and (c) ACE forecasting for regulation services using PJM’s 15-second ACE data. Common to these applications is that the forecasted variables are endogenously determined by the market operations. In contrast to exogenous variables such as wind/solar generation and inelastic demand, the LMP and ACE values are the results of dispatch and commitment optimization, where binding constraints introduce spikes in the dual variables from which LMPs are computed. They are highly dynamic, as shown in Fig. 4.

Figure 4: Real-time and day-ahead LMPs and load at Long Island, and real-time LMP at NYC, July 2023.

4.1 Baseline Methods in Comparison

Table 2: Comparison of the baselines.

| Algorithm | Forecasting Type | Time Series Model | Forecaster Output | ML Models |
|---|---|---|---|---|
| SNARX [4] | Probabilistic | Semiparametric AR | AR Model Parameters | Kernel Estimation |
| WIAE-GPF | Probabilistic | Nonparametric | Generative | CNN + WIAE |
| TLAE [21] | Probabilistic | Parametric | Generative | RNN + VAE |
| DeepVAR [11] | Probabilistic | Parametric (AR Model) | Model Parameters | LSTM |
| LQRA [16] | Probabilistic | Nonparametric | Forecasted Quantiles | Ensemble method + Lasso regularization |
| BWGVT [19] | Probabilistic | Nonparametric | Forecasted Quantiles | LLM + Quantile Regression |
| Pyraformer [34] | Point | Nonparametric | Point Estimate | LLM |
| Informer [33] | Point | Nonparametric | Point Estimate | LLM |

We compared WIAE-GPF with seven leading forecasters chosen for their relevance to power system applications and their established reputations. See Table 2 for attributes of these techniques and references. WIAE-GPF is the only nonparametric GPF technique. Because few nonparametric GPF techniques are available, we also included in our comparison popular machine-learning-based parameterized GPF and point-forecasting techniques.

We started with SNARX [4], a classical parametric forecasting technique based on an auto-regressive moving-average (ARMA) model. An early study showed that SNARX performed the best among a list of parametric techniques [4]. For deep-learning techniques, we compared with DeepVAR [11], a multivariate generalization of the popular DeepAR that has become a key baseline for time series forecasting in multiple applications. Temporal Latent AutoEncoder (TLAE) [21] is an autoencoder-based parametric GPF where the conditional distribution of the future time series variables is obtained through a transformed Gaussian random process with mean and variance as parameters. Once these parameters are estimated from observed realizations, Gaussian Monte Carlo samples are fed into the decoder to generate samples of the forecasted random variable. LQRA [16] builds on quantile regression averaging by using LASSO regularization for quantile regression. It computes conditional quantiles by linearly transforming a collection of point forecasts, which are obtained by fitting an expert model. The expert model was originally proposed for day-ahead price prediction, so we adapted it by adding the real-time price for the past 5 hours to the linear autoregression.

We also included three popular forecasting techniques based on LLMs. The first is the award-winning technique Informer [33]. The second is Pyraformer [34], which captures temporal dependency at multiple granularities. (Pyraformer is the keynote presentation at The International Conference on Learning Representations (ICLR) 2022.) Pyraformer showed superior performance over a wide range of LLM-based point forecasting techniques. The third, BWGVT [19], was developed specifically for LMP forecasting and combines quantile regression with a transformer architecture derived for LLMs. (We have named the algorithm by the first letters of the authors’ last names.)

4.2 Evaluation Metrics

Comparing probabilistic forecasting methods is difficult due to the lack of ground truth for the underlying conditional distribution. However, because a GPF can produce arbitrarily many Monte Carlo samples, it can be evaluated by all point forecasting metrics. (We used $1000$ Monte Carlo samples and the sample average to obtain point estimates.) More importantly, an ideal probabilistic forecaster that produces the correct conditional distribution will perform well under any point-forecasting metric under regularity conditions. Therefore, evaluating a GPF method based on a set of point-forecasting metrics is a credible way to assess its performance.

We used four popular point forecasting metrics and three widely used probabilistic forecasting metrics; see Appendix B for their mathematical definitions. Normalized mean squared error (NMSE) measures the error associated with the mean of the estimated conditional probability distribution. Normalized mean absolute error (NMAE) measures the error associated with the median. Mean absolute scaled error (MASE) is the ratio of the NMAE of a method to that of the (naive) persistence predictor, which uses the latest available observation as the forecast. Symmetric mean absolute percentage error (sMAPE) averages and symmetrizes the percentage error computed at each time stamp and is less sensitive to outliers. The probabilistic forecasting metrics were the continuous ranked probability score (CRPS) [42], coverage probability error (CPE), and normalized coverage width (NCW). CRPS evaluates the quadratic difference between the predicted empirical cumulative distribution function (c.d.f.) and an indicator c.d.f. based on the ground truth. CPE and NCW are often used to evaluate prediction intervals. CPE is the deviation of the coverage probability (CP) from the nominal confidence level $\beta\%$, whereas NCW represents the width of the prediction intervals. At a similar level of CP, the method with the smaller NCW is more accurate in prediction interval estimation. In this paper, we computed the CPE and NCW of the 10%, 50%, and 90% intervals predicted by each probabilistic method.
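
As an illustration of the sample-based evaluation described above, the following minimal Python sketch computes the CRPS of a set of Monte Carlo samples against one realized value and the MASE against a persistence forecast. It is a sketch under stated assumptions rather than the evaluation code behind the reported results: the function names `crps_from_samples` and `mase` are hypothetical, the CRPS uses the standard sample-based identity for the squared distance between the empirical c.d.f. and the indicator c.d.f., and the exact normalizations of the appendix are not reproduced (any common normalization cancels in the MASE ratio).

```python
import numpy as np


def crps_from_samples(samples, y):
    """CRPS of the empirical c.d.f. of `samples` against the scalar outcome `y`.

    Uses the identity CRPS = E|X - y| - 0.5 * E|X - X'|, where X and X' are
    independent draws from the empirical distribution of the samples.
    """
    s = np.asarray(samples, dtype=float)
    term_obs = np.mean(np.abs(s - y))
    term_spread = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
    return term_obs - term_spread


def mase(y_true, y_pred, y_persist):
    """Ratio of the mean absolute error of a forecast to that of persistence.

    Assumption: `y_persist[i]` is the latest observation that was available
    when the forecast of `y_true[i]` was issued.
    """
    y_true = np.asarray(y_true, dtype=float)
    mae_model = np.mean(np.abs(y_true - np.asarray(y_pred, dtype=float)))
    mae_naive = np.mean(np.abs(y_true - np.asarray(y_persist, dtype=float)))
    return mae_model / mae_naive


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical Monte Carlo samples of a future price (illustrative values).
    samples = rng.normal(loc=30.0, scale=5.0, size=1000)
    print(crps_from_samples(samples, y=32.0))
    print(mase([31.0, 29.0, 35.0], [30.0, 30.0, 33.0], [28.0, 31.0, 29.0]))
```

A sample set that is both sharp and well centered yields a small CRPS, and a MASE below one means that the forecast beats the persistence predictor.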

For probabilistic forecasting techniques, we used their conditional means as the point forecasts when evaluated by NMSE, whereas the conditional median was used for NMAE, MASE, and sMAPE. For the quantile regression technique, BWGVT, we used the estimated 0.5-quantile as its point forecast for all metrics, since it is unclear how to compute the empirical mean from quantiles. When producing interval forecasts for GPF methods, we took the empirical quantiles of the Monte Carlo forecasts of future values. In particular, the $\beta$-coverage interval was defined as the $\beta$-width interval symmetric around the sample median.
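
The interval construction above can be sketched in a few lines of Python. The sketch is illustrative only: the function names are hypothetical, the array layout (one row of Monte Carlo samples per forecast time) is an assumption, and the width is returned unnormalized because the normalization used for NCW is defined in the appendix and is not reproduced here. The sign convention is coverage minus nominal level, so a positive CPE indicates intervals that cover more than the nominal percentage.

```python
import numpy as np


def central_interval(samples, beta):
    """Interval covering the central `beta` fraction of the samples.

    `samples` has shape (T, N): N Monte Carlo samples for each of T forecast
    times (assumed layout). The interval is symmetric in probability around
    the sample median.
    """
    samples = np.asarray(samples, dtype=float)
    lower, upper = np.quantile(samples, [0.5 - beta / 2, 0.5 + beta / 2], axis=-1)
    return lower, upper


def coverage_probability_error(samples, y_true, beta):
    """Empirical coverage of the beta-interval minus the nominal level beta."""
    lower, upper = central_interval(samples, beta)
    y_true = np.asarray(y_true, dtype=float)
    covered = (y_true >= lower) & (y_true <= upper)
    return covered.mean() - beta


def mean_interval_width(samples, beta):
    """Average (unnormalized) width of the beta-interval."""
    lower, upper = central_interval(samples, beta)
    return np.mean(upper - lower)
```

Calling `coverage_probability_error` with `beta` set to 0.1, 0.5, and 0.9 gives the three interval levels reported in the tables.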

4.3 LMP forecasting for Energy Market Participation

Table 3: Point evaluation of forecasting results for real-time price forecasting at Long Island. The numbers in parentheses are the rankings of the algorithms. The columns under the label LONGIL are the GPF performance of the 12-step forecasting of the LMP at LONGIL. The columns under LONGIL & NYC are the performance of LMP forecasting at LONGIL using both LONGIL and NYC observations.
Methods          NMSE                        NMAE                        MASE                        sMAPE
                 LONGIL        LONGIL & NYC  LONGIL        LONGIL & NYC  LONGIL        LONGIL & NYC  LONGIL        LONGIL & NYC
SNARX [4]        (8) 0.9852    (7) 0.5029    (8) 0.9733    (8) 0.7318    (8) 0.7053    (8) 0.4548    (7) 0.4319    (7) 0.4162
WIAE-GPF         (1) 0.0585    (2) 0.0487    (1) 0.2074    (1) 0.1186    (1) 0.1503    (1) 0.0737    (1) 0.0839    (1) 0.0316
TLAE [21]        (4) 0.2956    (1) 0.0232    (6) 0.4186    (2) 0.1366    (6) 0.3034    (2) 0.0849    (3) 0.2720    (2) 0.3190
DeepVAR [11]     (6) 0.3919    (6) 0.4060    (5) 0.4088    (7) 0.4097    (5) 0.2963    (7) 0.2546    (5) 0.3629    (3) 0.3703
LQRA [16]        (7) 0.8287    (8) 0.7110    (7) 0.8472    (4) 0.2693    (7) 0.6139    (4) 0.1623    (8) 0.4952    (8) 0.4879
BWGVT [19]       (3) 0.2670    (5) 0.2528    (4) 0.3158    (6) 0.3280    (4) 0.2289    (6) 0.2038    (4) 0.2817    (6) 0.3900
Pyraformer [34]  (5) 0.3128    (3) 0.1382    (2) 0.3074    (3) 0.2556    (2) 0.2228    (3) 0.1588    (2) 0.1059    (4) 0.3765
Informer [33]    (2) 0.1912    (4) 0.1829    (3) 0.3147    (5) 0.2830    (3) 0.2281    (5) 0.1759    (6) 0.3761    (5) 0.3827
Table 4: Evaluation of probabilistic forecasting results of real-time prices at Long Island. The numbers in parentheses are the rankings of the algorithms; the numbers inside the square brackets are the corresponding NCW.
Methods          CRPS                        CPE (90%) [NCW]                             CPE (50%) [NCW]                             CPE (10%) [NCW]
                 LONGIL        LONGIL & NYC  LONGIL                LONGIL & NYC          LONGIL                LONGIL & NYC          LONGIL                LONGIL & NYC
SNARX [4]        (3) 18.2864   (3) 10.5286   (5) -0.2263 [0.6620]  (6) -0.3779 [0.7990]  (5) -0.2451 [0.5574]  (2) -0.0746 [0.9161]  (3) -0.0641 [0.8145]  (6)  0.1870 [0.8082]
WIAE-GPF         (1)  4.6029   (1)  1.1519   (1)  0.0245 [0.9936]  (1)  0.0157 [0.9958]  (1)  0.0216 [0.9681]  (1)  0.0184 [0.8742]  (1)  0.0242 [0.9836]  (1) -0.0190 [0.8207]
TLAE [21]        (2)  6.1874   (2)  2.8449   (2)  0.0325 [0.8179]  (2)  0.0592 [0.8176]  (2)  0.0423 [0.6357]  (3)  0.1286 [0.7637]  (5) -0.0848 [0.6704]  (4) -0.0834 [0.6516]
DeepVAR [11]     (4) 23.1460   (5) 22.9343   (4) -0.0575 [0.9738]  (3) -0.0761 [0.8467]  (4) -0.1637 [0.5709]  (5) -0.1488 [0.6614]  (2) -0.0616 [0.5889]  (3) -0.0431 [0.4260]
LQRA [16]        (6) 48.9460   (4) 17.0583   (6) -0.2601 [0.7130]  (5) -0.0958 [0.8434]  (3) -0.0465 [0.6534]  (4)  0.1364 [0.9082]  (4)  0.0648 [0.8494]  (5)  0.1529 [0.8827]
BWGVT [19]       (5) 24.2595   (6) 24.3065   (3)  0.0999 [7.3500]  (4)  0.0829 [3.4864]  (6)  0.4815 [2.7822]  (6)  0.4322 [3.0100]  (6)  0.1034 [7.2325]  (2)  0.0311 [1.1181]
Figure 5: Trajectories of the real-time price at LONGIL and its predictions generated by selected methods: (a) WIAE-GPF, (b) Pyraformer, (c) SNARX, (d) DeepVAR.

For a self-scheduled resource submitting a quantity bid to the energy market, the ability to forecast future prices is essential in constructing its bids and offers. With GPF generating future LMP realizations, the problem of optimal offer/bid strategies can be formulated as scenario-based stochastic optimization [43]. Our experiment was based on a use case of a merchant storage owner submitting quantity offers and bids to a deregulated wholesale market, using LMP from NYISO as the hypothetical price realizations.

The real-time market of NYISO closes sixty minutes ahead of actual delivery, which means that the forecasting horizon needs to be longer than 60 minutes. Two experiments were conducted to produce probabilistic forecasts of 60-minute-ahead LMPs at Long Island (LONGIL) using (a) the day-ahead prices and the current and past real-time LMPs at LONGIL, along with the system load up to the time of submitting the bid; and (b) the neighboring NYC real-time LMP in addition to the data in (a).

Fig. 4 shows the real-time LMP trajectories at both LONGIL and NYC along with the demand and the day-ahead LMP at LONGIL. The real-time LMPs at LONGIL and NYC showed apparent spatial dependencies, while the dependencies between day-ahead and real-time LMPs at LONGIL and between load and real-time LMP were less obvious. The real-time LMPs and load were collected every 5 minutes, and the day-ahead LMPs every hour. We used the first 25 days of July for training and validation and the last 6 days of July for evaluation.

Test results are shown in Tables 3 and 4, with the entries ranked (1) being the best performance. We observed that WIAE-GPF performed the best in all cases except for the forecasting of LONGIL LMP using both LONGIL and NYC data, for which WIAE-GPF was a close second under NMSE. The strong performance of WIAE-GPF stems from its aim of matching the conditional distribution, with a validity guarantee that ensures that the Monte Carlo samples generated have the same conditional distribution as that of the actual time series variable. Overall, the second-best was TLAE; the VAE-trained RNN autoencoder achieved noticeable gain by combining data from two locations, especially for the joint forecasting at LONGIL & NYC. However, under the probabilistic forecasting measure of CRPS, the score of TLAE was about 34% (LONGIL) to 147% (LONGIL & NYC) higher than that of WIAE-GPF. Because sampling the correlated Gaussian latent process is nontrivial, TLAE used a re-parameterization heuristic that could cause an accumulation of biases.

The three LLM-based forecasters (BWGVT, Pyraformer, and Informer) performed roughly the same: worse than WIAE-GPF and TLAE but better than SNARX and DeepVAR. Note that Pyraformer and Informer are both point forecasters trained to minimize the mean squared forecasting error, and they did not perform well under NMAE, MASE, and sMAPE. BWGVT was adapted to be a probabilistic forecaster, although it did not perform well under CRPS. Its quantile prediction was outlier-sensitive, especially when compared with the GPF methods that depend on a stochastic latent process. We also observed that BWGVT predicted wider intervals, as shown by its high NCW in all cases. Consequently, its CPEs were always positive, indicating that the prediction intervals it generated covered more than the nominal percentages. Hence, both its point estimation results and its CRPS were worse than those of TLAE and WIAE-GPF.

The other quantile-regression-based method, LQRA, performed worst among nonparametric techniques. Originally proposed for hourly price forecasting, LQRA assumes a simple regression model to produce point forecasts, with nonlinearity introduced only by the one-day-ahead minimum and maximum day-ahead prices. Although this model might be sufficient for day-ahead price forecasting, its mismatch when applied to real-time price forecasting intensifies the inaccuracy.

The two parametric techniques, DeepVAR and SNARX, being significantly simpler than the LLM forecasters, performed slightly worse. Neither technique performed well under point estimation metrics. As shown by the CPEs in Table 4, their prediction interval estimation was also worse than that of the other probabilistic forecasting methods. Their NCW showed that they predicted narrow intervals that failed to cover the majority of the ground truths. The inaccurate semiparametric-AR model assumption appears to be culpable: their performance suffered from model mismatch, since ARMA models typically expect smoother trajectories. For the real-time LMP, which exhibits high volatility, these methods were slow to catch the rapid changes and predicted shifted peaks that resulted in large forecasting errors. This phenomenon was corroborated by their high sMAPE, indicating higher percentage error at each time stamp.

It is interesting to observe that using LMPs at neighboring locations improved the forecasting performance of all methods except DeepVAR and BWGVT, as shown by the LONGIL and LONGIL & NYC columns of the DeepVAR and BWGVT rows in Table 3. This implies that these two methods did not account for spatial correlations optimally. We attribute this to the difficulty of training transformer and LSTM models.

To gain insights into the performance of WIAE-GPF and other benchmark techniques, we plotted the ground truth trajectories (black) and trajectory forecasts generated by WIAE-GPF (red) and a competing algorithm (blue) in Fig. 5. Note that the spikes were not predicted by any method. This was not surprising given the nature of how these spikes were produced. Aside from these spikes, the figures show clearly that WIAE-GPF (red) tracked the ground truth (black) the closest, which was supported by the fact that WIAE-GPF had the smallest NMAE. We also observed that WIAE-GPF had the smallest variation, which was supported by the fact that WIAE-GPF had the smallest NMSE. Furthermore, WIAE-GPF was the least affected by the price spikes. This was because, as a GPF method, the Monte Carlo samples used to produce the MMSE point estimate were less likely to include extreme samples. For SNARX, however, the spikes caused significant deviations in the forecasted trajectory, as shown in Fig. 5(c).

4.4 Interregional LMP spread for Interchange Markets

The interchange market aims to improve overall economic efficiency across ISOs by allowing virtual bidders to arbitrage price differences at proxy buses of two neighboring ISOs. This experiment was based on the use case of a virtual bidder bidding into the CTS market between NYISO and PJM. The proxy buses of this market were Sandy Point of NYISO and Neptune of PJM.

The CTS market closes 75 minutes ahead of delivery and is cleared every 15 minutes. A virtual bidder submits a price-quantity bid along with the direction of the virtual trade from the source of the proxy with low LMP to the destination proxy with high LMP. Once the market is cleared, the settlement is based on the actual LMP spread between the two proxies and the cleared quantity. The bidder profits if the virtual trade direction matches the direction of the real-time LMP spread. Otherwise, the bidder incurs a loss. Therefore, the ability to predict the LMP spread direction is especially important.

We performed 75-minute-ahead LMP spread forecasting using the interface power flow and LMP spread data between NYISO and PJM at the Neptune proxy, collected in February 2024. The interface power flow samples were collected every 5 minutes, and the LMP spread every 15 minutes. We used the first 24 days for training and validation, and the last 5 days of February for testing.

We added the prediction error rate (PER) as a measure of the accuracy of the virtual trading direction prediction, given that the sign of the spread is of great importance to profitability. PER is the percentage of forecasts that do not have the same direction as the ground truth. For point forecasts, we compared the signs of the forecasts with the signs of the ground truth. For probabilistic forecasting, we compared the direction of the ground truth with that of the minimum error-probability prediction of the LMP spread, which is the sign of the conditional median. For GPF, we compared the sign of the sample median with the sign of the ground truth.
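
The PER computation described above can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming one row of generated spread samples per forecast time; a forecast of exactly zero is treated as having its own sign and therefore counts as an error unless the ground truth is also zero, which is an assumption of this sketch and not a convention stated in the text.

```python
import numpy as np


def prediction_error_rate(y_true, y_pred):
    """Fraction of forecasts whose sign differs from that of the ground truth."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.sign(y_pred) != np.sign(y_true))


def per_from_samples(y_true, samples):
    """PER of a generative forecaster, using the sign of the sample median.

    `samples` has shape (T, N): N generated spread samples for each of T
    forecast times (assumed layout).
    """
    median_forecast = np.median(np.asarray(samples, dtype=float), axis=-1)
    return prediction_error_rate(y_true, median_forecast)
```

For a point forecaster, `prediction_error_rate` is applied directly to its forecasts; for a generative forecaster, `per_from_samples` first reduces each set of samples to its median.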

Table 5: Evaluation of forecasting results for spread forecasting between NYISO and PJM.
Methods          NMSE         NMAE         MASE         sMAPE        PER
SNARX [4]        (7) 2.4531   (7) 1.3415   (7) 1.1847   (7) 0.4958   (7) 0.6781
WIAE-GPF         (1) 0.0098   (1) 0.2738   (1) 0.2418   (1) 0.4493   (1) 0.0606
TLAE [21]        (5) 0.9592   (5) 0.9785   (5) 0.8641   (4) 0.4785   (4) 0.3692
DeepVAR [11]     (6) 1.8986   (3) 0.7224   (3) 0.6380   (5) 0.4806   (3) 0.3505
BWGVT [19]       (3) 0.9053   (4) 0.8525   (4) 0.7529   (3) 0.4674   (2) 0.2313
Pyraformer [34]  (4) 0.9478   (6) 1.2674   (6) 1.1193   (6) 0.4909   (6) 0.6738
Informer [33]    (2) 0.8045   (2) 0.4185   (2) 0.4252   (2) 0.4580   (5) 0.5487
Table 6: Probabilistic forecasting results of spread forecasting between NYISO and PJM.
Methods          CRPS           CPE (90%) [NCW]       CPE (50%) [NCW]       CPE (10%) [NCW]
SNARX [4]        (5) 120.0403   (5) -0.5443 [5.0419]  (5) -0.2958 [1.4555]  (2) -0.0195 [0.6015]
WIAE-GPF         (1)   4.0329   (1)  0.0215 [0.4427]  (2) -0.0274 [0.5255]  (3)  0.0212 [0.1692]
TLAE [21]        (2)  15.5195   (3) -0.0443 [0.7745]  (1)  0.0052 [0.8784]  (5) -0.0933 [0.3133]
DeepVAR [11]     (4)  32.8296   (2) -0.0355 [2.0279]  (3) -0.1480 [1.4739]  (1) -0.0021 [0.4198]
BWGVT [19]       (3)  31.5660   (4)  0.0989 [5.1788]  (4)  0.1835 [6.0939]  (4)  0.0656 [4.0473]
Figure 6: Trajectories of the interregional LMP spread between NYISO and PJM and its predictions generated by selected methods: (a) WIAE-GPF, (b) Pyraformer, (c) SNARX, (d) DeepVAR.

As seen from Tables 5 and 6, WIAE-GPF performed better than all other techniques under all point forecasting metrics and CRPS. TLAE was the second-best in CRPS (15.5195) but slightly worse than BWGVT when evaluated under point estimation metrics. Its sequential sampling of the latent Gaussian process added to its numerical instability. BWGVT was the overall second-best performing probabilistic technique. Its transformer architecture, with enhanced capability of capturing long-term temporal dependency, did not offer much gain given the training difficulty imposed by the increasing number of deep-learning parameters; see Sec. 4.6. BWGVT also exhibited the tendency to predict wide intervals covering more than the nominal percentage. The point estimation techniques, namely Pyraformer and Informer, were not competitive when evaluated under point estimation metrics other than NMSE. Among probabilistic methods, DeepVAR performed similarly to the LLM methods, and SNARX had the most difficulties. These (semi)parametric methods suffered from model mismatch and were sensitive to sudden changes. Thus, shifted peaks and valleys were often observed in their predictions.

The same observation can also be made from Fig. 6. WIAE-GPF had the most stable prediction of interregional LMP spreads, which is corroborated by its smallest NMSE and NMAE. Pyraformer also followed the trend of LMP spreads accurately but with higher variance. The two AR-based parametric models, SNARX and DeepVAR, exhibited the tendency to predict shifted spikes and failed to catch the rapid and dramatic changes of the LMP spread.

4.5 Area Control Error Forecasting for Reserve Market Participants

ACE is defined as the difference between actual and scheduled load-generation imbalance, adjusted by the area frequency deviation [44]. It is the control signal for frequency regulation, and its probabilistic forecasting is especially important for the operator to procure resources and for market participants to bid in the regulation ancillary service market.

In this subsection, we present the simulation results of 5-minute-ahead forecasting of ACE. We utilized the ACE data from Jan. 24th to 26th, 2023, collected by PJM. The ACE signal is measured every 15 seconds and can be quite volatile, as shown by the trajectory in Fig. 7.

Figure 7: Trajectory of ACE at PJM, Jan. 24th - 26th, 2023.
Table 7: Estimation results of ACE forecasting for PJM. The prediction step is 5-minute ahead.
Methods          NMSE         NMAE         MASE         sMAPE
SNARX [4]        (6) 1.1922   (6) 1.1303   (6) 0.7029   (6) 0.4605
WIAE-GPF         (1) 0.5957   (1) 0.7555   (1) 0.4698   (1) 0.1059
TLAE [21]        (5) 1.1727   (5) 1.0605   (5) 0.6595   (3) 0.2782
DeepVAR [11]     (7) 1.4431   (7) 1.1750   (7) 0.7307   (5) 0.3952
BWGVT [19]       (3) 0.9562   (2) 0.9793   (2) 0.6090   (4) 0.3168
Pyraformer [34]  (4) 0.9783   (4) 0.9948   (4) 0.6186   (7) 0.4986
Informer [33]    (2) 0.6006   (3) 0.9819   (3) 0.6106   (2) 0.2247
Table 8: Probabilistic estimation results of ACE forecasting for PJM. The prediction step is 5-minute ahead.
Methods          CRPS         CPE (90%) [NCW]       CPE (50%) [NCW]       CPE (10%) [NCW]
SNARX [4]        (5) 2.1007   (5) -0.8260 [1.6759]  (5) -0.4797 [1.6803]  (2)  0.0343 [5.5768]
WIAE-GPF         (1) 0.0081   (1) -0.0016 [0.9199]  (1)  0.0321 [0.9336]  (1) -0.0132 [0.8885]
TLAE [21]        (4) 1.5541   (4) -0.7857 [0.0004]  (4) -0.4489 [0.0005]  (4) -0.0957 [0.0027]
DeepVAR [11]     (3) 1.2947   (3) -0.3526 [0.5665]  (3) -0.2560 [0.5296]  (3) -0.0521 [0.5434]
BWGVT [19]       (2) 1.2488   (2)  0.0065 [1.8309]  (2)  0.0754 [2.0385]  (5)  0.0996 [2.4261]

WIAE-GPF achieved better performance than the other methods, with CRPS less than 0.01 and sMAPE less than 11%, as shown by the WIAE-GPF rows of Tables 7 and 8. We credit the strong performance of WIAE-GPF to the simplicity of its latent process and its Bayesian sufficiency. BWGVT ranked second among all methods since the ACE data has few outliers, but its CRPS of 1.2488 was dramatically larger than that of WIAE-GPF. Its CPE and NCW for the 10% interval prediction also showed that it could not accurately predict a narrow interval. Pyraformer and Informer, trained with NMSE, had better performance under NMSE but worse under NMAE. With NMSE over 110%, TLAE had the worst performance among GPF methods. DeepVAR and SNARX performed worse than the other forecasting methods, with NMSE and NMAE larger than 110%, possibly due to model mismatch.

4.6 Discussion: On using LLM

Table 9: Statistics that model the long-range dependency of time series.
Metrics          Real Time   Interchange Spread   ACE
Hurst Exponent   0.5257      0.5301               0.5351
DFA              0.6053      0.6614               0.8609

The success of LLM-based prediction in natural language processing ignited broad interest in adopting LLM models in various applications, including electricity price forecasting with BWGVT, Pyraformer, and Informer. Our experiments showed that the innovation-based method (WIAE) performed uniformly better than the three LLM techniques; the only case in which WIAE was not the best among all forecasters was the real-time prediction at LONGIL & NYC under NMSE, where TLAE ranked first (Table 3). Note that the innovation representation used in WIAE can model long-range dependencies of the random process, but not explicitly. WIAE does not include attention modeling.

As LLM-based forecasting techniques, Pyraformer, Informer, and BWGVT performed better than the more conventional SNARX (see Tables 3, 5, and 7). Compared with the more straightforward deep learning method DeepVAR, LLM-based forecasting did not show clear advantages. The same can be said when they were compared with TLAE. The authors of [20] pointed out that a simple convolutional neural network outperformed RNNs and LLMs on imbalance price forecasting, because the forecasted time series are not a good fit for complicated deep learning models.

To understand if long-range dependencies matter in the probabilistic forecasting of electricity market signals, we examined the characteristics of LMP signals using the Hurst exponent and Detrended Fluctuation Analysis (DFA) as indicators for the long-range dependencies of LMP; both parameters had the range [0,1], with deviation from 0.5 indicating symptoms of long-range dependencies.
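
As an illustration of the second indicator, the following Python sketch estimates a DFA scaling exponent. The estimator settings behind Table 9 are not specified in the text, so this is a textbook first-order DFA under assumed settings: linear detrending in non-overlapping windows, a hypothetical function name `dfa_exponent`, and window sizes chosen by the caller.

```python
import numpy as np


def dfa_exponent(x, scales):
    """First-order detrended fluctuation analysis (DFA) scaling exponent.

    Returns the slope of log F(s) versus log s, where F(s) is the root mean
    square residual of the integrated series after removing a linear trend
    in each non-overlapping window of length s.
    """
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())            # integrated, mean-removed series
    fluctuations = []
    for s in scales:
        n_win = len(profile) // s
        segments = profile[: n_win * s].reshape(n_win, s)
        t = np.arange(s)
        # Least-squares line for every window at once; coeffs has shape (2, n_win).
        coeffs = np.polyfit(t, segments.T, deg=1)
        trend = np.outer(t, coeffs[0]) + coeffs[1]
        residual = segments.T - trend
        fluctuations.append(np.sqrt(np.mean(residual ** 2)))
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), deg=1)
    return slope


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    white_noise = rng.standard_normal(8192)      # series with no memory
    print(dfa_exponent(white_noise, scales=[16, 32, 64, 128, 256]))
```

For a memoryless series the estimated slope should be close to 0.5, which is the reference value against which the entries of Table 9 are read.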

Table 9 shows the estimated Hurst exponent and DFA. The Hurst exponent and DFA slope displayed a slight deviation from 0.5. English and Korean literature [45, 46] is known to have long-range dependencies, with Hurst exponents ranging from 0.64 to 0.73. In comparison, the long-range effect in real-time electricity market signals is minimal.

While further studies are necessary, LLMs may not be suitable for electricity market signals where long-range dependencies are not evident. Indeed, real-time LMPs are computed either on an interval-by-interval basis or as part of a short sliding-window economic dispatch. Any temporal coupling is a result of temporal dependencies of demand and supply, neither of which has been shown to have long-range dependencies. An unproven hypothesis is that the model complexity of LLMs may offset any benefit they may bring to price forecasting.

5 Conclusion

This paper presents WIAE-GPF, a generative AI approach to probabilistic forecasting of nonparametric time series based on the innovation representation pioneered by Wiener, Kallianpur, and Rosenblatt six decades earlier. Three take-away conclusions stand out. One is that the innovation representation ensures that WIAE-GPF produces the correct conditional probability distributions under perfect learning condition assumptions. To our knowledge, WIAE-GPF is the first nonparametric GPF technique with such a theoretical guarantee. Second, WIAE-GPF demonstrated superior performance against major machine learning-based probabilistic forecasting techniques in our numerical studies using actual market data, including some of the advanced approaches involving transformer architecture, attention mechanism, and large language models. In addition, the local stationarity hypothesis assumed by the weak innovation representation appeared to hold well. Third, with Bayesian sufficiency established in this paper, Rosenblatt’s weak innovation representation of time series can be considered as a canonical representation and a powerful tool for stochastic decision making. Its applications in anomaly detection in power systems can be found in [47, 48, 49].

Finally, the black-box style of deep learning is often criticized for its lack of interpretability; some consider such methods black magic that produces miraculous results. It is worth mentioning that WIAE-GPF has a highly intuitive and interpretable architecture parallel to that of classic Kalman filtering. In particular, Kalman filtering extracts the innovation representation as part of the measurement update, followed by the time-updated prediction according to the state-space model. WIAE-GPF extracts innovations by the weak innovation encoder and produces time-updated predictions through the weak innovation decoder. In this context, WIAE-GPF is a generalization of Kalman filtering to nonparametric and non-Gaussian settings.
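
To make the Kalman filtering side of this analogy concrete, the following Python sketch runs a scalar Kalman filter on a hypothetical linear Gaussian model and returns its normalized innovations. The model (an AR(1) state observed in noise), the parameter values, and the function names are illustrative assumptions and are not part of WIAE-GPF; the sketch only shows the two steps named above, the measurement update that extracts the innovation and the time update that produces the next prediction.

```python
import numpy as np


def simulate(a, q, r, n, rng):
    """Simulate x[t+1] = a*x[t] + w[t], y[t] = x[t] + v[t] (hypothetical model)."""
    x = rng.normal(scale=np.sqrt(q / (1 - a**2)))  # draw from the stationary law
    y = np.empty(n)
    for t in range(n):
        y[t] = x + rng.normal(scale=np.sqrt(r))
        x = a * x + rng.normal(scale=np.sqrt(q))
    return y


def kalman_innovations(y, a, q, r):
    """Normalized innovations of the scalar Kalman filter for the model above."""
    x_pred, p_pred = 0.0, q / (1 - a**2)   # prior: stationary mean and variance
    out = np.empty(len(y))
    for t, obs in enumerate(y):
        # Measurement update: the innovation is the part of y[t] that is not
        # predicted from past observations.
        s = p_pred + r                      # innovation variance
        e = obs - x_pred                    # innovation
        out[t] = e / np.sqrt(s)
        gain = p_pred / s
        x_filt = x_pred + gain * e
        p_filt = (1 - gain) * p_pred
        # Time update: propagate the estimate through the state-space model.
        x_pred = a * x_filt
        p_pred = a**2 * p_filt + q
    return out


def lag1_autocorr(z):
    """Sample lag-1 autocorrelation, used here as a simple whiteness check."""
    z = np.asarray(z, dtype=float) - np.mean(z)
    return np.dot(z[:-1], z[1:]) / np.dot(z, z)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = simulate(a=0.9, q=1.0, r=0.5, n=5000, rng=rng)
    e = kalman_innovations(y, a=0.9, q=1.0, r=0.5)
    print(lag1_autocorr(y), lag1_autocorr(e))
```

When the model is correctly specified, the observations are strongly correlated in time while the normalized innovations are approximately uncorrelated with unit variance.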

Comments on the limitation and future work are in order. WIAE-GPF is derived based on the innovation representation of stationary processes. Extensions to certain classes of nonstationary processes would be the natural next step. Note that an innovation representation exists for nonstationary Gaussian processes with a time-varying state space model. Extension of WIAE-GPF to nonstationary time series under regime-switching models is also a natural next step, given the evidence of effective applications of regime-switching techniques in price forecasting. See [20, 19] and references therein.

Appendix A Proof of Theorem 2

Let $(\bar{\mathbf{V}}_t^{(m)})$ and $(\bar{\bm{X}}_t^{(m)})$ denote the latent process and the reconstruction sequence, under weights $\bar{\theta}_m$ and $\bar{\eta}_m$:

\begin{align*}
\bar{\mathbf{V}}_t^{(m)} &= G_{\bar{\theta}}^{(m)}(\bm{X}_t, \bm{X}_{t-1}, \cdots, \bm{X}_{t-m+1}), \\
\bar{\bm{X}}_t^{(m)} &= H_{\bar{\eta}}^{(m)}(\bar{\mathbf{V}}_t, \bar{\mathbf{V}}_{t-1}, \cdots, \bar{\mathbf{V}}_{t-m+1}).
\end{align*}

We define the loss of a WIAE pair $(G_{\theta},H_{\eta})$ achieved under a pair of $m$-dimensional discriminators as

$$\begin{aligned}
L^{(m)}(\theta,\eta) := \max_{\gamma,\omega}\Big(&\mathbb{E}\big[D_{\gamma}^{(m)}(\mathbf{U}_{t:t-m+1})\big]-\mathbb{E}\big[D_{\gamma}^{(m)}(\hat{\mathbf{V}}_{t:t-m+1})\big]\\
&+\lambda\Big(\mathbb{E}\big[D_{\omega}^{(m)}(\bm{X}_{t-m+2:t+T})\big]-\mathbb{E}\big[D_{\omega}^{(m)}\big((\bm{X}_{t-m+2:t},\hat{\bm{X}}_{t+1:t+T})\big)\big]\Big)\Big).
\end{aligned}$$

We first show that $L^{(m)}(\theta_m^*,\eta_m^*)\rightarrow 0$ as $m\rightarrow\infty$, where $(\theta_m^*,\eta_m^*)$ denotes the optimal weights of $(G_{\theta}^{(m)},H_{\eta}^{(m)})$ obtained by minimizing (3).

Following the line of [50], we define the distance between two random processes $(\bm{X}_t)$ and $(\mathbf{Y}_t)$ by the expected $\ell_{\infty}$ norm:

$$d\left((\bm{X}_t),(\mathbf{Y}_t)\right) := \mathbb{E}\left[\sup_t \lvert \bm{X}_t-\mathbf{Y}_t\rvert\right].$$

The uniform convergence assumed in assumption A2 is also defined on metric spaces with distance measure $d(\cdot,\cdot)$. Hence, by assumption A2, $G_{\bar{\theta}}^{(m)}\rightarrow G$ uniformly, which implies that, for every $\epsilon>0$, there exists an $M_1$ such that, for all $m>M_1$, $d\left((\bar{\mathbf{V}}_t^{(m)}),(\mathbf{V}_t)\right)<\epsilon$. Thus, for every bounded and continuous $F:\ell^{\infty}(T)\to\mathbb{R}$,

$$d\left(F((\mathbf{V}_t)),F((\bar{\mathbf{V}}_t^{(m)}))\right)<\delta(\epsilon).$$

In other words,

$$\lim_{m\rightarrow\infty}\mathbb{E}\left[F((\bar{\mathbf{V}}_t^{(m)}))\right]=\mathbb{E}\left[F((\mathbf{V}_t))\right],$$

which fulfills the definition of weak convergence. Therefore,

$$\bar{\mathbf{V}}_{t:t-m+1}^{(m)}\stackrel{d}{\rightarrow}\mathbf{V}_{t:t-m+1}, \tag{10}$$

due to the fact that convergence in expectation implies convergence in distribution.

Similarly, by the uniform convergence of $H_{\bar{\eta}}^{(m)}$ to $H$, there exists an $M_2$ such that, for all $m>M_2$,

$$d\left((\hat{\bm{X}}_t),(H_{\bar{\eta}}^{(m)}((\mathbf{V}_t)))\right)<\epsilon,$$

where $(H_{\bar{\eta}}^{(m)}((\mathbf{V}_t)))$ represents the random sequence generated by passing $(\mathbf{V}_t)$ through $H_{\bar{\eta}}^{(m)}$. Thus, for every bounded and continuous $F:\ell^{\infty}(T)\to\mathbb{R}$,

$$d\left(F((\hat{\bm{X}}_t)),F((H_{\bar{\eta}}^{(m)}((\mathbf{V}_t))))\right)<\delta(\epsilon).$$

Hence $(H_{\bar{\eta}}^{(m)}((\mathbf{V}_t)))$ converges in distribution to $(\hat{\bm{X}}_t)$. Since $H$ is continuous and $H_{\bar{\eta}}^{(m)}$ converges uniformly to $H$, $H_{\bar{\eta}}^{(m)}$ is also continuous. Thus, by the continuous mapping theorem,

$$\bar{\mathbf{V}}_{t-m+1:t}^{(m)}\stackrel{d}{\rightarrow}\mathbf{V}_{t-m+1:t}\;\Rightarrow\; H_{\bar{\eta}}^{(m)}(\bar{\mathbf{V}}_{t-m+1:t}^{(m)})\stackrel{d}{\rightarrow}H_{\bar{\eta}}^{(m)}\left(\mathbf{V}_{t-m+1:t}\right),$$

that is, $\bar{\bm{X}}_t^{(m)}\stackrel{d}{\rightarrow}H_{\bar{\eta}}^{(m)}(\mathbf{V}_t,\cdots,\mathbf{V}_{t-m+1})$. Therefore,

$$(\bm{X}_{t-m+2:t},\bar{\bm{X}}_{t+T}^{(m)})\stackrel{d}{\rightarrow}(\bm{X}_{t-m+2:t},\hat{\bm{X}}_{t+T})\stackrel{d}{=}(\bm{X}_{t-m+2:t},\bm{X}_{t+T}). \tag{11}$$

By (10) and (11), $L^{(m)}(\bar{\theta}_m,\bar{\eta}_m)\rightarrow 0$. Since $\theta_m^*$ and $\eta_m^*$ are the optimal parameters obtained by minimizing (3) evaluated by $m$-dimensional discriminators $(D_{\omega}^{(m)},D_{\gamma}^{(m)})$,

$$L^{(m)}(\theta_m^*,\eta_m^*) := \min_{\theta,\eta}L^{(m)}(\theta,\eta)\leq L^{(m)}(\bar{\theta}_m,\bar{\eta}_m)\rightarrow 0.$$

Because $L^{(m)}(\theta_m^*,\eta_m^*)\rightarrow 0$ as $m\rightarrow\infty$, $\bm{V}_{t:t-m+1}^{(m)}\stackrel{d}{\rightarrow}\bm{V}_{t:t-m+1}^{(m)}$ and $(\bm{X}_{t-m+2:t},\hat{\bm{X}}_{t+T}^{(m)})\stackrel{d}{\rightarrow}(\bm{X}_{t-m+2:t},\bm{X}_{t+T}^{(m)})$ follow directly from the equivalence of convergence in Wasserstein distance and convergence in distribution [51].
Since the discriminator dimensionality also goes to $\infty$, we have $(\bm{X}_{0:t},\hat{\bm{X}}_{t+T}^{(m)})\stackrel{d}{\rightarrow}(\bm{X}_{0:t},\bm{X}_{t+T})$. Further, the convergence in distribution of $\hat{\bm{X}}_{t+T}^{(m)}\,|\,\bm{X}_{0:t}=\bm{x}_{0:t}$ to $\bm{X}_{t+T}\,|\,\bm{X}_{0:t}=\bm{x}_{0:t}$ follows from a simple application of Bayes' rule. $\square$

Appendix B Definition of Metrics for Time Series Forecasting

Given the original time series $(\bm{x}_t)$, the forecasts $(\tilde{\bm{x}}_t)$, the dataset size $N$, and the prediction step $T$, the point estimation metrics are computed as

$$\begin{aligned}
\mbox{NMSE} &= \frac{\frac{1}{N-T}\sum_{t=T+1}^{N}(\bm{x}_t-\tilde{\bm{x}}_{t-T})^2}{\frac{1}{N-T}\sum_{t=T+1}^{N}\bm{x}_t^2},\\
\mbox{NMAE} &= \frac{\frac{1}{N-T}\sum_{t=T+1}^{N}|\bm{x}_t-\tilde{\bm{x}}_{t-T}|}{\frac{1}{N-T}\sum_{t=T+1}^{N}|\bm{x}_t|},\\
\mbox{MASE} &= \frac{\frac{1}{N-T}\sum_{t=T+1}^{N}|\bm{x}_t-\tilde{\bm{x}}_{t-T}|}{\frac{1}{N-T}\sum_{t=T+1}^{N}|\bm{x}_t-\bm{x}_{t-T}|},\\
\mbox{sMAPE} &= \frac{1}{N-T}\sum_{t=T+1}^{N}\frac{|\bm{x}_t-\tilde{\bm{x}}_{t-T}|}{(|\bm{x}_t|+|\tilde{\bm{x}}_{t-T}|)/2}.
\end{aligned}$$

The purpose of adopting multiple metrics is to evaluate the forecasting performance comprehensively. NMSE and NMAE evaluate the overall performance, and MASE reflects the performance relative to the naive forecaster: methods with MASE smaller than $1$ outperform the naive forecaster. sMAPE is the symmetric counterpart of the mean absolute percentage error (MAPE) and is bounded both above and below. For electricity datasets, the actual values can be very close to $0$, which nullifies the effectiveness of MAPE, so we regard sMAPE as the better metric.
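As an illustration, the four point metrics can be computed as in the following minimal sketch. The function name `point_metrics` and the array alignment are assumptions made for this example, not part of the original study: `x_fcst[i]` is taken to be the $T$-step-ahead forecast of `x[i]`, the series is scalar, and every sum runs over $t=T+1,\dots,N$.

```python
import numpy as np

def point_metrics(x, x_fcst, T):
    """Point-forecast metrics for a scalar series (illustrative sketch).

    x      : observations x_1..x_N, shape (N,)
    x_fcst : x_fcst[i] is the T-step-ahead forecast of x[i]; the first T
             entries are ignored (assumed alignment, see lead-in)
    T      : forecast horizon, a positive integer
    """
    x = np.asarray(x, dtype=float)
    x_fcst = np.asarray(x_fcst, dtype=float)
    actual, pred = x[T:], x_fcst[T:]   # t = T+1..N in the text's indexing
    naive = x[:-T]                     # naive forecaster: x_{t-T}
    abs_err = np.abs(actual - pred)
    return {
        "NMSE": np.mean((actual - pred) ** 2) / np.mean(actual ** 2),
        "NMAE": np.mean(abs_err) / np.mean(np.abs(actual)),
        "MASE": np.mean(abs_err) / np.mean(np.abs(actual - naive)),
        "sMAPE": np.mean(abs_err / ((np.abs(actual) + np.abs(pred)) / 2)),
    }

# A forecaster that simply repeats the last observation has MASE equal to 1.
example = point_metrics([1, 2, 3, 4, 5], [0, 1, 2, 3, 4], T=1)
```

In this toy call the forecast coincides with the naive forecaster, so the MASE is exactly one, which matches the interpretation of MASE given above.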

For probabilistic methods, we further evaluate the continuous ranked probability score (CRPS), which is computed as

$$\mbox{CRPS}=\frac{1}{N-T}\sum_{t=T+1}^{N}\left(\int_{\mathbb{R}}\left(\tilde{F}(\bm{x}|\bm{x}_{0:t-T})-\mathbb{I}_{\bm{x}_t\leq\bm{x}}\right)^2 d\bm{x}\right),$$

where $\mathbb{I}$ is the indicator function and $\tilde{F}(\bm{x}|\bm{x}_{0:t-T})$ is the empirical cumulative distribution function (c.d.f.) of $\tilde{\bm{X}}_{t-T}$ conditioned on $\bm{X}_{0:t-T}=\bm{x}_{0:t-T}$, as predicted by the probabilistic forecasting method. CRPS is equivalent to comparing the empirical conditional c.d.f. forecasted by probabilistic methods with the indicator c.d.f. $\mathbb{I}_{\tilde{\bm{x}}_{t-T}>\bm{x}_t}$ of the true value $\bm{x}_t$. It can be viewed as a generalization of MAE to probabilistic methods.
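For a sample-based forecaster, the integral in the CRPS can be evaluated exactly, because both the empirical c.d.f. and the indicator are step functions. The sketch below is one possible implementation under that assumption; the names `crps_from_samples` and `mean_crps` are hypothetical, and the forecast for each time step is assumed to be given as a finite set of scalar samples.

```python
import numpy as np

def crps_from_samples(samples, y):
    """CRPS of one observation y against the empirical c.d.f. of `samples`.

    The integrand (F(x) - 1{y <= x})^2 is constant between consecutive
    breakpoints (the samples and y), so the integral is a finite sum.
    """
    s = np.sort(np.asarray(samples, dtype=float))
    pts = np.sort(np.append(s, y))                       # all breakpoints
    left, width = pts[:-1], np.diff(pts)                 # one interval per pair
    F = np.searchsorted(s, left, side="right") / s.size  # empirical c.d.f. on the interval
    step = (left >= y).astype(float)                     # indicator 1{y <= x} on the interval
    return float(np.sum((F - step) ** 2 * width))

def mean_crps(forecast_samples, actual):
    """Average CRPS over time steps; forecast_samples[i] holds the samples for actual[i]."""
    return float(np.mean([crps_from_samples(s, y)
                          for s, y in zip(forecast_samples, actual)]))
```

With a single sample per time step, `crps_from_samples` reduces to the absolute error, which is the sense in which CRPS generalizes MAE.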

The coverage probability (CP) of a confidence interval predictor is the (estimated) probability that the ground truth falls within the predicted interval. For a $T$-step prediction of $\beta\%$-intervals, we denote the upper and lower bounds by $\hat{U}_{t|t-T,\beta}$ and $\hat{L}_{t|t-T,\beta}$, respectively. CP is computed as

$$\mbox{CP}(\beta\%)=\frac{1}{N-T}\sum_{t=T+1}^{N}\mathbb{I}_{\bm{x}_t\in[\hat{L}_{t|t-T,\beta},\hat{U}_{t|t-T,\beta}]}.$$

The closer the CP is to its nominal value $\beta\%$, the more accurate the prediction is. Thus, the coverage probability error (CPE) is often adopted for evaluation. CPE measures the deviation of CP from its nominal value $\beta\%$:

$$\mbox{CPE}(\beta\%)=\mbox{CP}(\beta\%)-\beta\%.$$

A CPE closer to zero indicates a more accurate prediction interval estimate.
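The two coverage measures can be sketched as follows. The helper names are hypothetical, the nominal level `beta` is passed as a fraction (0.9 for a 90% interval), and the central interval built from empirical quantiles of generated samples is an assumption of this example rather than a statement of how the intervals were obtained in the experiments.

```python
import numpy as np

def central_interval(samples, beta):
    """Central beta-interval from forecast samples of shape (num_times, num_samples)."""
    samples = np.asarray(samples, dtype=float)
    lower = np.quantile(samples, (1 - beta) / 2, axis=1)
    upper = np.quantile(samples, (1 + beta) / 2, axis=1)
    return lower, upper

def coverage(actual, lower, upper, beta):
    """Return (CP, CPE): empirical coverage and its deviation from the nominal level."""
    actual = np.asarray(actual, dtype=float)
    inside = (actual >= np.asarray(lower)) & (actual <= np.asarray(upper))
    cp = float(np.mean(inside))
    return cp, cp - beta
```

A negative CPE means the intervals cover the ground truth less often than the nominal level; a positive CPE means they cover it more often.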

Although CP and CPE are widely adopted for their simplicity, they only estimate the unconditional coverage and therefore do not measure the accuracy of the coverage based on the forecasted conditional probability distribution. This limitation was discussed in [52].

In particular, while a good forecaster produces small CPE and a forecaster with high CPE must be a poor forecaster, a forecaster producing small CPE may not be a good forecaster. To this end, the normalized coverage width (NCW) can be used as a secondary measure. NCW is defined as

$$\mbox{NCW}(\beta\%)=\frac{1}{N-T}\sum_{t=T+1}^{N}\frac{\hat{U}_{t|t-T,\beta}-\hat{L}_{t|t-T,\beta}}{\hat{U}_{\beta}-\hat{L}_{\beta}},$$

where $\hat{U}_{\beta}$ and $\hat{L}_{\beta}$ are the upper and lower bounds of the prediction interval estimated from the empirical quantiles of the testing data. For instance, when predicting a $90\%$ interval, $\hat{U}_{90}$ is the empirical $0.95$-quantile of the testing set, whereas $\hat{L}_{90}$ is the empirical $0.05$-quantile of the testing set. As a result, NCW is the average width of the predicted intervals, normalized by the width of the interval estimated from the empirical marginal distribution of the testing set. One would expect that conditioning on observations yields a more concentrated prediction interval than the interval estimated from the unconditional distribution. Hence, a method with NCW smaller than $1$ estimates the prediction interval more accurately than the unconditional estimate. At a similar level of CP, the method with the smaller NCW shows better accuracy in prediction interval estimation.
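A minimal sketch of the NCW computation is given below. The function name `normalized_coverage_width` is hypothetical, `beta` is again a fraction, and the reference interval is taken from NumPy's default (linearly interpolated) empirical quantiles of the testing set, which is an assumption about the quantile estimator.

```python
import numpy as np

def normalized_coverage_width(lower, upper, test_values, beta):
    """Average predicted interval width divided by the unconditional interval width.

    lower, upper : predicted interval bounds for each evaluated time step
    test_values  : testing-set observations used for the unconditional quantiles
    beta         : nominal level as a fraction, e.g. 0.9
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    lo_q, hi_q = np.quantile(np.asarray(test_values, dtype=float),
                             [(1 - beta) / 2, (1 + beta) / 2])
    return float(np.mean((upper - lower) / (hi_q - lo_q)))
```

Values below one indicate that the conditional intervals are, on average, narrower than the unconditional interval at the same nominal level.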

References

  • [1] A. Green, The great acceleration: CIO perspectives on generative AI, Tech. rep., MIT Technology Review Insights (2023).
    URL https://www.databricks.com/sites/default/files/2023-07/ebook_mit-cio-generative-ai-report.pdf
  • [2] W. Härdle, H. Lütkepohl, R. Chen, A review of nonparametric time series analysis, International Statistical Review / Revue Internationale de Statistique 65 (1) (1997) 49–72. doi:10.2307/1403432.
  • [3] J. Nowotarski, R. Weron, Computing electricity spot price prediction intervals using quantile regression and forecast averaging, Computational Statistics 30 (3) (2015) 791–803. doi:10.1007/s00180-014-0523-0.
  • [4] R. Weron, A. Misiorek, Forecasting spot electricity prices: A comparison of parametric and semiparametric time series models, International Journal of Forecasting 24 (4) (2008) 744–763. doi:10.1016/j.ijforecast.2008.08.004.
  • [5] T. Hong, P. Pinson, Y. Wang, R. Weron, D. Yang, H. Zareipour, Energy forecasting: A review and outlook, IEEE Open Access Journal of Power and Energy 7 (2020) 376–388. doi:10.1109/OAJPE.2020.3029979.
  • [6] M. Zhou, Z. Yan, Y. X. Ni, G. Li, Y. Nie, Electricity price forecasting with confidence-interval estimation through an extended ARIMA approach, IEE Proc.-Gener. Transmiss. Distrib. 153 (2) (2006) 187–195.
  • [7] J. P. González, A. M. S. Muñoz San Roque, E. A. Pérez, Forecasting functional time series with a new Hilbertian ARMAX model: Application to electricity price forecasting, IEEE Transactions on Power Systems 33 (1) (2018) 545–556. doi:10.1109/TPWRS.2017.2700287.
  • [8] B. Uniejewski, R. Weron, Regularized quantile regression averaging for probabilistic electricity price forecasting, Energy Economics 95 (2021) 105121. doi:10.1016/j.eneco.2021.105121.
  • [9] L. M. Lima, P. Damien, D. W. Bunn, Bayesian predictive distributions for imbalance prices with time-varying factor impacts, IEEE Transactions on Power Systems 38 (1) (2023) 349–357. doi:10.1109/TPWRS.2022.3165149.
  • [10] S. Chai, Z. Xu, Y. Jia, Conditional density forecast of electricity price based on ensemble ELM and logistic EMOS, IEEE Transactions on Smart Grid 10 (3) (2019) 3031–3043. doi:10.1109/TSG.2018.2817284.
  • [11] D. Salinas, M. Bohlke-Schneider, L. Callot, R. Medico, J. Gasthaus, High-dimensional multivariate forecasting with low-rank gaussian copula processes, in: H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, R. Garnett (Eds.), Advances in Neural Information Processing Systems, Vol. 32, Curran Associates, Inc., 2019.
  • [12] G. Dudek, Multilayer perceptron for GEFCom2014 probabilistic electricity price forecasting, International Journal of Forecasting 32 (3) (2016) 1057–1060. doi:10.1016/j.ijforecast.2015.11.009.
  • [13] D. Lee, H. Shin, R. Baldick, Bivariate probabilistic wind power and real-time price forecasting and their applications to wind power bidding strategy development, IEEE Transactions on Power Systems 33 (6) (2018) 6087–6097. doi:10.1109/TPWRS.2018.2830785.
  • [14] J. Nowotarski, R. Weron, Recent advances in electricity price forecasting: A review of probabilistic forecasting, Renewable and Sustainable Energy Reviews 81 (2018) 1548–1568. doi:10.1016/j.rser.2017.05.234.
  • [15] D. J. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Fifth Edition, Chapman and Hall/CRC., 2011.
  • [16] B. Uniejewski, R. Weron, Regularized quantile regression averaging for probabilistic electricity price forecasting, Energy Economics 95 (2021) 105121. doi:10.1016/j.eneco.2021.105121.
    URL https://www.sciencedirect.com/science/article/pii/S0140988321000268
  • [17] C. Zhang, Y. Fu, Probabilistic electricity price forecast with optimal prediction interval, IEEE Transactions on Power Systems (2023) 1–10doi:10.1109/TPWRS.2023.3235193.
  • [18] J.-F. Toubeau, T. Morstyn, J. Bottieau, K. Zheng, D. Apostolopoulou, Z. De Grève, Y. Wang, F. Vallée, Capturing spatio-temporal dependencies in the probabilistic forecasting of distribution locational marginal prices, IEEE Transactions on Smart Grid 12 (3) (2021) 2663–2674. doi:10.1109/TSG.2020.3047863.
  • [19] J. Bottieau, Y. Wang, Z. De Grève, F. Vallée, J.-F. Toubeau, Interpretable transformer model for capturing regime switching effects of real-time electricity prices, IEEE Transactions on Power Systems 38 (3) (2023) 2162–2176. doi:10.1109/TPWRS.2022.3195970.
  • [20] V. N. Ganesh, D. Bunn, Forecasting imbalance price densities with statistical methods and neural networks, IEEE Transactions on Energy Markets, Policy and Regulation 2 (1) (2024) 30–39. doi:10.1109/TEMPR.2023.3293693.
  • [21] N. Nguyen, B. Quanz, Temporal Latent Auto-Encoder: A Method for Probabilistic Multivariate Time Series Forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (10) (2021) 9117–9125, number: 10. doi:10.1609/aaai.v35i10.17101.
  • [22] Z. Zheng, L. Wang, L. Yang, Z. Zhang, Generative Probabilistic Wind Speed Forecasting: A Variational Recurrent Autoencoder Based Method, IEEE Transactions on Power Systems 37 (2) (2022) 1386–1398, conference Name: IEEE Transactions on Power Systems. doi:10.1109/TPWRS.2021.3105101.
  • [23] L. Li, J. Zhang, J. Yan, Y. **, Y. Zhang, Y. Duan, G. Tian, Synergetic Learning of Heterogeneous Temporal Sequences for Multi-Horizon Probabilistic Forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (10) (2021) 8420–8428, number: 10. doi:10.1609/aaai.v35i10.17023.
  • [24] M. Khodayar, S. Mohammadi, M. E. Khodayar, J. Wang, G. Liu, Convolutional Graph Autoencoder: A Generative Deep Neural Network for Probabilistic Spatio-Temporal Solar Irradiance Forecasting, IEEE Transactions on Sustainable Energy 11 (2) (2020) 571–583, conference Name: IEEE Transactions on Sustainable Energy. doi:10.1109/TSTE.2019.2897688.
  • [25] K. Rasul, A.-S. Sheikh, I. Schuster, U. M. Bergmann, R. Vollgraf, Multivariate Probabilistic Time Series Forecasting via Conditioned Normalizing Flows, 2022.
  • [26] Y. Li, X. Lu, Y. Wang, D. Dou, Generative time series forecasting with diffusion, denoise, and disentanglement, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, Vol. 35, Curran Associates, Inc., 2022, pp. 23009–23022.
  • [27] A. Koochali, P. Schichtel, A. Dengel, S. Ahmed, Probabilistic Forecasting of Sensory Data With Generative Adversarial Networks – ForGAN, IEEE Access 7 (2019) 63868–63880, conference Name: IEEE Access. doi:10.1109/ACCESS.2019.2915544.
  • [28] K. Yeo, Z. Li, W. Gifford, Generative Adversarial Network for Probabilistic Forecast of Random Dynamical Systems, SIAM Journal on Scientific Computing 44 (4) (2022) A2150–A2175, publisher: Society for Industrial and Applied Mathematics. doi:10.1137/21M1457448.
  • [29] Z. Zhang, M. Wu, Predicting real-time locational marginal prices: A gan-based approach, IEEE Transactions on Power Systems 37 (2) (2022) 1286–1296. doi:10.1109/TPWRS.2021.3106263.
  • [30] Y. Li, Y. Ding, Y. Liu, T. Yang, P. Wang, J. Wang, W. Yao, Dense skip attention based deep learning for day-ahead electricity price forecasting, IEEE Transactions on Power Systems 38 (5) (2023) 4308–4327. doi:10.1109/TPWRS.2022.3217579.
  • [31] H. Xu, F. Hu, X. Liang, M. A. Gunmi, Attention mechanism multi-size depthwise convolutional long short-term memory neural network for forecasting real-time electricity prices, IEEE Transactions on Power Systems (2024) 1–12doi:10.1109/TPWRS.2024.3353759.
  • [32] S. Majumder, L. Dong, F. Doudi, Y. Cai, C. Tian, D. Kalathi, K. Ding, A. A. Thatte, N. Li, L. Xie, Exploring the capabilities and limitations of large language models in the electric energy sector, arXiv:2403.09125 (2024). arXiv:2403.09125.
  • [33] H. Zhou, S. Zhang, J. Peng, S. Zhang, J. Li, H. Xiong, W. Zhang, Informer: Beyond efficient transformer for long sequence time-series forecasting, Proceedings of the AAAI Conference on Artificial Intelligence 35 (12) (2021) 11106–11115. doi:10.1609/aaai.v35i12.17325.
  • [34] S. Liu, H. Yu, C. Liao, J. Li, W. Lin, A. X. Liu, S. Dustdar, Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting, 2022.
  • [35] N. Wiener, Nonlinear Problems in Random Theory, Technology Press of Massachusetts Institute of Technology, Cambridge, MA, 1958.
  • [36] M. Rosenblatt, Stationary Processes as Shifts of Functions of Independent Random Variables, Journal of Mathematics and Mechanics 8 (5) (1959) 665–681.
  • [37] X. Wang, L. Tong, Q. Zhao, Generative probabilistic time series forecasting and applications in grid operations, to appear in the Proceedings of Conference on Information Sciences and Systems (2024).
    URL https://arxiv.longhoe.net/abs/2402.13870
  • [38] X. Wang, L. Tong, Innovations Autoencoder and its Application in One-class Anomalous Sequence Detection, Journal of Machine Learning Research 23 (49) (2022) 1–27.
  • [39] M. Arjovsky, S. Chintala, L.Bottou, Wasserstein GAN, arXiv:1701.07875 (Jan. 2017).
  • [40] P. J. Bickel, K. A. Doksum, Mathematical statistics: basic ideas and selected topics. (2nd ed.), Vol. 1, Pearson Prentice Hall, Upper Saddle River, N.J., 2007.
  • [41] M. White, R. Pike, C. Brown, R. Coutu, B. Ewing, S. Johnson, C. Mendrala, White paper: Inter-regional interchange scheduling analysis and options, Tech. rep., ISO New England and New York ISO (January 2011).
  • [42] T. Gneiting, M. Katzfuss, Probabilistic forecasting, Annual Review of Statistics and Its Application 1 (1) (2014) 125–151. arXiv:https://doi.org/10.1146/annurev-statistics-062713-085831, doi:10.1146/annurev-statistics-062713-085831.
  • [43] E. Tómasson, M. R. Hesamzadeh, F. A. Wolak, Optimal offer-bid strategy of an energy storage portfolio: A linear quasi-relaxation approach, Applied Energy 260 (2020) 114251. doi:https://doi.org/10.1016/j.apenergy.2019.114251.
  • [44] NERC, Balancing and frequency control, Tech. rep., NERC Resource Subcommittee, Priceton,NJ (January 2011).
    URL https://www.nerc.com/comm/OC/BAL0031_Supporting_Documents_2017_DL/NERC%20Balancing%20and%20Frequency%20Control%20040520111.pdf
  • [45] M. A. Montemurro, P. A. Pury, Long-range fractal correlations in literary corpora, Fractals 10 (4) (2002) 451–461.
  • [46] J. Bhan, S. Kim, J. Kim, Y. Kwon, S. il Yang, K. Lee, Long-range correlations in korean literary corpora, Chaos, Solitons & Fractals 29 (1) (2006) 69–81. doi:https://doi.org/10.1016/j.chaos.2005.08.214.
  • [47] X. Wang, L. Tong, Innovations autoencoder and its application in one-class anomalous sequence detection, J. Mach. Learn. Res. 23 (1) (Jan 2022).
  • [48] K. R. Mestav, X. Wang, L. Tong, A deep learning approach to anomaly sequence detection for high-resolution monitoring of power systems, IEEE Transactions on Power Systems 38 (1) (2023) 4–13. doi:10.1109/TPWRS.2022.3168529.
  • [49] L. Tong, X. Wang, Q. Zhao, Grid monitoring and protection with continuous point-on-wave measurements and generative ai, arXiv:2403.06942 (2024). arXiv:2403.06942.
    URL https://arxiv.longhoe.net/abs/2403.06942
  • [50] J. Hoffmann-Jørgensen, Stochastic Processes on Polish Spaces, Aarhus Universitet, Matematisk Institut., Aarhus, Denmark, 1991.
  • [51] C. Villani, The Wasserstein distances, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 57–75. doi:10.1007/978-3-540-71050-9_6.
  • [52] P. F. Christoffersen, Evaluating Interval Forecasts, International Economic Review 39 (4) (1998) 841–862. doi:10.2307/2527341.