Stock Volume Forecasting with Advanced Information by Conditional Variational Auto-Encoder

Parley R Yang, Faculty of Mathematics, University of Cambridge, Cambridge, UK, and Alexander Y Shestopaloff, School of Mathematical Sciences, Queen Mary University of London, London, UK
Abstract.

We demonstrate the use of a Conditional Variational Auto-Encoder (CVAE) to improve the forecasts of daily stock volume time series in both short and long term forecasting tasks, using advanced information on input variables such as rebalancing dates. The CVAE generates non-linear time series as out-of-sample forecasts, which have better accuracy and a closer fit of correlation to the actual data, compared to traditional linear models. These generative forecasts can also be used for scenario generation, which aids interpretation. We further discuss correlations in non-stationary time series and other potential extensions from the CVAE forecasts.


1. Introduction

1.1. Motivation

Let $Y_{t+k|t}$ be the forecast of variable $Y$ at time $t+k$, given information available up to time $t$. For instance, in the case of daily stock data where $Y_t$ indicates the end-of-day stock price on day $t$, $Y_{t+1|t}$ would be the forecasted stock price on day $t+1$ based on information up to time $t$.

Advanced information concerns the notion of information available up to time $t$. In linear time series, such information is mainly the past lags of the data (autoregressive terms) and error terms, together with other features observed up to time $t$. Advanced information acknowledges that the future state of some variables is already known at time $t$. For instance, there is an explicit rule for when Stoxx index rebalancing occurs (STOXX, 2024), and this information may concern the future state at $t+k$ but is known at time $t$. If we write $RB_t$ as the indicator of whether day $t$ is a rebalancing day for the Stoxx index, then $RB_{t+k}$ is known at time $t$ for all $k$. It can therefore be helpful to utilise such advanced information at time $t$ to construct a forecast of $Y$ at time $t+k$, especially when $k$ is large. However, how to incorporate such advanced information into the time series model becomes a crucial question, in light of the potentially non-linear effect this information has on $Y_t$. In this paper, we use a Conditional Variational Auto-Encoder (CVAE) to account for this in a non-linear modelling and forecasting setting.

The notion of long term forecasting concerns $k$ being large, for instance, the stock price two weeks (ten trading days) later, $Y_{t+10|t}$. In the analysis of linear stationary time series (e.g. ARMA and VARMA), the long term forecasts converge to the unconditional expectation due to the assumption of stationarity. (Technically, the conditional forecast $Y_{t+k|t} = \mathbb{E}[Y_{t+k} | \mathscr{F}_t] \to \mathbb{E}[Y]$ as $k \to \infty$, where $\mathscr{F}_t$ is the set of information up to time $t$, and $\mathbb{E}[Y_t] = \mathbb{E}[Y] \ \forall t$ due to the stationarity assumption.) However, in the case of non-stationary time series, such convergence is not guaranteed, and the long term behaviour can be of interest.
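As a minimal illustration of this convergence, the sketch below uses a stationary AR(1) model $Y_t = c + \phi Y_{t-1} + \varepsilon_t$ with $|\phi| < 1$. The model and the parameter values are assumptions chosen only for illustration and are not taken from this paper. For such a model the $k$-step conditional mean is $\mu + \phi^k (y_t - \mu)$ with $\mu = c/(1-\phi)$, so the forecast decays geometrically towards the unconditional expectation.

```python
def ar1_forecast(y_t: float, c: float, phi: float, k: int) -> float:
    """k-step-ahead conditional mean of a stationary AR(1) model.

    Model (illustrative assumption): Y_t = c + phi * Y_{t-1} + eps_t, |phi| < 1.
    """
    if abs(phi) >= 1.0:
        raise ValueError("stationarity requires |phi| < 1")
    mu = c / (1.0 - phi)  # unconditional expectation E[Y]
    return mu + phi ** k * (y_t - mu)


if __name__ == "__main__":
    # Hypothetical values: the forecast approaches mu = 5 as k grows.
    for k in (1, 10, 100):
        print(k, ar1_forecast(y_t=10.0, c=1.0, phi=0.8, k=k))
```

The same mechanism is why a linear stationary model carries little information about a distant horizon beyond the unconditional mean.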

Empirically, in the case of stock price, it may be convenient to set $Y_{t+10|t}$ or even $Y_{t+100|t}$ to some trend-stationary expectation or equivalent, as the difference of stock prices (either $Y_t - Y_{t-1}$ or $\log(Y_t) - \log(Y_{t-1})$) is often modelled as a stationary time series. In the case of stock volume, however, it is not as convenient to assert such a stationary expectation, as stock volumes are often affected by extraordinary shocks, e.g. company announcements and index rebalancing. It is therefore a harder task, both in modelling and forecasting, to provide analysis for stock volume time series.

Furthermore, forecasting stock volume in a non-linear fashion helps the pricing of various financial derivatives, such as stock buyback contracts, which heavily depend on market volumes for periods covered by such contracts.

1.2. Literature Review

There is some recent literature on time series using neural networks, such as Time Series Generative Adversarial Networks (Yoon et al., 2019) and generative time series with a bi-directional Variational Auto-Encoder (VAE) (Li et al., 2022). We take a similar approach in designing the latent space and building Gaussian assumptions conditional on the latent space. However, we take into account the advanced information and utilise it to improve the quality of our forecast, with the incorporation of latent variables. We also use the CVAE to enable scenario generation and interpretation. To this end, we engage with technical papers on CVAE (Doersch, 2016; Sohn et al., 2015) for modelling and derivations under a time series setting.

In terms of the literature on advanced information, there is no direct reference to this terminology in time series; however, we find similar concepts in Bayesian time series (Berliner, 1996; Tsay and Chen, 2018), where one updates the state equations based on some signals. Such signals may be learnt or stated in advance for the model. Advanced information may be considered as a set of variables whose values are known in advance and need not be learnt for the duration of the present and future forecasting period.

The empirical need for stock volume forecasting is motivated in the general financial machine learning literature (De Prado, 2018), though most of the recent literature focuses on microstructure statistics, such as in the case of the limit order book (Cont et al., 2013). The role of stock volume time series in buyback contract pricing has been mentioned in recent literature (Guéant et al., 2020; Hamdouche et al., 2022), which introduces practical forecasting problems that we aim to contribute to empirically.

1.3. Contributions

Our contributions are threefold. Firstly, we identify the class of problems of forecasting with advanced information, which can be considered as an expectation computation based on a richer information set. This is practically modelled with non-linear interactions in a CVAE architecture of neural networks, which additionally enables generative forecasting. This is elaborated in section 2.

Secondly, we demonstrate the capability for event-driven interpretations and alternative scenario generation. This utilises the generator aspect of the CVAE to answer questions such as what happens on a special occasion and what the alternative scenarios are. By generating the forecast paths under different conditional values, we are able to answer these questions and hence provide interpretation of the model. These are provided in addition to traditional model evaluation metrics such as Mean Squared Error (MSE) and the correlation matrix, which are detailed in section 3.

Lastly, we contribute to the empirical literature on daily stock volume forecasting, which we study across the constituents of the EURO STOXX 50 index, a set of 50 high market-capitalisation stocks listed in the Eurozone. This addresses the demand for long term forecasting in empirical finance, such as for buyback contracts.

1.4. Notations

Most of the notation is explained when first introduced. Common notation is as follows. $N(\cdot,\cdot)$ indicates the Gaussian distribution and $I_q$ indicates a $q$-dimensional identity matrix. Upper case letters usually denote random variables, whereas lower case letters usually denote data observed at specific values. $\mathbb{E}[\cdot]$ denotes the expectation operator, $V(\cdot)$ denotes the variance operator, $Cov(\cdot,\cdot)$ the covariance operator, and $Corr(\cdot,\cdot)$ the correlation operator. Conditional expectation $\mathbb{E}_{X\sim P}[Q(X)]$ refers to the expectation of $Q(X)$ conditional on $X\sim P$.

2. Methodology: From Non-Linear Modelling To Generative Algorithm For Forecasting

2.1. Time series in a CVAE context

We first give an overview of the modelling assumptions of CVAE, and put time series in such a context.

Let $X, Y$ be random variables with $Y\in\mathbb{R}^d$ and $X\in\mathbb{R}^p$. Let $Z\in\mathbb{R}^q$ be a latent variable with distribution $Z\sim N(0,I_q)$. Let $D$ be the dataset consisting of paired observations of $(X,Y)$. The key assumptions are: the conditional distribution of the output given the input and the latent variable is Gaussian with a non-linear and unknown mean, written as

(1) $Y|X,Z \sim N(f(X,Z), \sigma^2 I_d)$

for some unknown function $f:\mathbb{R}^p\times\mathbb{R}^q\to\mathbb{R}^d$, and the conditional distribution of the latent variable given the observed data $(X,Y)$ is Gaussian, written as

(2) $Z|X,Y \sim N(\mu(X,Y), \Sigma(X,Y))$

A CVAE is a tuple $(\hat{f}^{en},\hat{f}^{de})$, where the decoder $\hat{f}^{de}:\mathbb{R}^p\times\mathbb{R}^q\to\mathbb{R}^d$ induces a probability distribution $\hat{P}(Y|X,Z)$ according to Equation 1, and the encoder $\hat{f}^{en}:\mathbb{R}^p\times\mathbb{R}^d\to\mathbb{R}^q\times(0,\infty)^q$ induces the moments of the Gaussian distribution $Z|X,Y$ according to Equation 2, under the assumption that $\Sigma(X,Y)$ is a diagonal matrix with positive diagonal entries.

Further to this, the time series $y_t, x_t$ are seen as observations of the random variables $Y, X$ at time $t$, and by deliberate modelling, we aim to train a CVAE $(\hat{f}^{en},\hat{f}^{de})$ that enables the forecast of future states $y_{t+k}$ conditional on $x_{t+1}$. For example, let $x_t = y_{t-1}$, then $y_{t+1|t} = \mathbb{E}_{Z\sim N(0,I_q), X=y_t}[N(\hat{f}^{de}(X,Z),\sigma^2 I_d)] = \mathbb{E}_{Z\sim N(0,I_q)}[N(\hat{f}^{de}(y_t,Z),\sigma^2 I_d)] = \mathbb{E}_{Z\sim N(0,I_q)}[\hat{f}^{de}(y_t,Z)]$. Iteratively for $k\geq 2$, we have $y_{t+k|t} = \mathbb{E}_{Z\sim N(0,I_q)}[\hat{f}^{de}(y_{t+k-1|t},Z)]$.

The training of a CVAE is done by using two neural network architectures $F_1, F_2$ to define the functions $f^{en}\in F_1, f^{de}\in F_2$, followed by gradient methods and other classical optimisation techniques to maximise $\mathbb{E}_{X,Y\sim D}[\log(P(Y|X))]$ through conditional marginalisation over $Z$. Derivations and other technical remarks are given in appendix A.1.
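For orientation, the sketch below evaluates a single-draw estimate of the standard variational lower bound used for CVAEs in the cited literature (Doersch, 2016; Sohn et al., 2015), namely $\log P(Y|X) \geq \mathbb{E}_{Z\sim N(\mu,\Sigma)}[\log P(Y|X,Z)] - KL(N(\mu,\Sigma)\,\|\,N(0,I_q))$, under Equations 1 and 2 with a diagonal $\Sigma$. It is an assumption that this matches the derivation in appendix A.1, which may differ in detail. The names `elbo_single_draw`, `toy_encoder` and `toy_decoder` are hypothetical, and the toy functions merely stand in for trained networks.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[Sequence[float], Sequence[float]], Tuple[Vector, Vector]]
Decoder = Callable[[Sequence[float], Sequence[float]], Vector]


def kl_to_standard_normal(mu: Sequence[float], var: Sequence[float]) -> float:
    """Closed-form KL( N(mu, diag(var)) || N(0, I_q) )."""
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mu, var))


def gaussian_log_density(y: Sequence[float], mean: Sequence[float], sigma2: float) -> float:
    """log N(y; mean, sigma2 * I_d), the decoder likelihood of Equation 1."""
    d = len(y)
    sq_err = sum((a - b) ** 2 for a, b in zip(y, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * sigma2) + sq_err / sigma2)


def elbo_single_draw(x: Sequence[float], y: Sequence[float], encoder: Encoder,
                     decoder: Decoder, sigma2: float, rng: random.Random) -> float:
    """One-sample estimate of the lower bound on log P(y | x)."""
    mu, var = encoder(x, y)  # moments of Z | X, Y as in Equation 2
    # Reparameterised draw: z = mu + sqrt(var) * eps with eps ~ N(0, I_q).
    z = [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]
    return gaussian_log_density(y, decoder(x, z), sigma2) - kl_to_standard_normal(mu, var)


def toy_encoder(x: Sequence[float], y: Sequence[float]) -> Tuple[Vector, Vector]:
    # Hypothetical stand-in for a trained encoder with p = d = q = 1.
    return [0.5 * (y[0] - x[0])], [0.5]


def toy_decoder(x: Sequence[float], z: Sequence[float]) -> Vector:
    # Hypothetical stand-in for a trained decoder with p = d = q = 1.
    return [x[0] + z[0]]


if __name__ == "__main__":
    rng = random.Random(0)
    print(elbo_single_draw([1.0], [1.2], toy_encoder, toy_decoder, 0.1, rng))
```

In training, the average of this quantity over the dataset $D$ would be maximised with respect to the network parameters by a gradient method.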

2.2. Generative scheme from CVAE

For a given decoder $\hat{f}^{de}$ and conditional variable $x_t$, the distribution $\mathbb{E}_{Z\sim N(0,I_q), X=x_t}[N(\hat{f}^{de}(X,Z),\sigma^2 I_d)]$ can be approximated by the generative scheme of $S$ samples

$\forall s\in[S]$, draw $z_s\sim N(0,I_q)$,
(3) then draw $y_t^s\sim N(\hat{f}^{de}(x_t,z_s),\sigma^2 I_d)$

Now, through Equation 3, we may compute the approximated expectation by taking the average across the samples $\{y_t^s\}_{s\in[S]}$.
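The generative scheme of Equation 3 and the sample average can be sketched as follows. This is a minimal illustration in plain Python: the function names and the toy decoder are hypothetical, and a trained decoder network would take the place of `toy_decoder`.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]
Decoder = Callable[[Sequence[float], Sequence[float]], Vector]


def generate_samples(decoder: Decoder, x_t: Sequence[float], q: int, sigma: float,
                     n_samples: int, rng: random.Random) -> List[Vector]:
    """Draw S = n_samples values y_t^s following Equation 3."""
    samples = []
    for _ in range(n_samples):
        z_s = [rng.gauss(0.0, 1.0) for _ in range(q)]   # z_s ~ N(0, I_q)
        mean = decoder(x_t, z_s)                        # decoder mean for (x_t, z_s)
        # y_t^s ~ N(mean, sigma^2 I_d), drawn coordinate by coordinate.
        samples.append([m + sigma * rng.gauss(0.0, 1.0) for m in mean])
    return samples


def sample_average(samples: List[Vector]) -> Vector:
    """Arithmetic average across samples, approximating the expectation."""
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(len(samples[0]))]


def toy_decoder(x: Sequence[float], z: Sequence[float]) -> Vector:
    # Hypothetical non-linear decoder with p = d = q = 1.
    return [math.tanh(x[0]) + 0.5 * z[0]]


if __name__ == "__main__":
    rng = random.Random(42)
    draws = generate_samples(toy_decoder, [0.3], q=1, sigma=0.1, n_samples=1000, rng=rng)
    print(sample_average(draws))
```

Keeping the individual draws, rather than only their average, is what allows the samples to be reused later for scenario generation.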

Putting this scheme in a forecasting regime, we may generate forecasts $\{y_{t+1|t}^s\}_{s\in[S]}$ and thereafter $\{y_{t+k|t}^s\}_{s\in[S]}$ for general $k$. For a given forecast horizon $K$, we call the vector $y^s_{t+\cdot|t} := (y_{t+1|t}^s, y_{t+2|t}^s, \ldots, y_{t+K|t}^s)$ a forecast path, and define the average forecast path $\bar{y}_{t+\cdot|t} := (\bar{y}_{t+1|t}, \bar{y}_{t+2|t}, \ldots, \bar{y}_{t+K|t})$, where each entry $\bar{y}_{t+k|t}$ is the arithmetic average over the generated samples $\{y_{t+k|t}^s\}_{s\in[S]}$.
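A minimal sketch of forecast paths and the average forecast path is given below for the scalar autoregressive example $x_t = y_{t-1}$ with $p = d = q = 1$. Two points are assumptions of this sketch rather than statements of the method: each path is propagated by feeding its own previous draw back into the decoder, which is only one possible reading of the iteration, and the toy decoder is a hypothetical stand-in for a trained one.

```python
import math
import random
from typing import Callable, List

ScalarDecoder = Callable[[float, float], float]  # (x, z) -> mean of Y


def forecast_paths(decoder: ScalarDecoder, y_t: float, horizon: int,
                   n_samples: int, sigma: float, rng: random.Random) -> List[List[float]]:
    """Generate S = n_samples forecast paths (y_{t+1|t}^s, ..., y_{t+K|t}^s)."""
    paths = []
    for _ in range(n_samples):
        path, x = [], y_t  # x_{t+1} = y_t is the last observed value
        for _ in range(horizon):
            z = rng.gauss(0.0, 1.0)                             # z ~ N(0, 1)
            y_next = decoder(x, z) + sigma * rng.gauss(0.0, 1.0)
            path.append(y_next)
            x = y_next  # assumption: the path conditions on its own previous draw
        paths.append(path)
    return paths


def average_path(paths: List[List[float]]) -> List[float]:
    """Entry-wise arithmetic average over the generated paths."""
    n = len(paths)
    return [sum(p[k] for p in paths) / n for k in range(len(paths[0]))]


def toy_decoder(x: float, z: float) -> float:
    # Hypothetical non-linear decoder standing in for a trained network.
    return 0.9 * math.tanh(x) + 0.2 * z


if __name__ == "__main__":
    rng = random.Random(7)
    sims = forecast_paths(toy_decoder, y_t=1.5, horizon=10, n_samples=500, sigma=0.05, rng=rng)
    print(average_path(sims))
```

An alternative would be to feed the sample average $\bar{y}_{t+k-1|t}$ into every path at each step; the choice affects the dispersion of the generated paths.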

2.3. A special forecasting scenario: advanced information

In this section, we formally introduce, and give examples of, the notion of advanced information.

Given $K\geq 1$, we are interested in forecasting the random variables $Y_{t+k}, k\in[K]$, given information available up to time $t$. Let $X_t = (X_t^0, X_t^1)$, where $X_{t+k+1}^0$ is always known at time $t$ for $k\in[K]$, but only up to $X_{t+1}^1$ is known at time $t$, not $X_{t+2}^1$ or anything further. We say $X^0$ is advanced information and $X^1$ is ordinary information. Intuitively, $X^1$ is the part of the information that we only know up to the time we are requested to forecast, as is commonly assumed in classical linear time series models, whereas $X^0$ is the nuanced part where we know some information ahead of time.
Then, we consider the filtration $\mathscr{F}^*_t := \mathscr{F}^0_{t+K}\times\mathscr{F}^1_t$, where $\mathscr{F}^0_{t+K}$ is the filtration generated by $X^0_{t+K}$ and $\mathscr{F}^1_t$ is the filtration generated by $X^1_{t+1}$.
Forecasting with advanced information concerns investigating the conditional distributions $\mathbb{Q}^{*k}(\cdot|\mathscr{F}^*_t)$ for $k\in[K]$, where $\mathbb{Q}^{*k}(S|\mathscr{F}^*_t)$ measures the probability of $Y_{t+k}\in S$ given $\mathscr{F}^*_t$. The expected forecast takes the form $Y_{t+k|t} := \int Y_{t+k}\, d\mathbb{Q}^{*k}(Y_{t+k}|\mathscr{F}^*_t)$.

To give some contextual remarks, $X^0$ can often be considered as a property of the time series, such as the category of a variable. Say $Y\in\mathbb{R}$ and some of the data belong to group 1 whereas the others belong to group 2, then $X^0\in\{0,1\}^2$ can be a one-hot indicator which outputs $(1,0)$ for group 1 and $(0,1)$ for group 2. In this case, when forecasting, $X^0$ is always known ahead of time (with effectively infinitely large $K$). This means many panel data models fall into the scenario of advanced information, as the category of the response can be considered a known property, and hence advanced information. Likewise for seasonality, where such a category can be considered as information exploitable ahead of the forecasting time.

Another example is rebalancing, as motivated in section 1.1: rebalancing dates are known ahead of the desired forecasting time, so $X^0$ may be an indicator of the rebalancing date, which outputs $1$ if the date is a rebalancing date, and $0$ otherwise.
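The rebalancing indicator can be sketched as below: given a calendar of rebalancing dates known at time $t$, the values $x^0_{t+1}, \ldots, x^0_{t+K}$ are computed before any forecast of $Y$ is made. The dates used are hypothetical placeholders and not actual STOXX rebalancing dates, and treating every weekday as a trading day (ignoring exchange holidays) is a simplifying assumption.

```python
from datetime import date, timedelta
from typing import List, Set


def future_trading_days(start: date, horizon: int) -> List[date]:
    """Next `horizon` weekdays strictly after `start` (holidays ignored)."""
    days, current = [], start
    while len(days) < horizon:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday = 0, ..., Friday = 4
            days.append(current)
    return days


def rebalancing_indicator(days: List[date], rebalancing_dates: Set[date]) -> List[int]:
    """x^0 for each future day: 1 on a rebalancing date, 0 otherwise."""
    return [1 if d in rebalancing_dates else 0 for d in days]


if __name__ == "__main__":
    t = date(2024, 1, 1)                 # hypothetical forecast origin (a Monday)
    known_calendar = {date(2024, 1, 5)}  # hypothetical rebalancing date
    horizon_days = future_trading_days(t, horizon=5)
    print(rebalancing_indicator(horizon_days, known_calendar))
```

The resulting vector is fixed at time $t$ for the whole horizon, in contrast to ordinary information, which has to be forecasted step by step.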

As a remark, $X^0$ may also be a special case for cointegration, as a source of drift. This may fall into the wider literature on 'common trend removal' in cointegration analysis.

2.4. Forecasting algorithms using advanced information and CVAE

Algorithm 1 Iterative Forecasting with Advanced Information (General)

Input: $t$ (time the forecast is requested), $S$ (number of samples), $\hat{f}^{de}$ (a trained decoder), $K$ (desired forecasting horizon), $\{x_\tau^0\}_{\tau=t+1}^{t+K}$ (advanced information), $x_{t+1}^1$ (ordinary information).

Output: Simulated forecast paths $\{(k, y_{t+k|t}^s) : k\in[K]\}_{s\in[S]}$

1:  Draw $S$ samples from the distribution $\mathbb{E}_{Z\sim N(0,I_q), X^0=x^0_{t+1}, X^1=x^1_{t+1}}[N(\hat{f}^{de}(X^0,X^1,Z),\sigma^2 I_d)]$ according to Equation 3. These samples are denoted as $y_{t+1|t}^s$ for $s\in[S]$.
2:  for $\tau\in\{2,3,\ldots,K\}$ do
3:     Update $x^1_{t+\tau|t}$ with $\{y_{t+\tau-1|t}^s : s\in[S]\}$
4:     Draw $S$ samples from the distribution $\mathbb{E}_{Z\sim N(0,I_q), X^0=x^0_{t+\tau}, X^1=x^1_{t+\tau|t}}[N(\hat{f}^{de}(X^0,X^1,Z),\sigma^2 I_d)]$ according to Equation 3. These samples are denoted as $y_{t+\tau|t}^s$ for $s\in[S]$.
5:  end for
6:  return  {(k,yt+k|ts):k[K]}s[S]subscriptconditional-set𝑘superscriptsubscript𝑦𝑡conditional𝑘𝑡𝑠𝑘delimited-[]𝐾𝑠delimited-[]𝑆\{(k,y_{t+k|t}^{s}):k\in[K]\}_{s\in[S]}{ ( italic_k , italic_y start_POSTSUBSCRIPT italic_t + italic_k | italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ) : italic_k ∈ [ italic_K ] } start_POSTSUBSCRIPT italic_s ∈ [ italic_S ] end_POSTSUBSCRIPT

In Algorithm 1, we present the iterative forecasting algorithm with advanced information, under the setting where the decoder $f^{de}$ takes $X^{0},X^{1},Z$ as input. The algorithm uses Equation 3 for the CVAE generation and serves as a practical procedure for forecasting with advanced information.

Algorithm 2 presents a special case for AR(1)-type forecasting (the case where $X^{1}_{t}=Y_{t-1}$), with a more specific approach to handling $X^{1}$, as would be done in classical linear time series models. This is also the exact algorithm used in the empirical application in the next section.

Algorithm 2 Iterative Forecasting with Advanced Information and 1-lag Autoregressive Ordinary Information

Input: $t$, $S$, $\hat{f}^{de}$, $K$, $\{x_{\tau}^{0}\}_{\tau=t+1}^{t+K}$, $x_{t+1}^{1}$, and $y_{t}$ (last observation of $Y$ at the time of forecast)

Output: $\{(k,y_{t+k|t}^{s}):k\in[K]\}_{s\in[S]}$

1:  Draw $S$ samples from the distribution $\mathbb{E}_{Z\sim N(0,I_{q}),\,X^{0}=x^{0}_{t+1},\,X^{1}=y_{t}}[N(\hat{f}^{de}(X^{0},X^{1},Z),\sigma^{2}I_{d})]$ according to Equation 3. These samples are denoted as $y_{t+1|t}^{s}$ for $s\in[S]$.
2:  for $\tau\in\{2,3,\dots,K\}$ do
3:     Average $\hat{y}_{t+\tau-1|t}=\frac{\sum_{s\in[S]}y_{t+\tau-1|t}^{s}}{S}$
4:     Draw $S$ samples from the distribution $\mathbb{E}_{Z\sim N(0,I_{q}),\,X^{0}=x^{0}_{t+\tau},\,X^{1}=\hat{y}_{t+\tau-1|t}}[N(\hat{f}^{de}(X^{0},X^{1},Z),\sigma^{2}I_{d})]$ according to Equation 3. These samples are denoted as $y_{t+\tau|t}^{s}$ for $s\in[S]$.
5:  end for
6:  return  $\{(k,y_{t+k|t}^{s}):k\in[K]\}_{s\in[S]}$
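
The steps of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `decoder` is a hypothetical callable standing in for the trained $\hat{f}^{de}$ and is assumed to return the conditional mean, while the function name, the argument names and the NumPy-based sampling are choices made only for this sketch.

```python
import numpy as np


def iterative_forecast_ar1(decoder, x0_future, y_last, S, sigma, q=1, rng=None):
    """Sketch of Algorithm 2 (hypothetical interface).

    decoder(x0, x1, z) -> mean of Y given advanced info x0, lagged value x1
    and latent draw z. x0_future holds x^0_{t+1}, ..., x^0_{t+K}.
    Returns an array of shape (S, K, d) of simulated forecast paths.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(x0_future)
    x1 = np.atleast_1d(np.asarray(y_last, dtype=float))  # step 1 uses X^1 = y_t
    d = x1.shape[0]
    paths = np.empty((S, K, d))
    for k in range(K):
        for s in range(S):
            z = rng.standard_normal(q)  # Z ~ N(0, I_q)
            mean = np.asarray(decoder(x0_future[k], x1, z), dtype=float)
            # One draw from N(mean, sigma^2 I_d)
            paths[s, k] = mean + sigma * rng.standard_normal(d)
        # Step 3: the next lagged input is the average of the S samples
        x1 = paths[:, k].mean(axis=0)
    return paths
```

With a toy linear decoder such as `lambda x0, x1, z: 0.5 * x1 + x0` and `sigma=0.0`, every simulated path collapses to the deterministic AR(1) recursion, which is a convenient sanity check before plugging in a trained network.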

3. Empirical Application: Daily Stock Volume Forecasting

In this section, we model and forecast daily stock volume for 50 European stocks which were components of Euro Stoxx 50 as of the end of year 2023.

3.1. Data Availability and Processing

Daily stock volume data were obtained from Yahoo Finance. The training period runs from the start of 2021 to the end of 2022, and the testing period from the start of 2023 to the end of June 2023. We use the training data to normalise the time series: for each stock, we estimate the mean and standard deviation over the training period, then de-mean and scale to unit variance (Footnote 2: the mapping $Y\mapsto\frac{Y-\hat{\mu}}{\hat{\sigma}}$, where $\hat{\mu}$ and $\hat{\sigma}$ are, respectively, the mean and standard deviation estimated from the training period), as is standard in machine learning data processing.
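
The normalisation step can be sketched as below. The sketch assumes volumes are stored as arrays of shape (days, stocks) and uses the population standard deviation (NumPy's default); the function name and array layout are assumptions made for illustration, not details given in the paper.

```python
import numpy as np


def normalise_with_training_stats(train, test):
    """Standardise each stock using statistics from the training period only."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    mu = train.mean(axis=0)    # per-stock training mean
    sigma = train.std(axis=0)  # per-stock training standard deviation
    # The same (mu, sigma) are applied to the test period, so no test
    # information leaks into the normalisation.
    return (train - mu) / sigma, (test - mu) / sigma, mu, sigma
```

Because the test period reuses the training statistics, the normalised test series need not have zero mean or unit variance.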

Figure 1. Illustrations of raw data and processed data for training and testing

In addition to the volume data, we obtain STOXX rebalancing dates according to the STOXX index guide, and individual stock categories (Location and Sector) from the index-tracking ETF, EUE. These are treated as advanced information in modelling and forecasting, as they are either known ahead of time or believed to be unchanged throughout the forecasting horizon.

As an illustration of some of these data, in Figure 1 we plot two of the stocks (ASML.AS and BNP.PA) in both raw and processed form. The processed data rows (the second row for training and the third row for testing) also contain vertical red lines highlighting the rebalancing dates. It is visually clear that rebalancing dates tend to coincide with higher-than-average volume; this is something we would wish our model to capture, both in in-sample modelling and out-of-sample forecasting.

We denote by the scalar $y(i)_{t}$ the normalised observed stock volume on day $t$ for stock $i\in[N]$, where $N=50$, and by the vector $\bm{y}_{t}\in\mathbb{R}^{50}$ the observed stock volumes on day $t$ across all stocks. Where there are missing observations in a day, we drop such an observation.

As for features, $RB_{t}\in\{0,1\}^{3}$ denotes a three-dimensional one-hot encoding of the day relative to the rebalancing day: $RB_{t}=(1,0,0)$ indicates the last observation before the rebalancing day, $RB_{t}=(0,1,0)$ indicates the rebalancing day, and $RB_{t}=(0,0,1)$ indicates the next observation after the rebalancing day; $RB_{t}=(0,0,0)$ for all other days. The feature $DoW_{t}\in\{0,1\}^{5}$ denotes the day of the week of the observation. Another set of features are $Sector(i)_{t}\in\{0,1\}^{10}$ and $Location(i)_{t}\in\{0,1\}^{7}$, which are one-hot encodings of the sector and the location of stock $i$ respectively. They are invariant over time and are hence considered advanced information.
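
The construction of $RB_{t}$ and $DoW_{t}$ can be sketched as follows. The function names are hypothetical, and the sketch assumes that the trading days are given in chronological order as `datetime.date` objects, that they fall on Monday to Friday, and that rebalancing days are far enough apart that no day receives more than one flag.

```python
from datetime import date

import numpy as np


def rebalancing_features(trading_days, rebalancing_days):
    """RB_t: (1,0,0) day before, (0,1,0) rebalancing day, (0,0,1) day after."""
    n = len(trading_days)
    rb = np.zeros((n, 3), dtype=int)
    for idx, day in enumerate(trading_days):
        if day in rebalancing_days:
            rb[idx, 1] = 1
            if idx > 0:
                rb[idx - 1, 0] = 1  # last observation before rebalancing
            if idx + 1 < n:
                rb[idx + 1, 2] = 1  # next observation after rebalancing
    return rb


def day_of_week_features(trading_days):
    """DoW_t: one-hot over Monday (index 0) to Friday (index 4)."""
    dow = np.zeros((len(trading_days), 5), dtype=int)
    for idx, day in enumerate(trading_days):
        dow[idx, day.weekday()] = 1
    return dow
```

"Before" and "after" are taken relative to the observed trading days rather than calendar days, which matches the wording "last observation before" and "next observation after" above.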

3.2. Forecasting Tasks

We would like to accomplish two forecasting tasks through our models.

The long term forecasting task concerns the scenario where we are given a large $K$ and a fixed $t$ (the end of the training sample), and we seek to forecast $y(i)_{t+k|t}$ for $k\in[K]$, $i\in[N]$. In this dataset, we have $K=120$, representing the first half of 2023 (business days, after removal of days with empty observations due to holidays or data source errors).

The short term rolling forecast concerns week-long forecasts over multiple periods, hence $K\leq 5$ ($K=5$ is the usual case, where the forecast runs from Monday to Friday; in some cases, such as weeks with a public holiday, it reduces to $K<5$). Given a set of time points $t_{1},\dots,t_{u}$ indicating the end of the current week, and some $k_{1},\dots,k_{u}\leq 5$ indicating the duration of the forecasts from the start to the end of the upcoming week, we seek to forecast $y(i)_{t+k^{\prime}|t}$ for $k^{\prime}\in[k]$, $(t,k)\in\{(t_{1},k_{1}),\dots,(t_{u},k_{u})\}$, $i\in[N]$.

3.3. Summary of Models

Recall the setting of CVAE as per Equation 1 and Equation 2. We model a univariate CVAE (U-CVAE) on each individual $y(i)_{t}$ and a multivariate CVAE (M-CVAE) on the vector $\bm{y}_{t}$. We keep $q=1$ for simplicity.

For the U-CVAE, we have $Y(i)_{t}\in\mathbb{R}$ and model

$$Y(i)_{t}\mid X(i)_{t},Z\sim N(f(X(i)_{t},Z),\sigma^{2}), \tag{4}$$
$$X(i)_{t}=(X(i)^{0}_{t},X(i)^{1}_{t}),\ \ Z\sim N(0,1) \tag{5}$$

with $X(i)^{0}_{t}=(Sector(i)_{t},Location(i)_{t},DoW_{t},RB_{t})$ and $X(i)^{1}_{t}=Y_{t-1}(i)$.

For the M-CVAE, we have $\bm{Y}_{t}\in\mathbb{R}^{50}$ and model

$$\bm{Y}_{t}\mid X_{t},Z\sim N(f(X_{t},Z),\sigma^{2}I_{50}), \tag{6}$$
$$X_{t}=(X_{t}^{0},X_{t}^{1}),\ \ Z\sim N(0,1) \tag{7}$$

with $X_{t}^{0}=(DoW_{t},RB_{t})$ and $X_{t}^{1}=\bm{Y}_{t-1}$.

For comparison, we also provide two baseline models: for the univariate baseline, we model and forecast $y(i)_{t}$ using ARMA(1,1); and for the multivariate baseline, we do so on $\bm{y}_{t}$ using VAR(1) (Footnote 3: VARMA(1,1) faces insufficient data due to the number of parameters almost exceeding the amount of data available).

3.4. Evaluation Metrics

Figure 2. Short Term Rolling Forecasts: M-CVAE and VAR(1) Illustrations
Figure 3. Long Term Forecasts: U-CVAE and ARMA(1,1) Illustrations

To evaluate the forecasts, we employ two core concepts in time series: Mean Squared Error (MSE) and correlations. The computation of MSE is straightforward: given forecasts $\hat{y}(i)_{t+k|t}$ generated by the models, we compare them against the actual observations $y(i)_{t+k}$ in the testing period and average the squared errors $(\hat{y}(i)_{t+k|t}-y(i)_{t+k})^{2}$ across the testing period (the precise notation depends on whether the average is taken over the short term rolling forecasts or the long term forecast) for each stock $i$, denoted $MSE(i)$. Summary statistics (mean and median) can then be obtained for $MSE(i)$.
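
As a small illustration of the per-stock MSE and its summary statistics, the following sketch assumes the point forecasts and the actual observations are aligned arrays of shape (test days, stocks); the function names and the array layout are assumptions of this sketch.

```python
import numpy as np


def mse_per_stock(forecasts, actuals):
    """MSE(i): squared forecast errors averaged over the test period, per stock."""
    forecasts = np.asarray(forecasts, dtype=float)
    actuals = np.asarray(actuals, dtype=float)
    return np.mean((forecasts - actuals) ** 2, axis=0)


def summarise(values):
    """Mean and median across stocks, as reported in the performance tables."""
    return {"mean": float(np.mean(values)), "median": float(np.median(values))}
```

For the rolling task, the weekly forecast blocks would first be stacked into one array covering the whole test period before calling `mse_per_stock`.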

The correlation matrix may also be produced for the forecasts, $\hat{\rho}_{i,j}=corr(\hat{y}(i)_{t+k|t},\hat{y}(j)_{t+k|t})$, and compared against the correlation of the data, $\rho_{i,j}=corr(y(i)_{t+k},y(j)_{t+k})$ (exact formulations are given in Section 4.1, where we further discuss the correlation of non-stationary time series). To summarise the difference, we take the average absolute difference, denoted CD for correlation difference:

$$CD(i)=\frac{1}{50}\times\sum_{j\in[50]}|\hat{\rho}_{i,j}-\rho_{i,j}|$$

Summary statistics (mean and median) can then be obtained for $CD(i)$.
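
One way to compute $CD(i)$ is sketched below, assuming the forecast path and the realised data are arrays of shape (test days, stocks) and that no series is constant over the test period (otherwise the sample correlation is undefined). The function name is hypothetical.

```python
import numpy as np


def correlation_difference(forecasts, actuals):
    """CD(i): mean over j of |rho_hat[i, j] - rho[i, j]|."""
    # rowvar=False treats each column (stock) as a variable
    rho_hat = np.corrcoef(np.asarray(forecasts, dtype=float), rowvar=False)
    rho = np.corrcoef(np.asarray(actuals, dtype=float), rowvar=False)
    return np.abs(rho_hat - rho).mean(axis=1)
```

Averaging over the $N$ columns reproduces the $\frac{1}{50}$ factor when $N=50$; the diagonal terms contribute zero since both matrices have unit diagonal.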

The cross-correlation matrix is also considered, as one may wish to compare the lagged correlations amongst the forecasts with those in the data. In particular, we produce $\hat{\rho}^{*}_{i,j}=corr(\hat{y}(i)_{t+k|t},\hat{y}(j)_{t+k+1|t})$ and compare it against its counterpart from the data, $\rho^{*}_{i,j}=corr(y(i)_{t+k},y(j)_{t+k+1})$. We then summarise the difference in the same fashion as for $CD(i)$, and denote this statistic $CCD(i)$, for cross-correlation difference.
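
The lag-one cross-correlation and the resulting $CCD(i)$ might be computed along the following lines. This is one possible reading of the definition above, using plain sample correlations on arrays of shape (test days, stocks); the function names are hypothetical and the paper's exact estimator may differ.

```python
import numpy as np


def lagged_cross_correlation(paths):
    """Entry [i, j] is corr(y(i)_{t+k}, y(j)_{t+k+1}) over the available k."""
    paths = np.asarray(paths, dtype=float)
    n = paths.shape[1]
    lead, lag = paths[:-1], paths[1:]  # day k paired with day k+1
    # Stack the 2n variables; the upper-right n-by-n block holds the
    # correlations between lead series i and lagged series j.
    full = np.corrcoef(lead, lag, rowvar=False)
    return full[:n, n:]


def cross_correlation_difference(forecasts, actuals):
    """CCD(i): mean over j of |rho*_hat[i, j] - rho*[i, j]|."""
    diff = lagged_cross_correlation(forecasts) - lagged_cross_correlation(actuals)
    return np.abs(diff).mean(axis=1)
```

Unlike the contemporaneous correlation matrix, the lagged matrix is generally not symmetric, so the row index $i$ (the leading series) matters.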

3.5. Summary of Performance

As a summary of the forecasting performance, we provide evaluation metrics in Table 1 and Table 2 for the long term forecasting task and the short term rolling forecasts respectively. Illustrations of these forecasts are plotted in Figure 2 and Figure 3, where we can see an immediate difference between the CVAE forecasts and their baseline models.

| Metric | U-CVAE | M-CVAE | ARMA(1,1) | VAR(1) |
|---|---|---|---|---|
| mean MSE | 0.887 | 0.888 | 0.923 | 0.981 |
| median MSE | 0.876 | 0.884 | 0.922 | 1.001 |
| mean CD | 0.466 | 0.391 | 0.441 | 0.492 |
| median CD | 0.435 | 0.374 | 0.421 | 0.458 |
| mean CCD | 0.106 | 0.146 | 0.755 | 3.669 |
| median CCD | 0.096 | 0.136 | 0.837 | 3.644 |

Table 1. Performance of Long Term Forecasts

| Metric | U-CVAE | M-CVAE | ARMA(1,1) | VAR(1) |
|---|---|---|---|---|
| mean MSE | 0.793 | 0.788 | 0.971 | 1.070 |
| median MSE | 0.737 | 0.761 | 0.989 | 1.118 |
| mean CD | 0.240 | 0.275 | 0.093 | 0.193 |
| median CD | 0.227 | 0.258 | 0.083 | 0.179 |
| mean CCD | 0.124 | 0.262 | 0.377 | 0.420 |
| median CCD | 0.105 | 0.271 | 0.392 | 0.421 |

Table 2. Performance of Short Term Rolling Forecasts

In summary, the CVAE forecasts do a better job in both the long term and the rolling short term forecasting tasks, out-performing in MSE (more markedly so in the short term forecasts) and fitting CCD well in all cases. The correlation matrices fit well, though the CVAE models under-perform their baseline counterparts in the short term forecasting task. Illustrations of correlation matrices are plotted in Figure 9 in the appendix.

From the illustrations, we can further appreciate this out-performance in two respects. First, and quite markedly so in the long term forecasts, the CVAE takes the advanced information into account in both modelling and forecasting, and is able to project spikes, in a non-linear time series fashion, that match some of the spikes in the actual observations. Second, in the short term forecasts, baseline models tend to rely heavily on their lagged dependent variables, which leads to over- or under-forecasting over the forecasting horizon; this is partly mitigated in the CVAE forecasts, which moderate the lagged effect through the trained parameters on advanced information and other features.

There is room for further improvement in the short term forecasts: linear baselines do well in fitting correlations, as can be seen in the lower part of Figure 9 and the CD entries in Table 2. Despite their lower accuracy, linear models still preserve the correlation structure in their forecasts, resulting in a better fit; the CVAE models, despite their better accuracy, tend to have more variability in their forecasts, and consequently tend to over-fit in correlation when compared with the actual data. In cross-correlation, however, CVAE models still significantly outperform the linear baselines.

We develop some of these interpretations further in the next section by zooming into the forecasts and discussing alternative scenarios for the feature values.

3.6. Decoder as Generator: feature interpretation and scenario generation

In this section, we address two questions which help to appreciate the value of CVAE forecasts: how does RB affect the forecasts? And what is the effect of the lagged dependent variable (similar to impulse response function (IRF) analysis in linear time series)? To engage with empirical data, we zoom into the long term forecast of ASML in March - April 2023 (illustrated in Figure 4) to answer the first question, and the short term rolling forecast of BNP in the same period (Figure 6) for the second question.

Figure 4. Long Term Forecasts: A zoomed-in plot for all models in March - April 2023, for ticker ASML.AS

From Figure 4, we can closely observe the ability of the CVAE-generated forecasts to match the spike on 17th March 2023, which benefits from the advanced information in the rebalancing date indicators (RB). The corresponding baselines stay flat, as expected from the convergence of stationary forecasts when the forecast horizon becomes large (the last observation being at the end of 2022).

Further to this, we may analyse the counterfactual of the CVAE-generated forecasts in the case where there is no rebalancing event in that week. To do this, we augment the advanced information $X^{0}$ such that $RB_{t}=(0,0,0)$ for the forecast period. We then generate new forecast paths using Algorithm 2 with the augmented advanced information $X^{0}$.
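
The augmentation of $X^{0}$ for this counterfactual can be sketched as follows. The sketch assumes the advanced information is stored as an array of shape (forecast days, features) with the three $RB_{t}$ entries in the last three columns, in line with the ordering $(Sector, Location, DoW, RB)$ used for the U-CVAE; the function name and the column layout are assumptions.

```python
import numpy as np


def without_rebalancing(x0, rb_columns=slice(-3, None)):
    """Return a copy of X^0 with RB_t set to (0, 0, 0) on every forecast day."""
    x0_cf = np.array(x0, dtype=float)  # np.array copies, so x0 is left untouched
    x0_cf[:, rb_columns] = 0.0
    return x0_cf
```

Feeding both the original and the modified array through the same forecasting routine, with the same random seed, isolates the effect of the rebalancing indicator on the generated paths.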

Figure 5. RB interpretation: ASML illustration (generated by U-CVAE)

In Figure 5, we plot the paths (in grey) together with their mean (in red), 97.5% quantile (in blue) and 2.5% quantile (in black). The left panel shows the forecasts with the actual $X^{0}$, where rebalancing happens on Friday, whereas the right panel shows the newly generated forecasts under the counterfactual without rebalancing. There is a slight difference in the Thursday (16th March) paths, where the forecast with rebalancing moves up and is slightly tighter on the day before the rebalancing date, and a large difference on Friday (17th March), where rebalancing pushes up the forecast in the left panel; the counterfactual without a rebalancing date shows a relatively normal mean and upper and lower quantiles.

Figure 6. Short Term Rolling Forecasts: A zoomed-in plot for all models in March - April 2023, for ticker BNP.PA

In a similar fashion, we observe from Figure 6 a spike in the data on 24th March. This occurs during an episode of higher-than-usual volume in March 2023 (mainly due to the European banking sector distress during that period). We zoom in to the last week of March (week commencing 27th March) to analyse the lagged impact. As these are rolling forecasts, U-CVAE and its baseline both pick up the last observed spike in their upcoming 5-day short term forecasts. The spike on 24th March was recorded at just below 5; it is tempting to study the impulse of this extraordinary observation by comparing against counterfactuals in which the observation is zero or the negative of what was observed (just above -5).

To do this, two alternative paths are generated by replacing the $X^{1}_{t}=y_{t}$ part of Algorithm 2 with our desired counterfactual values. We plot these paths in Figure 7 with the augmented $y_{t}$ values. As visualised, the U-CVAE generated forecasts have a shape similar to those in stationary time series analysis, where the paths converge back to the longer-term stationary level (around 0) after an upward or downward shock. In reality, the upward shock was in effect, to which U-CVAE responded well, similar to how the baseline performed.

Figure 7. Lagged volume feature interpretation: BNP illustration (generated by U-CVAE)

4. Further Discussions

4.1. Path correlation in non-stationary time series

Recall that for a given $k$, a forecast path can be written as $y_{t+\cdot|t} := (y_{t+1|t}, \dots, y_{t+k|t})$, and likewise for $y(i)_{t+\cdot|t}$, which takes the $i$-th entry, namely $y(i)_{t+\cdot|t} := (y(i)_{t+1|t}, \dots, y(i)_{t+k|t})$. The correlation statistic takes the form

$$\rho_{i,j} = \frac{Cov(y(i)_{t+\cdot|t},\, y(j)_{t+\cdot|t})}{\sqrt{V(y(i)_{t+\cdot|t})\, V(y(j)_{t+\cdot|t})}}$$

between paths $y(i)_{t+\cdot|t}$ and $y(j)_{t+\cdot|t}$. To estimate this statistic, we consider the correlation of the average paths (CAP), defined over the generated average path $\bar{y}_{t+\cdot|t}$. This can be expressed as

$$\hat{\rho}^{CAP}_{i,j} = \frac{Cov(\bar{y}(i)_{t+\cdot|t},\, \bar{y}(j)_{t+\cdot|t})}{\sqrt{V(\bar{y}(i)_{t+\cdot|t})\, V(\bar{y}(j)_{t+\cdot|t})}} \tag{8}$$

Throughout this paper so far, we have used CAP to report the correlation statistics. However, it can also be of interest to report another statistic, which concerns per-path correlation. To this end, we define the average correlation of paths (ACP), denoted $\hat{\rho}^{ACP}_{i,j}$, as below:

$$\forall s,\ \hat{\rho}^{s}_{i,j} := \frac{Cov(y(i)^{s}_{t+\cdot|t},\, y(j)^{s}_{t+\cdot|t})}{\sqrt{V(y(i)^{s}_{t+\cdot|t})\, V(y(j)^{s}_{t+\cdot|t})}}; \quad \hat{\rho}^{ACP}_{i,j} = \frac{\sum_{s \in S} \hat{\rho}^{s}_{i,j}}{|S|} \tag{9}$$
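The two estimators in Equations 8 and 9 can be sketched as follows. This is a minimal illustration in plain Python, assuming the simulated paths are stored as nested lists `paths[s][h]` (path $s$ at horizon $h$) and that path $s$ of series $i$ is paired with path $s$ of series $j$. The function names `pearson`, `average_path`, `cap` and `acp`, as well as the toy numbers, are hypothetical and not taken from the paper's implementation.

```python
import math

def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def average_path(paths):
    """Pointwise mean over simulated paths."""
    n = len(paths)
    return [sum(p[h] for p in paths) / n for h in range(len(paths[0]))]

def cap(paths_i, paths_j):
    """Correlation of the average paths (Equation 8)."""
    return pearson(average_path(paths_i), average_path(paths_j))

def acp(paths_i, paths_j):
    """Average of the per-path correlations (Equation 9)."""
    rhos = [pearson(p_i, p_j) for p_i, p_j in zip(paths_i, paths_j)]
    return sum(rhos) / len(rhos)

# Toy data: |S| = 2 simulated paths per series over k = 3 horizons.
paths_i = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]
paths_j = [[0.0, 1.0, 2.0], [2.0, 1.0, 3.0]]

cap_value = cap(paths_i, paths_j)  # correlation computed after averaging
acp_value = acp(paths_i, paths_j)  # averaging done after correlating
```

Even on this toy input the two quantities differ, since averaging and taking correlations do not commute in general.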

To showcase the difference empirically, we compute these statistics for the correlation and cross correlation between ASML and BNP, and plot the rolling estimates (expanding window of samples) in Figure 8 for short term rolling forecasts under the $\sigma = 1$ configuration. Clearly, ACP and CAP converge to different values, with CAP showing less bias in approximating the true correlation and ACP approximating the true cross correlation more closely. A more general set of cross correlation matrices is plotted in Figure 10 in the appendix, where ACP tends to fit the actual data well while CAP mostly over-estimates the values.

Figure 8. Correlation estimates (left) and cross correlation (right), with red line indicating the true data estimation

It is in general not easy to derive the limiting distribution of the correlation variables when the sample size goes to infinity. Fixing $k$, under linear time series models (stationary ARMA or VAR in particular), each sample would be drawn from a Gaussian distribution, hence both covariance and variance would converge to $\chi^{2}$ in distribution. However, non-stationary time series do not have such convergence, making it hard to derive the limiting distribution analytically.

Likewise, it may be tempting to discuss the limit when $k$ goes to infinity. Under linear time series, the forecast gradually concentrates to a point, as $y_{t+k|t} \to \mathbb{E}[Y]$ as $k \to \infty$ under the stationarity assumption, which leads to the unbiasedness of the correlation estimate when $k \to \infty$, should all variables be stationary. However, non-stationary variables do not benefit from this, and ACP is conjectured to differ from CAP under the non-linear setting. One may derive the limiting distribution under simple regime-switching for this case.
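As a small worked illustration of the concentration of linear forecasts, consider a stationary AR(1) process with mean $\mu$ and coefficient $|\phi| < 1$, whose $k$-step forecast is $y_{t+k|t} = \mu + \phi^{k}(y_t - \mu)$. The sketch below uses assumed values ($\mu = 0$, $\phi = 0.5$, and a last observation of 5, loosely echoing the size of the shock discussed in section 3.6); it is not a model fitted in this paper.

```python
def ar1_forecast(y_t, mu, phi, k):
    """k-step-ahead forecast of a stationary AR(1) with mean mu and |phi| < 1."""
    return mu + (phi ** k) * (y_t - mu)

# Forecasts at increasing horizons shrink geometrically towards the mean.
horizons = (1, 2, 5, 20)
forecasts = [ar1_forecast(y_t=5.0, mu=0.0, phi=0.5, k=k) for k in horizons]
```

The forecast at horizon 20 is already within $10^{-5}$ of the unconditional mean, which is the sense in which the forecast path degenerates to a point as $k$ grows.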

4.2. Other Extensions

There are many other aspects of the CVAE forecasting schemes which are not discussed extensively in this paper due to the length restriction. First and foremost, there are alternative generative schemes that are worth consideration and demonstration. Instead of drawing from

$$\mathbb{E}_{Z \sim N(0, I_q),\ X = x_t}\left[N(\hat{f}^{de}(X, Z),\, \sigma^{2} I_d)\right]$$

we may wish to draw $Z$ from the conditional distribution based on the previous observation, that is, from

$$\mathbb{E}_{(Z \sim Z | X = x_{t-1}, Y = y_{t-1}),\ X = x_t}\left[N(\hat{f}^{de}(X, Z),\, \sigma^{2} I_d)\right]$$

This makes it possible to interact with the encoder, as the distribution of $Z|X,Y$ is learnt by the encoder function. In fact, in some literature, this is the main generating method (Li et al., 2022).
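The difference between the two generative schemes can be sketched as below, in the scalar case $q = d = 1$. The functions `decoder` and `encoder` are hypothetical closed-form stand-ins for the trained networks $\hat{f}^{de}$ and the encoder, chosen only so that the example runs; the numerical inputs are likewise assumptions.

```python
import random

def decoder(x, z):
    """Hypothetical stand-in for the trained decoder f^de(X, Z)."""
    return 0.5 * x + 0.1 * z

def encoder(x, y):
    """Hypothetical stand-in for the encoder: mean and std of Z | X, Y."""
    return 0.3 * (y - x), 0.5

def sample_prior(x_t, sigma, rng):
    """Scheme 1: draw Z ~ N(0, 1), then Y ~ N(f^de(x_t, Z), sigma^2)."""
    z = rng.gauss(0.0, 1.0)
    return rng.gauss(decoder(x_t, z), sigma)

def sample_conditional(x_t, x_prev, y_prev, sigma, rng):
    """Scheme 2: draw Z from the encoder distribution at (x_{t-1}, y_{t-1})."""
    mean_z, std_z = encoder(x_prev, y_prev)
    z = rng.gauss(mean_z, std_z)
    return rng.gauss(decoder(x_t, z), sigma)

rng = random.Random(0)
prior_draws = [sample_prior(1.0, 0.1, rng) for _ in range(2000)]
cond_draws = [sample_conditional(1.0, 0.0, 5.0, 0.1, rng) for _ in range(2000)]

prior_mean = sum(prior_draws) / len(prior_draws)
cond_mean = sum(cond_draws) / len(cond_draws)
```

Under the second scheme the latent draw carries information from the previous observation, so the sampled forecasts shift relative to the prior-based scheme whenever the encoder mean is non-zero.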

Additionally, as introduced in section 2.3, advanced information is a general concept that applies to forecast measures beyond point forecasts. Hence one could look into interval forecasting, in addition to our investigation of the mean forecast. This would then be comparable to traditional linear models. An attempt is made in section 3.6 when plotting the upper and lower quantiles of the simulated paths, though a more thorough extension would be desirable.

There are many other aspects of machine learning techniques on the neural network architectures that could be extended, including the increase of latent dimension and alternative architectures such as convolutional neural networks. These could act as better forms of approximation to the true non-linear and non-stationary nature of the data.

5. Conclusion

In this paper, we first identify the class of problems of time series forecasting with advanced information, which can be related to many problems in time series and finance, including stationarity, panel data, and stock volume forecasting. A CVAE architecture is introduced to model such time series. We further investigate the case of daily stock volume forecasting, and find the CVAE generated forecasts more competitive than traditional linear models. The CVAE forecasts may be further generated for different scenarios of input features, creating a possibility to interpret feature inputs and generate various scenarios. Various extensions are then discussed in light of potential challenges which can be encountered in non-linear time series.

Figure 9. Cross Correlation Matrix of Long Term Forecasts (Upper) and Correlation Matrix of Short Term Rolling Forecasts (Lower): Data (left), U-CVAE (centre), and ARMA(1,1) (right)

Appendix A Appendix

A.1. Technical remarks on the training of CVAE

A.1.1. ERM for training

The empirical risk minimisation for training the CVAE goes as follows. We aim to maximise $\mathbb{E}_{X,Y \sim D}[\log(P(Y|X))]$ (equivalently, to minimise the negative log-likelihood as the empirical risk) based on the assumptions in Equation 1 and Equation 2. Let $P(Y|X,Z)$ denote the probability distribution as specified in Equation 1, let $P(Z) = N(0, I_q)$ be the distribution of $Z$, let $Q(Z|X,Y)$ be the distribution of $Z$ conditional on $X, Y$ as specified by Equation 2, and let $P(Z|X,Y)$ be the conditional distribution obtained by Bayes' rule, i.e. $P(Z|X,Y) = \frac{P(Y|X,Z)P(Z|X)}{P(Y|X)}$.

Write $KL(\cdot||\cdot)$ for the KL divergence, then observe, by Bayes' rule,

$$\begin{aligned}
& KL(Q(Z|X,Y)\,||\,P(Z|X,Y)) \\
={} & \mathbb{E}_{Z \sim Q(Z|X,Y)}\big[\log(Q(Z|X,Y)) - \log(P(Z|X)) + \log(P(Y|X)) - \log(P(Y|X,Z))\big] \\
={} & KL(Q(Z|X,Y)\,||\,P(Z|X)) + \log(P(Y|X)) - \mathbb{E}_{Z \sim Q(Z|X,Y)}[\log(P(Y|X,Z))]
\end{aligned}$$

Rearranging gives

$$\begin{aligned}
\log(P(Y|X)) ={} & \mathbb{E}_{Z \sim Q(Z|X,Y)}[\log(P(Y|X,Z))] + KL(Q(Z|X,Y)\,||\,P(Z|X,Y)) - KL(Q(Z|X,Y)\,||\,P(Z|X)) \\
\geq{} & \mathbb{E}_{Z \sim Q(Z|X,Y)}[\log(P(Y|X,Z))] - KL(Q(Z|X,Y)\,||\,P(Z|X))
\end{aligned}$$

The last line is also known as the Variational Lower Bound. Instead of interacting with $\log(P(Y|X))$, we interact with the Variational Lower Bound. As in the literature, we assume $P(Z|X) = P(Z)$, so that the latent variable is independent of the input (Sohn et al., 2015). This enables the KL divergence term to be written in explicit form, which, up to the additive constant $-q/2$ that does not depend on the parameters, is

$$KL(Q(Z|X,Y)\,||\,P(Z)) = \frac{||\mu(X,Y)||^{2}_{2} + tr(\Sigma(X,Y)) - \log(\det(\Sigma(X,Y)))}{2}$$
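The explicit form of the KL term can be evaluated directly. The sketch below assumes a diagonal $\Sigma(X,Y)$, represented by a list of variances, and includes the constant $-q/2$ so that the divergence is exactly zero when $Q$ coincides with $N(0, I_q)$; this constant does not affect the optimisation. The function name is hypothetical.

```python
import math

def kl_diag_gaussian_to_standard(mu, var):
    """KL( N(mu, diag(var)) || N(0, I_q) ) in closed form."""
    q = len(mu)
    sq_norm = sum(m * m for m in mu)          # ||mu||_2^2
    trace = sum(var)                          # tr(Sigma)
    log_det = sum(math.log(v) for v in var)   # log det(Sigma)
    return 0.5 * (sq_norm + trace - log_det - q)

# Q equal to the prior gives zero divergence.
kl_zero = kl_diag_gaussian_to_standard([0.0, 0.0], [1.0, 1.0])
# Shifting one mean coordinate by 1 adds 1/2.
kl_shift = kl_diag_gaussian_to_standard([1.0, 0.0], [1.0, 1.0])
```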

Now, given the dataset $D = \{(x_i, y_i)\}_{i \in I}$, we maximise the empirical version of the bound (equivalently, minimise its negative), namely

$$\mathbb{E}_{X,Y \sim D}\big[\mathbb{E}_{Z \sim Q(Z|X,Y)}[\log(P(Y|X,Z))] - KL(Q(Z|X,Y)\,||\,P(Z))\big]$$

The first term can be further approximated with simulated samples $z^{i}_{1}, \dots, z^{i}_{S} \sim Q(Z|x_i, y_i)\ \forall i$:

$$\sum_{i \in I} \sum_{s \in S} \frac{\log(P(y_i|x_i, z^{i}_{s}))}{|I||S|}$$

By Equation 1, the Gaussian log-density can be written as

$$\log(P(y_i|x_i, z^{i}_{s})) = c - \frac{||y_i - f(x_i, z^{i}_{s})||^{2}_{2}}{2\sigma^{2}}$$

where $c$ is a constant term independent of $x_i, y_i, z^{i}_{s}$. Maximising the bound therefore penalises the squared reconstruction error.
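The Monte Carlo approximation of the first term can be sketched as follows for scalar outputs ($d = 1$). For a Gaussian density the quadratic term enters with a negative sign and a factor $1/(2\sigma^{2})$, which is what the code uses. The `toy_encoder` and `toy_decoder` functions are hypothetical stand-ins for the trained networks, chosen so that the result can be checked by hand.

```python
import math
import random

def gaussian_log_density(y, mean, sigma):
    """log N(y; mean, sigma^2) for scalar y."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return const - (y - mean) ** 2 / (2.0 * sigma ** 2)

def reconstruction_term(data, encoder, decoder, sigma, n_samples, rng):
    """Average of log P(y_i | x_i, z_s^i) over data points i and draws s."""
    total = 0.0
    for x, y in data:
        mean_z, std_z = encoder(x, y)
        for _ in range(n_samples):
            z = rng.gauss(mean_z, std_z)  # z_s^i ~ Q(Z | x_i, y_i)
            total += gaussian_log_density(y, decoder(x, z), sigma)
    return total / (len(data) * n_samples)

def toy_encoder(x, y):
    """Hypothetical encoder returning the prior mean and std."""
    return 0.0, 1.0

def toy_decoder(x, z):
    """Hypothetical decoder that ignores z and reproduces y = 2x exactly."""
    return 2.0 * x

data = [(1.0, 2.0), (2.0, 4.0)]
value = reconstruction_term(data, toy_encoder, toy_decoder, 1.0, 10, random.Random(0))
```

Because the toy decoder reproduces every $y_i$ exactly, the quadratic term vanishes and the estimate reduces to the constant $c = -\tfrac{1}{2}\log(2\pi)$.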

Figure 10. Cross correlation from data (left), ACP estimate of cross correlation (centre), and CAP estimate of cross correlation (right)

A.1.2. Architecture and optimisation techniques

As for the architecture of the neural networks (formally speaking, the spaces $F_1, F_2$), we use two layers of ReLU network for the encoder function, with dimensions linking the input space ($\mathbb{R}^{p} \times \mathbb{R}^{d}$) to $\mathbb{R}^{q} \times \mathbb{R}^{q}$ for the first layer, an untrained identity map $\mathbb{R}^{q} \to \mathbb{R}^{q}$ to map to the expectation term, and an untrained softplus $\mathbb{R}^{q} \to (0, \infty)^{q}$ to map to the variance.

We use two layers of ReLU and one linear layer for the decoder function. The dimensionality depends on the dimensionality of the input. When $d = 50$ (the M-CVAE model), the two ReLU layers map the input dimension to 64 ($\mathbb{R}^{59} \to \mathbb{R}^{64}$), followed by $\mathbb{R}^{64} \to \mathbb{R}^{64}$, and then a linear layer $\mathbb{R}^{64} \to \mathbb{R}^{50}$ for output. When $d = 1$ (the U-CVAE model), the ReLU layers are $\mathbb{R}^{27} \to \mathbb{R}^{16}$ and $\mathbb{R}^{16} \to \mathbb{R}^{8}$, followed by a linear layer $\mathbb{R}^{8} \to \mathbb{R}$ for output.
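The decoder layer shapes for the $d = 1$ case can be checked with a small forward pass. This is a shape-only sketch in plain Python with random, untrained weights; it is not the TensorFlow model used in the paper, and the helper names are hypothetical.

```python
import random

def dense(x, weights, biases, relu):
    """One fully connected layer; weights[j] holds the input weights of unit j."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def init_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases, for shape checking only."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def decoder_forward(x, layers):
    """Apply ReLU on every layer except the last, which is linear."""
    for idx, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases, relu=idx < len(layers) - 1)
    return x

rng = random.Random(0)
sizes = [27, 16, 8, 1]  # decoder widths stated for d = 1
layers = [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
output = decoder_forward([1.0] * 27, layers)
```

Replacing `sizes` with `[59, 64, 64, 50]` gives the corresponding shapes for the $d = 50$ case.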

The optimisation procedure can be summarised as a combination of ADAM (default setting in TensorFlow version 2.16.1) and validation-based early stopping. Training is halted when the validation loss rises more than 1% above the local minimum attained within the last 3 steps of training.
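One possible reading of this stopping rule is sketched below: training halts at the first step whose validation loss is more than 1% above the minimum of the previous 3 validation losses. This interpretation, the function name, and the example loss values are assumptions made for illustration; the paper's exact criterion may differ in detail.

```python
def early_stop_step(val_losses, tolerance=0.01, window=3):
    """Return the first step at which training would halt, or None."""
    for step in range(window, len(val_losses)):
        recent_min = min(val_losses[step - window:step])
        # Halt when the current loss exceeds the recent minimum by more than 1%.
        if val_losses[step] > recent_min * (1.0 + tolerance):
            return step
    return None

losses = [1.00, 0.80, 0.70, 0.69, 0.72, 0.65]
stop_at = early_stop_step(losses)
```

In this example the loss at step 4 (0.72) exceeds the recent minimum (0.69) by more than 1%, so training would stop there, before the later improvement at step 5 is seen.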

A.1.3. Technical note regarding $\sigma$ calibration

There are various ways to estimate or calibrate $\sigma$. As the focus of this paper is on the point forecast $y_{t+k|t}$ rather than interval forecasting, we aim to explore the centre of sampling as opposed to the tails, hence the value of $\sigma$ is calibrated rather than estimated. In section 3.5, we used $\sigma = 0.1$ for efficient sampling to produce overall results, and in sections 3.6 and 4, we used $\sigma = 1$ to obtain a wider range of samples and to further the investigation of correlation estimates.

References

  • Berliner (1996) L Mark Berliner. 1996. Hierarchical Bayesian time series models. In Maximum Entropy and Bayesian Methods: Santa Fe, New Mexico, USA, 1995 Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods. Springer, 15–22.
  • Cont et al. (2013) Rama Cont, Arseniy Kukanov, and Sasha Stoikov. 2013. The Price Impact of Order Book Events. Journal of Financial Econometrics 12, 1 (2013), 47–88. https://doi.org/10.1093/jjfinec/nbt003
  • De Prado (2018) Marcos Lopez De Prado. 2018. Advances in financial machine learning. John Wiley & Sons.
  • Doersch (2016) Carl Doersch. 2016. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016).
  • Guéant et al. (2020) Olivier Guéant, Iuliia Manziuk, and Jiang Pu. 2020. Accelerated share repurchase and other buyback programs: what neural networks can bring. Quantitative Finance 20, 8 (2020), 1389–1404.
  • Hamdouche et al. (2022) Mohamed Hamdouche, Pierre Henry-Labordere, and Huyên Pham. 2022. Policy gradient learning methods for stochastic control with exit time and applications to share repurchase pricing. Applied Mathematical Finance 29, 6 (2022), 439–456.
  • Li et al. (2022) Yan Li, Xinjiang Lu, Yaqing Wang, and Dejing Dou. 2022. Generative Time Series Forecasting with Diffusion, Denoise, and Disentanglement. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. 23009–23022.
  • Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. 2015. Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2015/file/8d55a249e6baa5c06772297520da2051-Paper.pdf
  • STOXX (2024) STOXX. 2024. STOXX® INDEX METHODOLOGY GUIDE (PORTFOLIO BASED INDICES). https://www.stoxx.com/document/Indices/Common/Indexguide/stoxx_index_guide.pdf
  • Tsay and Chen (2018) Ruey S Tsay and Rong Chen. 2018. Nonlinear time series analysis. Vol. 891. John Wiley & Sons.
  • Yoon et al. (2019) Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar. 2019. Time-series Generative Adversarial Networks. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32.