Few-Shot Learning Patterns in Financial Time-Series for Trend-Following Strategies

Kieran Wood
Oxford-Man Institute
University of Oxford
[email protected]
Samuel Kessler
Oxford-Man Institute
University of Oxford
[email protected]
Stephen J. Roberts
Oxford-Man Institute
University of Oxford
[email protected]
Stefan Zohren
Oxford-Man Institute
University of Oxford
[email protected]
Abstract

Forecasting models for systematic trading strategies do not adapt quickly when financial market conditions rapidly change, as was seen in the advent of the COVID-19 pandemic in 2020, causing many forecasting models to take loss-making positions. To deal with such situations, we propose a novel time-series trend-following forecaster that can quickly adapt to new market conditions, referred to as regimes. We leverage recent developments from the deep learning community and use few-shot learning. We propose the Cross Attentive Time-Series Trend Network – X-Trend – which takes positions attending over a context set of financial time-series regimes. X-Trend transfers trends from similar patterns in the context set to make forecasts, then subsequently takes positions for a new distinct target regime. By quickly adapting to new financial regimes, X-Trend increases Sharpe ratio by 18.9% over a neural forecaster and 10-fold over a conventional Time-series Momentum strategy during the turbulent market period from 2018 to 2023. Our strategy recovers twice as quickly from the COVID-19 drawdown compared to the neural forecaster. X-Trend can also take zero-shot positions on novel unseen financial assets, obtaining a 5-fold Sharpe ratio increase versus a neural time-series trend forecaster over the same period. Furthermore, the cross-attention mechanism allows us to interpret the relationship between forecasts and patterns in the context set.

* Equal contribution.

Keywords: Trend-Following · Time-Series Momentum · Few-Shot Learning · Deep Learning · Machine Learning · Transfer Learning · Change-point Detection · Quantitative Finance · Portfolio Construction

1 Introduction

When financial market conditions change, the forecasting models that are used to take positions in the markets perform very poorly [1, 2]. The recent success of deep learning for learning representations from data has translated into better financial forecasting models [3]. However, deep learning models also rely on large stationary datasets for representation learning, whereas financial markets can be highly non-stationary due to changing market conditions. When a financial market enters a new regime, augmenting the inputs with indicators of the time and severity of regime change improves returns [4]. This finding is important since it shows that there is a benefit to training deep learning models that have additional supervision about when regimes change. Furthermore, it raises an important question: can we selectively use historical patterns to transfer knowledge of the past to make forecasts for new regimes and new markets?

Figure 1: An overview of the X-Trend few-shot learning trend-following model. 1) Each asset is segmented into regimes using a change-point detection algorithm. 2) The context set is constructed by randomly sampling regimes from different assets. The objective is to produce long/short positions given the target sequence while respecting causality with the context set. 3) Our X-Trend model uses a cross-attention layer for the target to leverage patterns in the context set. 4) The model produces a distribution over next-day returns. 5) The model outputs positions using Predictive probability density function To traded Position module (PTP). 6) We train our model by jointly optimizing a Sharpe ratio loss and negative log-likelihood.

In recent years, it has been observed that the risk-adjusted returns of conventional Time-series Momentum (TSMOM) models [5], which exploit trends in financial time-series, have deteriorated by 87.4% from 2018 to 2023 compared to the period from 1995 to 2000. This can potentially be attributed to a concept known as ‘factor crowding’ [6], where arbitrageurs trade the same assets based on similar factors. This causes market inefficiencies to quickly disappear and can increase the risk of liquidity-driven tail events [7]. To mitigate this, one alternative has been to explore new assets where we typically do not have sufficient data for deep learning approaches.

In deep learning, quick adaptation to new data, or few-shot learning, has seen recent advances in computer vision [8, 9, 10]. Few-shot learning involves training neural networks (NNs) such that they are able to adapt and learn from minimal data. Few-shot learners are tested on completely unseen classes of images using a few to no examples, or are required to solve unseen reinforcement learning environments [11, 12, 13]. This ability to adapt and learn from very few data points is advantageous for a systematic trading strategy, allowing it to adapt quickly to new financial regimes or new markets. Broadly, we can categorize the market regimes, which systematic strategies aim to exploit, as trending or mean-reverting. Mean-reversion is a market phenomenon whereby, under certain circumstances, the price has a tendency to return to its long-term mean [14]. A detailed study of the trending and mean-reverting financial market anomalies can be found in [15].

In this work, we leverage advances in few-shot learning and time-series momentum strategies to develop a model that can make predictions in new market regimes and unseen markets. In practice, our experiments backtest on continuous futures contracts of various asset classes: equities, foreign exchange, commodities and fixed income. Our model obtains significantly higher risk-adjusted returns in terms of Sharpe ratio [16], a measure of returns per unit volatility. Our model is able to learn transferable patterns, then subsequently learn to take positions in markets or regimes distinct from those used for training. The idea of universal patterns in financial markets, which are transferable, is motivated by the works  [17, 18, 19].

Our model uses a cross-attention mechanism over a context set [20, 21]. Attention mechanisms [22] have been shown to improve returns by attending over an asset’s history [19]; this work generalizes this finding by extending the temporal attention mechanism over other assets. This enables our model to transfer knowledge from the context set to make better predictions for a new regime or new market with little data. Different regimes from financial assets are segmented using change-point detection methods [23, 24]. Additionally, the attention maps provide a degree of interpretability in the resulting predictions [25, 19]. We also use the latest insights from deep time-series momentum strategies to train our model to produce positive returns over our baselines [3]. We call our method the Cross Attentive Time-Series Trend Network or X-Trend for short. The code is available at https://github.com/kieranjwood/x-trend.

We summarize our contributions as follows:

  • We leverage few-shot learning and change-point detection to develop an agent which is able to produce returns in futures markets with minimal data. Our X-Trend model is able to successfully respond to “momentum crashes” [1] and “momentum turning points” [2]. We improve Sharpe ratio by 18.9% in comparison to the benchmark neural forecaster over 2018 to 2023, an extremely turbulent period in various financial markets. Our X-Trend strategy recovers from the initial COVID-19 drawdown twice as quickly.

  • X-Trend learns to make predictions in challenging, low-resource, zero-shot settings where the model has never seen a financial asset during training. It improves over the loss-making neural forecaster to achieve an average Sharpe of 0.47 over the period 2018 to 2023. This turbulent period is particularly challenging for unseen financial assets.

  • X-Trend makes interpretable predictions. It is able to learn relationships between similar assets using an interpretable cross-attention mechanism over a context set of different assets. For a given target sequence, a similarity score with patterns in the context set via cross-attention can be visualized. Additionally, by outputting a forecast as an auxiliary output step, we reveal the relationship between optimal trading signal and forecast.

2 Preliminaries

Let us denote a time-series $p^{(i)}_{1:t}$ of daily close prices, where $i$ denotes a particular asset from a basket of assets $i \in \mathcal{I}$ and $t$ denotes a time index $t \in \{1, \ldots, T\}$, where $T$ is the final observation for the asset. We work with returns $r^{(i)}_{1:t}$, which linearly de-trend the price series:

$$r_{t-t',t}^{(i)} = \frac{p_t^{(i)} - p_{t-t'}^{(i)}}{p_{t-t'}^{(i)}}, \tag{1}$$

where $t'$ is the number of days we calculate returns over and, for brevity, we use $r_t^{(i)}$ to denote $r_{t-1,t}^{(i)}$.
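As an illustration, the return definition in Eq. 1 can be sketched in plain Python. The function name `simple_returns` and the convention of returning `None` where fewer than $t'$ prior observations exist are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Optional


def simple_returns(prices: List[float], horizon: int = 1) -> List[Optional[float]]:
    """Compute r_{t-horizon,t} = (p_t - p_{t-horizon}) / p_{t-horizon} for each t.

    Entries with fewer than `horizon` prior observations are None.
    """
    if horizon < 1:
        raise ValueError("horizon must be a positive number of days")
    out: List[Optional[float]] = []
    for t, p_t in enumerate(prices):
        if t < horizon:
            out.append(None)
        else:
            p_prev = prices[t - horizon]
            out.append((p_t - p_prev) / p_prev)
    return out


prices = [100.0, 102.0, 101.0, 105.04]
daily = simple_returns(prices)                # r_t, i.e. horizon t' = 1
two_day = simple_returns(prices, horizon=2)   # r_{t-2,t}
```

With `horizon=1` the function yields the daily return $r_t^{(i)}$ used throughout the text; larger horizons give the multi-day returns that feed the momentum factors.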

Our objective is to trade a position $z^{(i)}_t \in [-1, 1]$, which we hold for the next day $t+1$, conditioned on a target sequence $\mathbf{x}^{(i)}_{-l_t:t}$ of the $l_t$ last days, where $\mathbf{x}^{(i)}_t \in \mathcal{X}$ is a vector of $|\mathcal{X}|$ factors measuring trends at different timescales or the relationship between trends at different timescales. These factors are constructed for time $t$ using returns data and price data, which we elaborate on in Section 2.2 and Section 2.3. We aim to choose positions such that we maximize portfolio returns:

$$R_{t+1}^{\text{Port.}} = \frac{1}{N} \sum_{i=1}^{N} R_{t+1}^{(i)}, \quad \text{where } R_{t+1}^{(i)} = z_t^{(i)} \, \frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}} \, r_{t+1}^{(i)} - C^{(i)} \, \sigma_{\mathrm{tgt}} \left| \frac{z_t^{(i)}}{\sigma_t^{(i)}} - \frac{z_{t-1}^{(i)}}{\sigma_{t-1}^{(i)}} \right|, \tag{2}$$

for each of the assets $i \in \mathcal{I}$, where $N = |\mathcal{I}|$ is the number of assets and $C^{(i)}$ is the transaction cost. We use volatility targeting [5, 26, 27], which introduces a leverage factor $\sigma_{\mathrm{tgt}} / \sigma_t^{(i)}$, where we normalize our holdings by the ex-ante volatility, $\sigma_t^{(i)}$, then scale by the annual target volatility $\sigma_{\mathrm{tgt}}$. Aligning our work with the literature [5, 26], $\sigma_t^{(i)}$ is calculated using an exponentially weighted moving standard deviation of returns over the span $r_{-60:t}$, where the contribution has decreased to zero by 60 days in the past. For this paper we set $C^{(i)} = 0$ rather than assuming a cost for each asset, focusing on the pure predictive power of our model. The volatility targeting approach to portfolio construction ensures that each asset contributes approximately equal risk to the portfolio (footnote 1: we assume a diagonal covariance, which is a reasonable assumption outside of the tail for the basket of futures contracts we consider in this paper, which is a typical basket for trend-following strategies).
As such, we perform regression to estimate the probability distribution of volatility-scaled next-day returns, $\frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}} \, r_{t+1}^{(i)} \in \mathbb{R}$.
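The volatility-targeted return in Eq. 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the smoothing factor $\alpha = 2 / (\mathrm{span} + 1)$ and the recursive variance estimator are choices made here (the text only states that a 60-day span is used), the example assumes $\sigma_t^{(i)}$ and $\sigma_{\mathrm{tgt}}$ are expressed on the same (annualized) scale, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def ewm_volatility(returns: Sequence[float], span: int = 60) -> List[float]:
    """Recursive exponentially weighted standard deviation of returns.

    Assumption: alpha = 2 / (span + 1); no bias correction is applied.
    """
    alpha = 2.0 / (span + 1.0)
    mean, var = returns[0], 0.0
    vols = [0.0]
    for r in returns[1:]:
        delta = r - mean
        mean += alpha * delta
        var = (1.0 - alpha) * (var + alpha * delta * delta)
        vols.append(math.sqrt(var))
    return vols


def asset_return(z_t: float, z_prev: float, sigma_t: float, sigma_prev: float,
                 r_next: float, sigma_tgt: float, cost: float = 0.0) -> float:
    """Per-asset term R_{t+1}^{(i)} of Eq. 2 (cost defaults to 0, as in the text)."""
    gross = z_t * (sigma_tgt / sigma_t) * r_next
    turnover = abs(z_t / sigma_t - z_prev / sigma_prev)
    return gross - cost * sigma_tgt * turnover


def portfolio_return(asset_returns: Sequence[float]) -> float:
    """Equal-weighted average of the per-asset returns (first part of Eq. 2)."""
    return sum(asset_returns) / len(asset_returns)


# Two illustrative assets with a 15% target volatility.
r_a = asset_return(z_t=1.0, z_prev=1.0, sigma_t=0.30, sigma_prev=0.30,
                   r_next=0.01, sigma_tgt=0.15)
r_b = asset_return(z_t=-0.5, z_prev=-0.5, sigma_t=0.10, sigma_prev=0.10,
                   r_next=-0.02, sigma_tgt=0.15)
port = portfolio_return([r_a, r_b])
```

The high-volatility asset is de-levered by a factor of 0.5 and the low-volatility asset is levered by 1.5, which is how each position ends up contributing a comparable amount of risk.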

2.1 Episodic Learning

We use episodic learning [8] which trains models in the same way as they are used for testing; models are trained to produce few-shot and zero-shot predictions. Traditional deep learning puts data from all assets together and is trained using mini-batch stochastic gradient descent [28, 29]. In episodic learning, we want to make forecasts given a sequence’s history for a specific asset, and leverage sequences from other assets for additional non-parametric similarity learning.

Our learner selects a position given a target sequence $\mathbf{x}^{(i)}_{t-l_t+1:t}$, which we shorten to $\mathbf{x}^{(i)}_{-l_t:t}$ for brevity, where $i$ belongs to our test set $\mathcal{I}_{ts}$ of assets. Additionally, the learner leverages a context set of prices $\mathcal{C} = \left\{\mathbf{x}^{(c)}_{-l_c:t_c}\right\}_{c=1}^{C}$, where $c$ belongs to our training set of assets $\mathcal{I}_{tr}$, $l_c$ is the length of each time-series in the context set and $C = |\mathcal{C}|$ is the total number of contexts in the context set. In all circumstances, the context sets are time-series that have occurred before the time-series in the target: $t_c < t$ for all $c$ in the context set, so that our predictions are causal.
Throughout the paper we will work with two distinct problem scenarios:

  • Few-shot: where the context set can contain the same asset as the target (but in the past), $\mathcal{I}_{tr} \cap \mathcal{I}_{ts} = \mathcal{I}$.

  • Zero-shot: where the target set asset (at test time) is not contained in the context set, $\mathcal{I}_{tr} \cap \mathcal{I}_{ts} = \emptyset$.

We sample a target time-series $\mathbf{x}_{-l_t:t}^{(i)}$ and an associated context set $\mathcal{C}$ per target point in the mini-batch. As opposed to our target problem, we also include the volatility-scaled next-day returns as inputs in the context set, denoting the combined inputs as $\bm{\xi}^{(c)}_{-l_c:t}$. We describe how we construct our context sets for the financial assets in Section 3.2.
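The causality constraint and the few-shot versus zero-shot distinction can be illustrated with a small sampling routine. This is only a sketch of the constraints stated above, not the construction described in Section 3.2: the `Segment` record, the function `sample_context_set` and the uniform random sampling are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Segment:
    asset: str      # asset identifier
    end_time: int   # index t_c of the last observation in the sequence
    length: int     # l_c, number of days in the sequence


def sample_context_set(candidates: Sequence[Segment], target_asset: str,
                       target_time: int, size: int, zero_shot: bool,
                       rng: random.Random) -> List[Segment]:
    """Sample `size` context sequences that respect causality (t_c < t).

    In the zero-shot setting, sequences from the target asset are excluded.
    """
    eligible = [
        s for s in candidates
        if s.end_time < target_time
        and not (zero_shot and s.asset == target_asset)
    ]
    if len(eligible) < size:
        raise ValueError("not enough causal candidates for the context set")
    return rng.sample(eligible, size)


candidates = [
    Segment("ES", 100, 21), Segment("ES", 250, 63), Segment("CL", 180, 21),
    Segment("GC", 90, 63), Segment("CL", 400, 21), Segment("ZN", 210, 63),
]
rng = random.Random(0)
few_shot = sample_context_set(candidates, "ES", 300, 3, zero_shot=False, rng=rng)
zero_shot = sample_context_set(candidates, "ES", 300, 3, zero_shot=True, rng=rng)
```

Filtering on `end_time < target_time` is what keeps the predictions causal; the zero-shot flag only removes the target asset from the pool of candidates.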

2.2 Classical Momentum Approaches

Time-series Momentum (TSMOM) strategies [5], or trend-following strategies, are based on the idea that strong price trends have a tendency to persist. The phenomenon of trend-following has been observed historically for more than a century (footnote 2: it should be noted that returns have started to suffer since the introduction of electronic trading [4] this century), outperforming a simple buy-and-hold Long strategy [30]. TSMOM strategies aim to forecast trends and then map them to a trading signal, or position. For instance, we can calculate returns over the past year (252 trading days), $r^{(i)}_{t-252,t}$, and map this to a position $z_t^{(i)}$:

$$z_t^{(i)} = \mathrm{sgn}\left(r^{(i)}_{t-252,t}\right), \tag{3}$$

where $\mathrm{sgn}(\cdot)$ is the sign function, with $+1$ corresponding to a full long position and $-1$ a full short position [5]. The Long strategy takes a full long position at each time-step: $z_t^{(i)} = 1$.

We employ volatility scaling for portfolio construction (Eq. 2). This approach to portfolio construction is often used in practice by Commodity Trading Advisors (CTAs), where trend-following is a major component of their overall strategy. While this TSMOM approach has proven to be successful over time [30], the 1-year momentum indicator particularly suffers during periods where market conditions change rapidly. We call these regime changes throughout this paper (also referred to as momentum crashes [1]). An attempt to mitigate regime changes is to combine weighted signals at different timescales, for example, 1-month (21-day), 3-month (63-day), half-year (126-day) and 12-month (252-day) momentum, $z_t^{(i)} = \sum_{t' \in \{21, 63, 126, 252\}} w_{t'} \, \mathrm{sgn}\left(r^{(i)}_{t-t',t}\right)$, where $w_{t'} \in [0, 1]$ represents the weighting of each respective factor.
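The sign rule in Eq. 3 and its multi-timescale blend can be sketched as below. The equal weights of 0.25 and the synthetic price path are assumptions chosen for illustration; the text does not prescribe particular weights.

```python
from typing import Dict, Sequence


def sign(x: float) -> int:
    """Sign function: +1, 0 or -1."""
    return (x > 0) - (x < 0)


def tsmom_position(prices: Sequence[float], t: int, lookback: int = 252) -> int:
    """Eq. 3: sign of the return over the last `lookback` days."""
    if t - lookback < 0:
        raise ValueError("not enough history for this lookback")
    past = prices[t - lookback]
    return sign((prices[t] - past) / past)


def blended_position(prices: Sequence[float], t: int,
                     weights: Dict[int, float]) -> float:
    """Weighted sum of sign signals at several lookbacks."""
    return sum(w * tsmom_position(prices, t, lb) for lb, w in weights.items())


# Synthetic path: a long uptrend followed by a sharp 30-day reversal.
prices = [100.0 + i for i in range(270)] + [369.0 - 2.0 * k for k in range(1, 31)]
t = len(prices) - 1
equal_weights = {21: 0.25, 63: 0.25, 126: 0.25, 252: 0.25}  # assumed weights

slow_signal = tsmom_position(prices, t)                 # 12-month signal only
blended = blended_position(prices, t, equal_weights)    # multi-timescale blend
```

On this path the 12-month signal is still fully long after the reversal, while the 21-day and 63-day signals have flipped short, so the blended position is flat. This is the sense in which combining timescales softens the response to a regime change.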

MACD (Moving Average Convergence Divergence) factors compare exponentially weighted signals at two different timescales [31]. These popular momentum indicators aim to balance the trade-off between capturing trends and responding to potential regime changes:

$$\mathrm{MACD}\left(p_{1:t}^{(i)}, S, L\right) = \frac{m^{(i)}_t}{\mathrm{std}\left(m_{-252:t}^{(i)}\right)} \tag{4a}$$

$$\text{where } m_t = \frac{\mathrm{EWMA}\left(p_{1:t}^{(i)}, S\right) - \mathrm{EWMA}\left(p_{1:t}^{(i)}, L\right)}{\mathrm{std}\left(p_{-60:t}^{(i)}\right)}, \tag{4b}$$

where $\mathrm{EWMA}(\cdot)$ is an exponentially weighted moving average. The inputs are a short timescale $S$, with half-life of $\log\left(\frac{1}{2}\right) / \log\left(1 - \frac{1}{S}\right)$, and a long timescale $L$, defined similarly. The MACD signal indicates buy if $> 0$ and sell if $< 0$. The magnitude provides a measure of conviction or signal strength. It is common to blend multiple MACD indicators at different timescales, with a typical choice being $(S, L) \in \left\{(8, 24), (16, 28), (32, 96)\right\}$ [31]. Funds typically convert the MACD signal to a position via the response function $y \mapsto y \exp(-y^2 / 4) / 0.89$ [31].
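A minimal sketch of Eq. 4a, Eq. 4b and the response function follows. It assumes an EWMA decay of $\alpha = 1/S$ (which is consistent with the stated half-life), a sample standard deviation over trailing windows, and `None` wherever a quantity is undefined; these conventions and all function names are choices made for this example.

```python
import math
import statistics
from typing import List, Optional, Sequence


def ewma(values: Sequence[float], timescale: float) -> List[float]:
    """Exponentially weighted moving average with decay alpha = 1 / timescale."""
    alpha = 1.0 / timescale
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1.0 - alpha) * out[-1])
    return out


def half_life(timescale: float) -> float:
    """Half-life log(1/2) / log(1 - 1/timescale), as stated in the text."""
    return math.log(0.5) / math.log(1.0 - 1.0 / timescale)


def trailing_std(values: Sequence[Optional[float]], window: int) -> List[Optional[float]]:
    """Sample std over the trailing `window` entries; None where undefined."""
    out: List[Optional[float]] = []
    for t in range(len(values)):
        chunk = [v for v in values[max(0, t - window + 1): t + 1] if v is not None]
        out.append(statistics.stdev(chunk) if len(chunk) >= 2 else None)
    return out


def macd(prices: Sequence[float], short: float, long: float) -> List[Optional[float]]:
    """Eq. 4b normalised by a 60-day price std, then Eq. 4a by a 252-day std."""
    fast, slow = ewma(prices, short), ewma(prices, long)
    price_std = trailing_std(prices, 60)
    m = [(f - s) / sd if sd else None for f, s, sd in zip(fast, slow, price_std)]
    m_std = trailing_std(m, 252)
    return [mt / sd if (mt is not None and sd) else None for mt, sd in zip(m, m_std)]


def response(y: float) -> float:
    """Response function y * exp(-y^2 / 4) / 0.89 mapping a signal to a position."""
    return y * math.exp(-y * y / 4.0) / 0.89


# Synthetic upward-trending prices with a small oscillation.
prices = [100.0 * 1.001 ** i + math.sin(i) for i in range(400)]
signal = macd(prices, short=8, long=24)
position = response(signal[-1])
```

The response function is odd, grows roughly linearly for small signals and shrinks again for very large ones, so extreme MACD readings are not translated into ever larger positions.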

2.3 Deep-learning Momentum Approaches

In financial markets, we often observe different trends and mean-reversions, which we can also think of as a collection of shorter-term trends, occurring concurrently at multiple timescales. Furthermore, the added complexity of regime change means that it is a daunting task to successfully blend various trading signals. This motivated Deep Momentum Networks (DMNs), as a solution to such a complex forecasting problem  [3, 19, 4, 32], which have been shown to outperform TSMOM in terms of risk-adjusted returns [3].

We opt to use factors that are commonly used in trend-following strategies [31, 30, 3]. Concretely, we use returns aggregated and normalized over different time scales:

$$\hat{r}^{(i)}_{t-t',t} = r^{(i)}_{t-t',t} \Big/ \left(\sigma_t^{(i)} \sqrt{t'}\right), \tag{5}$$

and include MACD indicators:

$$\mathbf{x}_t^{(i)} = \mathrm{Concat}\left(\left[\hat{r}^{(i)}_{t-t',t} \,\middle|\, t' \in \{1, 21, 63, 126, 252\}\right], \left[\mathrm{MACD}\left(p_{1:t}^{(i)}, S, L\right) \,\middle|\, \forall\, (S, L)\right]\right). \tag{6}$$
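The construction of the input vector in Eq. 5 and Eq. 6 can be sketched as follows. The sketch assumes that $\sigma_t^{(i)}$ is a daily volatility estimate supplied by the caller and that the MACD values for each $(S, L)$ pair have already been computed elsewhere; the names `normalized_return` and `feature_vector` are hypothetical.

```python
import math
from typing import List, Sequence

RETURN_HORIZONS = (1, 21, 63, 126, 252)  # t' values listed in Eq. 6


def normalized_return(prices: Sequence[float], t: int, horizon: int,
                      sigma_t: float) -> float:
    """Eq. 5: t'-day return divided by sigma_t * sqrt(t')."""
    past = prices[t - horizon]
    raw = (prices[t] - past) / past
    return raw / (sigma_t * math.sqrt(horizon))


def feature_vector(prices: Sequence[float], t: int, sigma_t: float,
                   macd_values: Sequence[float]) -> List[float]:
    """Eq. 6: concatenate normalised returns with precomputed MACD indicators."""
    if t < max(RETURN_HORIZONS):
        raise ValueError("need at least 252 days of history")
    returns = [normalized_return(prices, t, h, sigma_t) for h in RETURN_HORIZONS]
    return returns + list(macd_values)


prices = [100.0 * 1.001 ** i for i in range(300)]
# Placeholder MACD values, one per (S, L) pair; illustrative numbers only.
x_t = feature_vector(prices, t=299, sigma_t=0.01, macd_values=[0.4, 0.3, 0.2])
```

With five return horizons and three MACD pairs, the resulting vector has eight entries. Dividing by $\sqrt{t'}$ puts returns measured over different horizons on a comparable scale.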

The DMN framework simultaneously learns asset price trends and position sizes. The position sizes are estimated using a neural network:

$$z_{-l_t:t}^{(i)} = \mathrm{DMN}\left(\mathbf{x}_{-l_t:t}^{(i)}\right) = \left(\tanh \circ\, \mathrm{Linear} \circ\, g\right)\left(\mathbf{x}_{-l_t:t}^{(i)}\right), \tag{7}$$

where $g: \mathcal{X}^{l_t} \to \mathbb{R}^{d_h}$ is a neural network followed by a linear mapping, $\mathrm{Linear}: \mathbb{R}^{d_h} \to \mathbb{R}$, and a $\tanh$ activation function such that the traded position $z_t^{(i)} \in (-1, 1)$. The neural network hidden state dimension is denoted $d_h$.
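The composition $\tanh \circ \mathrm{Linear} \circ g$ in Eq. 7 can be illustrated with a toy forward pass. This is a sketch only: the encoder $g$ is replaced by a plain recurrent layer with random, untrained weights as a stand-in, and the class name `TinyDMN` and its parameters are hypothetical.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def matvec(matrix: Sequence[Sequence[float]], vector: Sequence[float]) -> Vector:
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class TinyDMN:
    """Toy tanh(Linear(g(x))) with a simple recurrent stand-in for g."""

    def __init__(self, n_features: int, d_h: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[Vector]:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.d_h = d_h
        self.w_in = init(d_h, n_features)   # input-to-hidden weights of g
        self.w_rec = init(d_h, d_h)         # hidden-to-hidden weights of g
        self.w_out = init(1, d_h)[0]        # Linear: R^{d_h} -> R
        self.b_out = 0.0

    def positions(self, sequence: Sequence[Sequence[float]]) -> List[float]:
        """Return one position in (-1, 1) per time step of the input sequence."""
        h = [0.0] * self.d_h
        out = []
        for x_t in sequence:
            pre = [a + b for a, b in zip(matvec(self.w_in, x_t), matvec(self.w_rec, h))]
            h = [math.tanh(p) for p in pre]                      # g: hidden state
            linear = sum(w * hj for w, hj in zip(self.w_out, h)) + self.b_out
            out.append(math.tanh(linear))                        # squash to (-1, 1)
        return out


data_rng = random.Random(1)
sequence = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]
model = TinyDMN(n_features=8, d_h=4)
z = model.positions(sequence)
```

Because the hidden state at step $t$ only depends on inputs up to $t$, each position is computed causally, and the final $\tanh$ guarantees that every output is a valid position strictly between a full short and a full long.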

The primary innovation of DMNs, as introduced in [3], was to output traded positions and directly optimize the Sharpe ratio: a risk-adjusted return metric measuring returns per unit volatility. Typically, most fund managers or CTAs will have a predefined risk tolerance and aim to maximize returns given this constraint. (Footnote 3: The loss function can easily be tailored to use drawdown, VaR (Value at Risk), or some combination of these, instead of volatility as the risk measure.) To optimize the neural network parameters $\theta$, the Sharpe loss function is defined as:

$$\mathcal{L}_{\mathrm{Sharpe}}^{\mathrm{DMN}(\cdot)}(\theta) = -\sqrt{252}\,\frac{\mathrm{mean}_{\Omega}\left[\frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}}\, r_{t+1}^{(i)}\, \mathrm{DMN}\left(\mathbf{x}_{-l_t:t}^{(i)}\right)\right]}{\mathrm{std}_{\Omega}\left[\frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}}\, r_{t+1}^{(i)}\, \mathrm{DMN}\left(\mathbf{x}_{-l_t:t}^{(i)}\right)\right]}, \tag{8}$$

where $\Omega$ is a batch of $|\Omega|$ pairs $(i,t)\in\mathcal{I}\times\mathcal{T}$. In practice, when training the model, we select $b$ sequences such that $\Omega=\bigcup\{\{i\}\times((t-l_t+l_s+1):t) \mid i\in\mathcal{I}, t\in\mathcal{T}\}$, with warm-up period $l_s\leq l_t$. That is, our loss function ignores the first $l_s$ predictions for each sampled sequence.
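The following is a small sketch of how Eq. 8 could be evaluated on a batch, assuming positions, next-day returns and ex-ante volatilities are already available as plain lists. The function name `sharpe_loss`, the default `sigma_tgt=0.15`, and the use of the population standard deviation are assumptions made for illustration only.

```python
import math

def sharpe_loss(positions, next_returns, vols, sigma_tgt=0.15, warmup=0):
    """Negative annualised Sharpe ratio of volatility-scaled strategy returns.

    Each argument is a list of sequences, one per sampled (asset, window) pair.
    The first `warmup` steps of every sequence are ignored, mirroring l_s.
    """
    captured = []
    for z_seq, r_seq, v_seq in zip(positions, next_returns, vols):
        steps = list(zip(z_seq, r_seq, v_seq))[warmup:]
        for z, r, v in steps:
            captured.append(sigma_tgt / v * r * z)  # scaled return captured by the position
    if len(captured) < 2:
        raise ValueError("need at least two observations after the warm-up")
    mean = sum(captured) / len(captured)
    variance = sum((c - mean) ** 2 for c in captured) / len(captured)
    if variance == 0.0:
        raise ValueError("captured returns have zero variance")
    return -math.sqrt(252) * mean / math.sqrt(variance)

# Illustrative batch with a single sequence of four steps.
example_positions = [[1.0, 1.0, 1.0, 1.0]]
example_returns = [[0.01, -0.01, 0.02, 0.0]]
example_vols = [[0.1, 0.1, 0.1, 0.1]]
example_loss = sharpe_loss(example_positions, example_returns, example_vols)
```

Since the same scaling appears in the numerator and denominator, the loss is invariant to the choice of $\sigma_{\mathrm{tgt}}$ when it is constant across the batch; minimizing the loss maximizes the Sharpe ratio.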

The Long Short-Term Memory cell (LSTM) [33] is a popular Recurrent Neural Network (RNN) tailored to modeling sequences and is the primary DMN component [3, 4]. In addition to the output for each time-step $\mathbf{h}^{(i)}_t\in(-1,1)^{d_h}$, the LSTM maintains a cell state $\mathbf{c}^{(i)}_t\in\mathbb{R}^{d_h}$, which stores longer-term information:

$$(\mathbf{h}_t^{(i)},\mathbf{c}_t^{(i)})=\mathrm{LSTM}(\mathbf{x}_t^{(i)},\mathbf{h}_{t-1}^{(i)},\mathbf{c}_{t-1}^{(i)}). \tag{9}$$

The LSTM modulates information through a series of gates, clearing $\mathbf{c}^{(i)}_t$ when necessary, such as during a regime change, and maintaining $\mathbf{h}^{(i)}_t$ as a localized summary of the sequence. The initialization $(\mathbf{h}^{(i)}_0,\mathbf{c}^{(i)}_0)$ can be specific per contract, and we provide details of our implementation in Section 3.1.
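To make the recurrence in Eq. 9 concrete, here is a sketch of a single step of a standard LSTM cell in plain Python. The gate names, the helper `init_params`, and the random initialization are illustrative assumptions; a practical model would rely on a deep learning library and learned weights.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One application of Eq. 9: (h_t, c_t) = LSTM(x_t, h_{t-1}, c_{t-1})."""
    joint = list(x_t) + list(h_prev)

    def affine(gate):
        weights, bias = params[gate]
        return [sum(w * v for w, v in zip(row, joint)) + b
                for row, b in zip(weights, bias)]

    forget = [sigmoid(v) for v in affine("forget")]    # how much of c_{t-1} to keep
    write = [sigmoid(v) for v in affine("input")]      # how much new content to store
    expose = [sigmoid(v) for v in affine("output")]    # how much of the cell to reveal
    candidate = [math.tanh(v) for v in affine("candidate")]
    c_t = [f * c + i * g for f, c, i, g in zip(forget, c_prev, write, candidate)]
    h_t = [o * math.tanh(c) for o, c in zip(expose, c_t)]
    return h_t, c_t

def init_params(rng, d_in, d_h):
    """Small random weights for illustration; a trained model would learn these."""
    def gate():
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(d_in + d_h)]
                   for _ in range(d_h)]
        return weights, [0.0] * d_h
    return {name: gate() for name in ("forget", "input", "output", "candidate")}

rng = random.Random(0)
d_in, d_h = 3, 4
params = init_params(rng, d_in, d_h)
h, c = [0.0] * d_h, [0.0] * d_h
for x_t in ([0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [-0.4, 0.2, 0.2]):
    h, c = lstm_step(x_t, h, c, params)
```

A forget gate close to zero wipes the cell state, which is the mechanism by which the cell can discard stale information after a regime change, while the output $\mathbf{h}_t$ stays inside $(-1,1)^{d_h}$ by construction.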

2.4 Attention

The attention mechanism computes a weighted average of elements where the weights depend on an input query and the elements' keys [22]. It dynamically decides which input elements to focus on, or "attend" to. Specifically, it computes a similarity between the query vector and key vectors. These vectors can either be model inputs, some deep-learning hidden state, or, in the case of our work, a hidden state summarising a sequence. If we have a query vector $\mathbf{q}\in\mathbb{R}^{d_q}$ and we want to compute the relative importance of each key in $K=\{\mathbf{k}_1,\ldots,\mathbf{k}_{|\mathcal{C}|}\}$, we calculate soft weights with a softmax function:

$$p_{\mathbf{q}}(\mathbf{k})=\frac{\alpha(\mathbf{q},\mathbf{k})}{\sum_{\mathbf{k}'\in K}\alpha(\mathbf{q},\mathbf{k}')},\quad \alpha(\mathbf{q},\mathbf{k})=\exp\left(\frac{1}{\sqrt{d_{\mathrm{att}}}}\langle W_q\mathbf{q},W_k\mathbf{k}\rangle\right), \tag{10}$$

with learnable weight matrices $W_{(\cdot)}\in\mathbb{R}^{d_{\mathrm{att}}\times d_{(\cdot)}}$ and attention dimension $d_{\mathrm{att}}$. The inner product $\langle\cdot,\cdot\rangle$ is the mechanism used to compute similarity. The primary benefit of attention is that it provides a direct connection with all of the keys and, furthermore, the resulting weights are interpretable. We then use each weight $p_{\mathbf{q}}(\cdot)$ to scale the values $V=\{\mathbf{v}_1,\ldots,\mathbf{v}_{|\mathcal{C}|}\}$, where each value has a corresponding key, and aggregate as:

$$\mathrm{Att}(\mathbf{q},K,V)=\sum_{j=1}^{|\mathcal{C}|}p_{\mathbf{q}}(\mathbf{k}_j)\,W_v\,\mathbf{v}_j. \tag{11}$$

We elaborate on the specific details of our implementation in Section 3.3. We can generalize equation 11 to multiple parallel attention heads, which increases model capacity and allows the model to capture representations for different patterns and time-scales. The cross-attention mechanism in X-Trend uses 4 heads.
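A single attention head following Eqs. 10 and 11 can be sketched in plain Python as below. The function names and the toy inputs (identity projection matrices, two keys) are assumptions chosen so that the arithmetic is easy to follow; this is not the authors' implementation.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def attention(query, keys, values, W_q, W_k, W_v):
    """Single-head attention: softmax of scaled inner products, then a weighted sum."""
    d_att = len(W_q)  # number of rows of the projection = attention dimension
    projected_query = matvec(W_q, query)
    scores = [
        sum(a * b for a, b in zip(projected_query, matvec(W_k, key))) / math.sqrt(d_att)
        for key in keys
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]  # p_q(k_j) in Eq. 10
    projected_values = [matvec(W_v, value) for value in values]
    output = [
        sum(w * pv[d] for w, pv in zip(weights, projected_values))
        for d in range(len(projected_values[0]))
    ]  # Eq. 11
    return output, weights

# Toy example: the query is aligned with the first key.
identity = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
output, weights = attention(query, keys, values, identity, identity, identity)
```

The returned `weights` are the interpretable quantities mentioned above: they are positive, sum to one, and place more mass on the key most similar to the query.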

3 Cross-Attentive Time-Series Momentum Forecaster

3.1 Sequence Representations and Baseline Neural Forecaster

We want to create sequence summaries, of sequence length $l$, with a learnable function $\Xi:\mathcal{X}^{l}\times\mathcal{S}\to\mathbb{R}^{d_h}$, where $d_h$ is our hidden dimension. Our representation is created to summarise not just input sequences $\mathbf{x}_{-l_t:t}^{(i)}\in\mathcal{X}^{l}$, but also side information $s^{(i)}\in\mathcal{S}$, namely the category (ticker) of the futures contract. We include this category because the dynamics of different contracts can vary significantly; for example, crude oil compared to the 5-year treasury bond. For each time-step and each asset, we input features $\mathbf{x}^{(i)}_t\in\mathcal{X}$, as defined in Eq. 6.

We encode side information $s^{(i)}$ with an entity embedding: a learnable mapping of each category into $\mathbb{R}^{d_h}$, which can automatically learn to group similar assets in the embedding space [34]. We use a feedforward network (FFN) to fuse the entity embeddings and time-series representations. Throughout our paper we use $\mathrm{ELU}$ (Exponential Linear Unit) [35] activations, which are continuously differentiable everywhere and avoid dead neurons (which are completely deactivated). We indicate the ability to optionally include static information via embeddings in blue:

$$\mathrm{FFN}(\mathbf{h}_t{\color{blue},s})=\mathrm{Linear}_3\circ\mathrm{ELU}\left(\mathrm{Linear}_1(\mathbf{h}_t){\color{blue}+\mathrm{Linear}_2(\mathrm{Embedding}(s))}\right), \tag{12a}$$

where $\mathrm{Linear}_{(\cdot)}(\cdot)$ is a learnable linear transformation into $\mathbb{R}^{d_h}$ and $\mathbf{h}_t$ is a vector, which can be either a hidden state or an input vector. Additionally, we use the Variable Selection Network (VSN) [3], which weights the different lagged normalized returns and MACD features.
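The fusion in Eq. 12a can be sketched as follows, assuming an ELU with unit scale and randomly initialized layers. The tickers `"CL"` and `"FV"`, the dictionary layout of `params`, and all helper names are hypothetical and only serve the illustration.

```python
import math
import random

def elu(x):
    """Exponential Linear Unit with alpha = 1 (assumed)."""
    return x if x > 0 else math.expm1(x)

def make_linear(rng, d_out, d_in):
    scale = 1.0 / math.sqrt(d_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(d_in)] for _ in range(d_out)]
    return weights, [0.0] * d_out

def apply_linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def ffn(h_t, ticker, params):
    """Eq. 12a: Linear3(ELU(Linear1(h_t) + Linear2(Embedding(s))))."""
    pre_activation = apply_linear(params["lin1"], h_t)
    if ticker is not None:  # the static information is optional
        embedded = apply_linear(params["lin2"], params["embedding"][ticker])
        pre_activation = [p + e for p, e in zip(pre_activation, embedded)]
    return apply_linear(params["lin3"], [elu(p) for p in pre_activation])

rng = random.Random(0)
d_h = 4
params = {
    "lin1": make_linear(rng, d_h, d_h),
    "lin2": make_linear(rng, d_h, d_h),
    "lin3": make_linear(rng, d_h, d_h),
    # Hypothetical tickers; each maps to a learnable vector in R^{d_h}.
    "embedding": {t: [rng.gauss(0.0, 1.0) for _ in range(d_h)] for t in ("CL", "FV")},
}
h_t = [0.1, -0.3, 0.7, 0.0]
out_cl = ffn(h_t, "CL", params)
out_fv = ffn(h_t, "FV", params)
out_plain = ffn(h_t, None, params)
```

The same hidden vector yields different outputs for different tickers, which is how the embedding lets a shared network specialize per contract.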

We associate a learnable nonlinear function $\mathrm{FFN}_j:\mathbb{R}\to\mathbb{R}^{d_h}$ with the $j$-th element of $\mathbf{x}_t$, and scale by the associated learnable weight $w_{t,j}$ of $\mathbf{w}_t\in\mathbb{R}^{|\mathcal{X}|}$:

\begin{align}
\mathbf{w}_t &= \mathrm{Softmax}\circ\mathrm{FFN}(\mathbf{x}_t,s) \tag{13a}\\
\mathrm{VSN}(\mathbf{x}_t) &= \sum_{j=1}^{|\mathcal{X}|} w_{t,j}\,\mathrm{FFN}_j(x_{t,j}), \tag{13b}
\end{align}

where the $j$-th element of the softmax is defined as $\mathrm{Softmax}(\mathbf{x})_j=e^{x_j}/\sum_{k=1}^{|\mathcal{X}|}e^{x_k}$.
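Eq. 13 can be illustrated with the simplified sketch below. As assumptions for brevity, the scoring network is reduced to a single linear layer that ignores the side information $s$, each $\mathrm{FFN}_j$ is a one-hidden-layer network, and three features are used; none of these choices come from the original implementation.

```python
import math
import random

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def elu(x):
    return x if x > 0 else math.expm1(x)

def feature_ffn(x_j, p):
    """FFN_j: R -> R^{d_h}, applied to one scalar feature."""
    hidden = [elu(w * x_j + b) for w, b in zip(p["w_in"], p["b_in"])]
    return [sum(w * h for w, h in zip(row, hidden)) for row in p["W_out"]]

def vsn(x_t, score_weights, feature_params):
    """Eq. 13: softmax feature weights times per-feature embeddings, summed over features."""
    scores = [sum(w * x for w, x in zip(row, x_t)) for row in score_weights]
    w_t = softmax(scores)  # Eq. 13a (simplified scoring network)
    d_h = len(feature_params[0]["W_out"])
    combined = [0.0] * d_h
    for weight, x_j, p in zip(w_t, x_t, feature_params):
        for d, value in enumerate(feature_ffn(x_j, p)):
            combined[d] += weight * value  # Eq. 13b
    return combined, w_t

rng = random.Random(0)
n_features, d_h = 3, 4
score_weights = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_features)]
feature_params = [
    {
        "w_in": [rng.uniform(-1, 1) for _ in range(d_h)],
        "b_in": [0.0] * d_h,
        "W_out": [[rng.uniform(-1, 1) for _ in range(d_h)] for _ in range(d_h)],
    }
    for _ in range(n_features)
]
x_t = [0.4, -1.1, 0.2]  # e.g. normalized returns or MACD values (illustrative)
x_prime, w_t = vsn(x_t, score_weights, feature_params)
```

Inspecting `w_t` shows which input features the network emphasizes at a given time-step.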

Our model consists of both an encoder and a decoder. It is important to note that in the decoder we also include the output of the encoder, $\mathbf{y}^{(i)}_{-l_t:t}\in\mathcal{Y}^{l_t}$, as an input, where $\mathcal{Y}\subseteq\mathbb{R}^{d_h}$ (Section 3.4). Our encoder sequence representations, $\Xi(\cdot,\cdot)$, can be summarised as:

\begin{align}
\mathbf{x}_t' &= \mathrm{VSN}(\mathbf{x}_t,s), \tag{14a}\\
(\mathbf{h}_t,\mathbf{c}_t) &= \mathrm{LSTM}(\mathbf{x}_t',\mathbf{h}_{t-1},\mathbf{c}_{t-1}), \tag{14b}\\
\mathbf{a}_t &= \mathrm{LayerNorm}(\mathbf{h}_t+\mathbf{x}_t') \tag{14c}\\
\Xi(\mathbf{x}_{-l:t},s) &= \mathrm{LayerNorm}\left(\mathrm{FFN}_2(\mathbf{a}_t,s)+\mathbf{a}_t\right). \tag{14d}
\end{align}

The LSTM initial state is learnable and specific to each contract, setting $(\mathbf{h}_0,\mathbf{c}_0)=(\mathrm{FFN}_3\circ\mathrm{Embedding}(s),\mathrm{FFN}_4\circ\mathrm{Embedding}(s))$. The skip connections, in equation 19c and equation 19d, allow for the respective components to be suppressed, enabling the model to be as complex as necessary. See Fig. 4 for an overview of our model.
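The role of the skip connections can be seen in a small sketch of the pattern $\mathrm{LayerNorm}(f(\mathbf{a}) + \mathbf{a})$. Here `layer_norm` omits the learnable gain and bias, and the two stand-in sublayers are hypothetical; the point is only that a sublayer producing zeros leaves the normalized input unchanged.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

def residual_block(x, sublayer):
    """LayerNorm(sublayer(x) + x): the skip connection carries x past the sublayer."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(y, x)])

a_t = [0.2, -0.5, 1.0, 0.3]

# A sublayer whose output is zero is effectively switched off.
suppressed = residual_block(a_t, lambda v: [0.0] * len(v))

# A stand-in for an active sublayer (illustrative only).
active = residual_block(a_t, lambda v: [2.0 * u for u in reversed(v)])
```

When the sublayer contributes nothing, the block reduces to `layer_norm(a_t)`, so training can suppress a component simply by driving its output towards zero.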

The baseline neural forecaster to which we compare uses $\Xi(\cdot,\cdot)$ as the model, i.e. as $g(\cdot)$ from equation 7. The baseline architecture consists of only the decoder. The X-Trend model adds an encoder and the cross-attentive steps.

3.2 Context Set Construction

When backtesting, we randomly sample a context set of $|\mathcal{C}|$ sequences from the past to enforce causality at the time of prediction for our target problem. (Footnote 4: During training we do not enforce causality when we construct our context set, as the aim is to teach the model to best transfer patterns; however, the training set only contains information prior to the test date to ensure our model is causal at test time.) We explore three different approaches to constructing our context sequences and choosing the point $t_c$ on which we condition, illustrated in Fig. 2:

  1. Final hidden state and random sequences of fixed length. We sample random sequences of fixed length $l_c$ across time and assets, and we condition on the final hidden state of each sequence, which summarises that sequence.

  2. Time-equivalent hidden state. We sample random context sequences that are the same length as the target sequence: $l_c=l_t$. For each time-step in the target sequence, we condition on the time-equivalent hidden states, i.e. the $k$-th target step attends to the $k$-th context steps. This allows the model to incorporate additional adjacent regimes by conditioning on different representations at each time-step of the target sequence.

  3. Change-point detection (CPD) segmented representations. We use a Gaussian Process change-point detection algorithm, detailed in Appendix A, to segment the context set into regimes. An example of this segmentation for the British Pound can be seen in Fig. 3. We randomly sample $|\mathcal{C}|$ change-point segmented sequences and condition on the final hidden states of these change-point time-series segments. We limit segments to a maximum sequence length. We test both 21-day (1-month) and 63-day (3-month) maximum lengths, using a higher CPD severity threshold for the 63-day version and calibrating the change-point threshold in both cases such that the average sequence length is approximately half of the maximum length.
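The first approach, sampling fixed-length windows from the past, can be sketched as below. This assumes all assets share a common integer time index and that windows are half-open intervals `[start, end)`; the function name `sample_context_set`, the asset names, and the sizes are hypothetical.

```python
import random

def sample_context_set(series_lengths, l_c, size, t_max, rng):
    """Sample `size` distinct (asset, start, end) windows of length l_c with end <= t_max.

    Restricting `end` to at most `t_max` keeps every context window in the past
    of a target sequence that starts at index `t_max`.
    """
    candidates = []
    for asset, n_obs in series_lengths.items():
        last_end = min(n_obs, t_max)
        for start in range(0, last_end - l_c + 1):
            candidates.append((asset, start, start + l_c))
    if len(candidates) < size:
        raise ValueError("not enough past windows to build the context set")
    return rng.sample(candidates, size)

rng = random.Random(0)
series_lengths = {"asset_a": 500, "asset_b": 300}  # number of observations per asset
context = sample_context_set(series_lengths, l_c=20, size=5, t_max=250, rng=rng)
```

Enumerating every candidate window is wasteful for long histories, but it keeps the causality constraint explicit; a production sampler would draw the asset and start index directly.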

Figure 2: An illustration of the different ways the target is able to attend to the context set. F: every hidden state in the target sequence is able to attend to the final hidden states of the contexts. T: the time-equivalent hidden state in the target is able to attend to the corresponding hidden state in the contexts. C: every hidden state in the target sequence is able to attend to the final hidden state in the change-point segmented contexts. The dark arrows illustrate the context time-steps to which the 4th target time-step attends.
Figure 3: A time-series segmented with change-point detection to create sequences for the context set. Different colours are different regimes. This example shows the British Pound Sterling continuous, ratio-adjusted, futures contract. Here, for illustrative purposes, regimes are segmented with a change-point threshold of $L_C/(L_M+L_C)\geq 0.99$, where $L_M$ is the likelihood of fitting a Gaussian Process characterized by a Matérn 3/2 kernel, and $L_C$ is the likelihood of another characterized by a Change-point kernel. Details of this procedure can be found in Appendix A.

3.3 Cross Attention

In order for our predictions to leverage a context set we use a cross-attention mechanism between our target sequence and a context set of sequences. This allows our target to attend to different sequences from different assets across time and across different regimes. The rationale is that this context set contains a much broader range of patterns in comparison to the near-term history of the same asset, which was successfully leveraged with attention by [19] (Section 6).

We create the set of key vectors, $K_t$, and the set of value vectors, $V_t$:

$$K_t=\left\{\Xi_{\text{key}}(\mathbf{x}^{(c)}_{-l_c:t_c},s^{(c)})\right\}_{c=1}^{|\mathcal{C}|},\quad V_t=\left\{\Xi_{\text{value}}(\bm{\xi}^{(c)}_{-l_c:t_c},s^{(c)})\right\}_{c=1}^{|\mathcal{C}|}, \tag{15}$$

where $\mathbf{x}^{(c)}_{-l_c:t_c}$ denotes the context sequence without the target returns. For a given query, we calculate attention weights for each key (see Eq. 10). Then, adapting Eq. 11, we weight the value vectors by the attention weights and expand to multiple concurrent heads to increase the representational space. We leverage the context set via the following steps:

\begin{align}
\mathbf{q}_t &= \Xi_{\mathrm{query}}(\mathbf{x}_{-l_t:t}^{(i)},s^{(i)}) && \text{(query representation)} \tag{16}\\
V_t' &= \mathrm{FFN}_1\circ\mathrm{Att}_1(V_t,V_t,V_t) && \text{(self-attention)} \tag{17}\\
\mathbf{y}^{(i)}_t &= \mathrm{LayerNorm}\circ\mathrm{FFN}_2\circ\mathrm{Att}_2(\mathbf{q}_t,K_t,V_t'). && \text{(cross-attention)} \tag{18}
\end{align}

We illustrate this in Fig. 4. The self-attention step, which outputs the updated set of values $V_t'$, helps to identify similarities between regimes within the context set.
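The two attention steps in Eqs. 17 and 18 can be sketched with a single head and identity projections. As simplifying assumptions, the FFN and LayerNorm wrappers are omitted, queries, keys and values share one dimension, and the toy vectors are invented for illustration.

```python
import math

def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention with identity projections (W_q = W_k = W_v = I)."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(a * b for a, b in zip(query, k)) / scale for k in keys])
    output = [sum(w * v[d] for w, v in zip(weights, values))
              for d in range(len(values[0]))]
    return output, weights

def cross_attend(q_t, K_t, V_t):
    # Eq. 17 (self-attention): every value attends over the whole value set.
    V_prime = [attend(v, V_t, V_t)[0] for v in V_t]
    # Eq. 18 (cross-attention): the target query attends over the context keys.
    return attend(q_t, K_t, V_prime)

# Three context sequences, summarized by 2-dimensional keys and values.
K_t = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
V_t = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
q_t = [4.0, 0.0]  # the target resembles the first context sequence
y_t, context_weights = cross_attend(q_t, K_t, V_t)
```

In this toy case almost all attention mass falls on the first context sequence, so `context_weights` directly indicates which past pattern drives the representation passed to the decoder.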

Figure 4: Encoder and decoder of the X-Trend-G model. The FFN, VSN, Self-Attention and Cross-Attention components are all applied element-wise to each time-step. Sequences in the context set are mapped to representations via $\Xi_{\text{key}}(\cdot,\cdot)$ and $\Xi_{\text{value}}(\cdot,\cdot)$. For the key inputs we exclude next-day returns and use $\mathbf{x}_{-l_c:t_c}^{(c)}$ instead of $\bm{\xi}_{-l_c:t_c}^{(c)}$. Contexts are then passed to the cross-attention as keys and values, with a representation of the target sequence $\mathbf{x}_{t'}^{(i)}$, for which we want to make forecasts, as the query. It should be noted that we have a separate instance of keys $K_{t'}$ and values $V_{t'}$ for the query $q_{t'}$ at each time-step $t'\in(t-l_t+1:t)$, which we detail in Fig. 2. The decoder then takes the target sequence and the output representation from the encoder, $\mathbf{y}_{t'}^{(i)}$. It outputs a position for the Sharpe stream and a forecast stream, which we label $(\mu_{t'},\sigma_{t'})$, for the maximum likelihood. Side information $s^{(i)}$ regarding the target asset is also passed as input to the decoder for few-shot learning only, not zero-shot learning. If we are not using the joint loss function, we instead output for the Sharpe stream after the second-to-last FFN.

3.4 Decoder and Loss Function

Similarly to our sequence representations in the encoder, Eq. 14, we summarize our target sequence in the decoder. This time our sequence representation, $\Xi_{{\color{magenta}\mathrm{Dec}}}: \mathcal{X}^{l_t} \times \mathcal{S}~{\color{magenta}\times~\mathcal{C}} \to \mathbb{R}^{d_h}$, fuses the output of the encoder, $\mathbf{y}^{(i)}_{-l_t:t}$. Highlighting the additional components in magenta, our decoder representation is computed as:

\begin{align}
\mathbf{x}_t' &= {\color{magenta}\mathrm{LayerNorm} \circ \mathrm{FFN}_1 \circ \mathrm{Concat}(}\mathrm{VSN}(\mathbf{x}^{(i)}_t, s^{(i)}){\color{magenta}, \mathbf{y}^{(i)}_t)}, \tag{19a} \\
(\mathbf{h}_t, \mathbf{c}_t) &= \mathrm{LSTM}(\mathbf{x}_t', \mathbf{h}_{t-1}, \mathbf{c}_{t-1}), \tag{19b} \\
\mathbf{a}_t &= \mathrm{LayerNorm}(\mathbf{h}_t + \mathbf{x}_t') \tag{19c} \\
\Xi_{{\color{magenta}\mathrm{Dec}}}(\mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}{\color{magenta}, \mathbf{y}^{(i)}_{-l_t:t}}) &= \mathrm{LayerNorm}\left(\mathrm{FFN}_2(\mathbf{a}_t, s^{(i)}) + \mathbf{a}_t\right). \tag{19d}
\end{align}

We want to model the next-day volatility-scaled return. This is a regression task where we parameterize the predictive mean via $\mu: \mathcal{X}^{l_t} \times \mathcal{S} \times \mathcal{C} \to \mathbb{R}$ and volatility via $\sigma: \mathcal{X}^{l_t} \times \mathcal{S} \times \mathcal{C} \to \mathbb{R}^{+}$; the likelihood is $p\left(\frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}} r_{t+1}^{(i)} \mid \mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}, \mathcal{C}\right) = \mathcal{N}\left(\frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}}~r_{t+1}^{(i)}; \mu(\mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}, \mathcal{C}), \sigma(\mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}, \mathcal{C})\right)$. Thus our loss function is the negative log-likelihood under a Gaussian distribution of future returns, which we minimize:

\begin{equation}
\mathcal{L}_{\mathrm{MLE}}(\theta) = -\frac{1}{|\Omega|} \sum_{(t,i) \in \Omega} \log p\left(\frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}}~r_{t+1}^{(i)} \,\middle|\, \mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}, \mathcal{C}\right) \tag{20}
\end{equation}
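
As a minimal sketch of Eq. 20, assuming the network heads have already produced a predictive mean and a predictive standard deviation for every pair $(t, i)$, the loss is the average Gaussian negative log-likelihood of the volatility-scaled next-day returns. The function and argument names are illustrative, and treating $\sigma$ as a standard deviation (rather than a variance) is an assumption.

```python
import math


def gaussian_nll(scaled_returns, mus, sigmas):
    """Mean negative log-likelihood of scaled returns under N(mu, sigma^2).

    scaled_returns: volatility-scaled next-day returns, one per (t, i) pair.
    mus, sigmas:    predictive means and standard deviations (assumed given).
    """
    if not (len(scaled_returns) == len(mus) == len(sigmas)):
        raise ValueError("inputs must have equal length")
    if not scaled_returns:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for r, mu, sigma in zip(scaled_returns, mus, sigmas):
        if sigma <= 0.0:
            raise ValueError("sigma must be positive")
        # -log N(r; mu, sigma^2)
        #   = 0.5 * log(2 * pi) + log(sigma) + 0.5 * ((r - mu) / sigma)^2
        z = (r - mu) / sigma
        total += 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5 * z * z
    return total / len(scaled_returns)
```

A forecast whose mean is closer to the realized return, at the same predictive volatility, yields a lower loss.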

It is important to note that the Gaussianity of returns in financial time-series is a convenient approximation, which in practice does not always hold, especially in the tails. However, since we are focused on optimizing the Sharpe ratio in this paper\footnote{Due to the non-Gaussianity of returns observed in practice, the Sharpe ratio often may not be the best metric for measuring risk-adjusted returns.}, predictive mean and volatility estimates of next-day return outputs are very useful since Sharpe is dependent on realized returns and volatility.

Rather than directly optimizing Sharpe with a DMN, we propose to jointly optimize the likelihood for next-day predictions and the Sharpe. The Sharpe loss requires an additional neural network head, the Predictive distribution (mean and standard deviation) To Position (PTP) head, $\mathrm{PTP}_{\mathrm{G}}: \mathbb{R}^{2} \to (-1,1)$. This is a single FFN followed by a $\tanh$ activation. Our joint loss function is:

\begin{equation}
\mathcal{L}_{\mathrm{Joint}}^{\mathrm{MLE}}(\theta) = \alpha~\mathcal{L}_{\mathrm{MLE}}(\theta) + \mathcal{L}_{\mathrm{Sharpe}}^{\mathrm{PTP_G}(\cdot)}(\theta), \tag{21}
\end{equation}

where $\mathcal{L}_{\mathrm{Sharpe}}^{\mathrm{PTP_G}(\cdot)}(\theta)$ is Eq. 8 applied to the PTP outputs and $\alpha$ is a tunable hyperparameter to balance the two loss functions.
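
The structure of Eq. 21 can be sketched as follows. This is an illustration under stated assumptions, not the implementation: the PTP head is reduced to a single affine map followed by $\tanh$ with hypothetical weights `w_mu`, `w_sigma` and `bias`, and the Sharpe loss is assumed to be the negative annualised Sharpe ratio of the captured returns (position times scaled return), computed with a population standard deviation and 252 trading days per year.

```python
import math


def ptp_gaussian(mu, sigma, w_mu, w_sigma, bias):
    """Hypothetical one-layer PTP head: affine map of (mu, sigma), then tanh."""
    return math.tanh(w_mu * mu + w_sigma * sigma + bias)


def sharpe_loss(positions, scaled_returns, periods_per_year=252):
    """Negative annualised Sharpe ratio of the captured returns (assumed form)."""
    captured = [z * r for z, r in zip(positions, scaled_returns)]
    n = len(captured)
    mean = sum(captured) / n
    variance = sum((c - mean) ** 2 for c in captured) / n
    if variance == 0.0:
        raise ValueError("captured returns have zero variance")
    return -math.sqrt(periods_per_year) * mean / math.sqrt(variance)


def joint_loss(likelihood_loss, sharpe_loss_value, alpha):
    """Weighted sum of the likelihood loss and the Sharpe loss, as in Eq. 21."""
    return alpha * likelihood_loss + sharpe_loss_value
```

Because the Sharpe term enters with weight one, $\alpha$ alone controls how strongly the forecasting objective regularizes the position-taking objective.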

As an alternative to assuming Gaussianity of returns, we can instead perform Quantile Regression (QRE). The aim of QRE is to learn the full probability distribution of next-day returns. QRE has proven successful for multi-step time-series regression [36]. For quantiles $\eta \in \mathcal{H}$, pairs $(t,i) \in \Omega$ and target $\tilde{r}^{(i)}_{t+1} = \frac{\sigma_{\mathrm{tgt}}}{\sigma_t^{(i)}} r_{t+1}^{(i)}$, our QRE loss function is:

\begin{equation}
\mathcal{L}_{\text{QRE}}(\theta) = \frac{1}{|\Omega \times \mathcal{H}|} \sum_{\Omega \times \mathcal{H}} \left[\eta~\left(\tilde{r}^{(i)}_{t+1} - Q_{\eta}(\mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}, \mathcal{C})\right)_{+} + \left(1 - \eta\right)\left(Q_{\eta}(\mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}, \mathcal{C}) - \tilde{r}^{(i)}_{t+1}\right)_{+}\right], \tag{22}
\end{equation}

where $(\cdot)_{+} = \max(0, \cdot)$ and $Q: \mathcal{X}^{l_t} \times \mathcal{S} \times \mathcal{C} \to \mathbb{R}^{|\mathcal{H}|}$ is a neural network head that replaces $\mu(\cdot,\cdot,\cdot)$ and $\sigma(\cdot,\cdot,\cdot)$. Our set of quantiles, $\mathcal{H} = \{0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99\}$, includes quantiles in the left and right tails, the market movements which typically have the largest impact on our strategy's risk-adjusted returns. We define the joint QRE loss function, $\mathcal{L}_{\mathrm{Joint}}^{\mathrm{QRE}}(\cdot)$, as:

\begin{equation}
\mathcal{L}_{\mathrm{Joint}}^{\mathrm{QRE}}(\theta) = \alpha~\mathcal{L}_{\mathrm{QRE}}(\theta) + \mathcal{L}_{\mathrm{Sharpe}}^{\mathrm{PTP_Q}(\cdot)}(\theta), \tag{23}
\end{equation}

with the PTP for the Sharpe loss again a FFN, $\mathrm{PTP}_{\mathrm{Q}}: \mathbb{R}^{|\mathcal{H}|} \to (-1,1)$.
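
The QRE loss in Eq. 22 is the quantile (pinball) loss averaged over all pairs and quantile levels. A minimal sketch, assuming the quantile head has already produced one prediction per level in $\mathcal{H}$ for each target; the function names are illustrative.

```python
# Quantile levels from the text, including the left and right tails.
QUANTILES = [0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]


def pinball_loss(target, quantile_preds, quantiles=QUANTILES):
    """Average pinball loss of one scaled return over all quantile levels."""
    if len(quantile_preds) != len(quantiles):
        raise ValueError("one prediction is required per quantile level")
    total = 0.0
    for eta, q in zip(quantiles, quantile_preds):
        # Under-prediction is weighted by eta, over-prediction by (1 - eta).
        total += eta * max(0.0, target - q) + (1.0 - eta) * max(0.0, q - target)
    return total / len(quantiles)


def qre_loss(targets, preds, quantiles=QUANTILES):
    """Mean pinball loss over all targets and quantile levels (cf. Eq. 22)."""
    losses = [pinball_loss(r, q, quantiles) for r, q in zip(targets, preds)]
    return sum(losses) / len(losses)
```

For a high quantile such as $\eta = 0.99$, predicting below the realized return is penalized 99 times more heavily than predicting above it, which is what pushes the head towards the tail of the distribution.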

Utilizing the outputs of our encoder, $\mathbf{y}^{(i)}_{-l_t:t}$, our decoder outputs a trading signal as follows:

\begin{equation}
z_t = \mathrm{PTP} \circ \Xi_{\mathrm{Dec}}(\mathbf{x}^{(i)}_{-l_t:t}, s^{(i)}, \mathbf{y}^{(i)}_{-l_t:t}). \tag{24}
\end{equation}

We refer to the Gaussian MLE (Maximum Likelihood Estimation) variant of the architecture as X-Trend-G, the QRE variant as X-Trend-Q and the Sharpe loss variant as X-Trend.

It is important to note that in the zero-shot setting we exclude the ticker-type embedding of $s^{(i)}$ for the target sequence. This is because we are trading a previously unseen asset and we have not yet seen any contract-specific dynamics. We do however still include this information in the context set, where the aim is to quickly identify similarities with previously seen contracts.

4 Experiments

We first study few-shot learning on a toy dataset of Gaussian Process (GP) draws [37]. We implemented the recurrent attentive neural process [38] for this toy dataset and ablated individual components, finding that the cross-attention mechanism is key to obtaining low-error few-shot forecasts. This result motivates using cross-attention for momentum. See Appendix B for further details.

We backtest our X-Trend variants on a portfolio of 50 of the most liquid continuous futures contracts\footnote{The futures contracts are chained together using the backwards ratio-adjusted method.}, $\mathcal{I}$, over the period from 1990 to 2023, extracted from the Pinnacle Data Corp CLC Database\footnote{https://pinnacledata2.com/clc.html}. The futures contracts we have selected are amongst the most liquid and are typical for backtesting TSMOM strategies [5, 30, 3, 19]. To test out-of-sample across the entire history, we use an expanding window approach: we initially train on 1990 to 1995 and test out-of-sample on the period from 1995 to 2000, then expand the training window to 1990 to 2000 and test out-of-sample on the subsequent 5 years, and so on. We take particular note of performance over the 2020 to 2022 period, which covers the COVID-19 crisis and exhibits dynamics that are significantly different from those of the training set.
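
The expanding window scheme can be sketched as a generator of train and test year ranges. This is a minimal illustration: the function name is hypothetical, and truncating the final test window at the end of the data (2023) is an assumption.

```python
def expanding_windows(start_year, end_year, first_train_years=5, test_years=5):
    """Return a list of (train_start, train_end, test_start, test_end) tuples.

    The training window always starts at start_year and grows by one test
    block per step; each test block directly follows its training window.
    """
    windows = []
    train_end = start_year + first_train_years
    while train_end < end_year:
        # Assumption: the last test block is cut off at the end of the data.
        test_end = min(train_end + test_years, end_year)
        windows.append((start_year, train_end, train_end, test_end))
        train_end = test_end
    return windows


if __name__ == "__main__":
    for train_start, train_end, test_start, test_end in expanding_windows(1990, 2023):
        print(f"train {train_start}-{train_end}, test {test_start}-{test_end}")
```

Each model is therefore only ever evaluated on years that come strictly after its training data.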

For our zero-shot experiments, we randomly selected 20 of the 50 Pinnacle dataset assets as the test set, $\mathcal{I}_{ts}$, leaving the other 30, $\mathcal{I}_{tr}$, for the context set and training. Despite the fact these futures contracts had historical data available in reality, we constructed this experiment as an artificial zero-shot setting to provide insight into transferability to unseen assets. Furthermore, this low-resource setting is particularly challenging because $\mathcal{I}_{tr}$ is also small at only 30 contracts. The dataset is described in further detail in Appendix D. The details of training the neural networks are provided in Appendix C.

5 Results

Figure 5: Few-shot setting cumulative strategy returns (left) and drawdown plot (right), averaged across 10 full repeats, with an additional portfolio volatility re-scaling step to 15% volatility. We only plot drawdown of the Base Learner and X-Trend-Q, the primary comparison, to reduce clutter.
Figure 6: Zero-shot setting cumulative strategy returns (left) and drawdown plot (right), averaged across 10 full repeats, with an additional portfolio volatility re-scaling step to 15% volatility.
\begin{tabular}{llcccccc}
\hline
 & & Context Time-step & & & \multicolumn{3}{c}{Average Annual Sharpe Ratio} \\
Method & Loss & Final/Time/CPD & $|\mathcal{C}|$ & $l_c$ & 2018--2023 & 2013--2023 & 1995--2023 \\
\hline
Reference & Long & & & & 0.48 & 0.40 & 0.60 \\
 & TSMOM & & & & 0.23 & 0.71 & 1.01 \\
 & MACD & & & & 0.27 & 0.45 & 0.71 \\
\hline
Baseline & Sharpe & & & & 2.27 & 1.93 & 2.91 \\
 & J-Gauss & & & & 2.43 (+7.2\%) & 2.06 (+7.0\%) & 3.04 (+4.5\%) \\
 & J-QRE & & & & 2.26 (-0.5\%) & 1.96 (+1.6\%) & 2.89 (-0.9\%) \\
\hline
X-Trend & Sharpe & T & 10 & 126 & 2.28 (+0.2\%) & 1.97 (+2.4\%) & 2.93 (+0.5\%) \\
 & Sharpe & F & 10 & 21 & 2.35 (+3.4\%) & 1.99 (+3.2\%) & 3.01 (+3.4\%) \\
 & Sharpe & F & 20 & 21 & 2.38 (+4.9\%) & 1.99 (+3.4\%) & 3.02 (+3.6\%) \\
 & Sharpe & F & 30 & 21 & 2.25 (-0.9\%) & 1.99 (+3.1\%) & 3.03 (+4.0\%) \\
 & Sharpe & F & 10 & 63 & 2.31 (+1.7\%) & 1.97 (+2.3\%) & 3.11 (+6.9\%) \\
 & Sharpe & C & 10 & 21 & 2.30 (+1.1\%) & 2.02 (+4.7\%) & 3.04 (+4.5\%) \\
 & Sharpe & C & 20 & 21 & 2.65 (+16.9\%) & 2.17 (+12.5\%) & 3.17 (+8.8\%) \\
 & Sharpe & C & 30 & 21 & 2.38 (+5.0\%) & 2.02 (+4.6\%) & 3.08 (+5.8\%) \\
 & Sharpe & C & 10 & 63** & 2.50 (+10.1\%) & 2.14 (+11.2\%) & 3.11 (+6.8\%) \\
\hline
X-Trend-G & J-Gauss & T & 10 & 126 & 2.47 (+8.8\%) & 2.14 (+11.1\%) & 3.16 (+8.5\%) \\
 & J-Gauss & F & 10 & 21 & 2.25 (-1.0\%) & 1.90 (-1.3\%) & 3.10 (+6.5\%) \\
 & J-Gauss & F & 20 & 21 & 2.52 (+11.0\%) & 2.10 (+8.8\%) & 3.05 (+4.8\%) \\
 & J-Gauss & F & 30 & 21 & 2.26 (-0.5\%) & 2.08 (+7.9\%) & 3.04 (+4.2\%) \\
 & J-Gauss & F & 10 & 63 & 2.42 (+6.5\%) & 2.07 (+7.4\%) & 3.17 (+8.8\%) \\
 & J-Gauss & C & 10 & 21 & 2.42 (+6.8\%) & 2.09 (+8.1\%) & 3.06 (+5.1\%) \\
 & J-Gauss & C & 20 & 21 & 2.42 (+6.5\%) & 2.03 (+5.2\%) & 3.11 (+6.8\%) \\
 & J-Gauss & C & 30 & 21 & 2.51 (+10.4\%) & 2.12 (+9.8\%) & 3.07 (+5.4\%) \\
 & J-Gauss & C & 10 & 63** & 2.32 (+2.1\%) & 2.00 (+3.6\%) & 3.18 (+9.2\%) \\
\hline
X-Trend-Q & J-QRE & T & 10 & 126 & 2.53 (+11.6\%) & 2.12 (+10.1\%) & 3.08 (+5.9\%) \\
 & J-QRE & F & 10 & 21 & 2.38 (+4.9\%) & 1.98 (+2.5\%) & 3.07 (+5.3\%) \\
 & J-QRE & F & 20 & 21 & 2.21 (-2.8\%) & 1.86 (-3.4\%) & 2.94 (+0.8\%) \\
 & J-QRE & F & 30 & 21 & 2.42 (+6.6\%) & 2.05 (+6.5\%) & 3.03 (+4.2\%) \\
 & J-QRE & F & 10 & 63 & 2.49 (+9.6\%) & 2.04 (+5.8\%) & 3.06 (+5.1\%) \\
 & J-QRE & C & 10 & 21 & 2.26 (-0.5\%) & 2.03 (+5.3\%) & 3.07 (+5.3\%) \\
 & J-QRE & C & 20 & 21 & 2.53 (+11.6\%) & 2.08 (+7.7\%) & 3.07 (+5.5\%) \\
 & J-QRE & C & 30 & 21 & 2.40 (+5.6\%) & 2.02 (+4.5\%) & 3.01 (+3.2\%) \\
 & J-QRE & C & 10 & 63** & 2.70 (+18.9\%) & 2.14 (+10.9\%) & 3.11 (+6.8\%) \\
\hline
\end{tabular}
Table 1: Few-shot learning strategy results, backtested out-of-sample on a portfolio of 50 liquid continuous futures contracts, averaged over ten seeds. We provide an ablation of the X-Trend architectural innovations and provide results for different hyperparameters. It should be noted that $l_c$ is the maximum context length for the CPD segmented version. **We use a higher CPD threshold, such that our context ‘regimes’ are longer.
\begin{tabular}{llcccccc}
\hline
 & & Context Time-step & & & \multicolumn{3}{c}{Average Annual Sharpe Ratio} \\
Method & Loss & Final/Time/CPD & $|\mathcal{C}|$ & $l_c$ & 2018--2023 & 2013--2023 & 1995--2023 \\
\hline
Reference & Long & & & & 0.28 & 0.02 & 0.28 \\
 & TSMOM & & & & -0.26 & 0.05 & 0.61 \\
 & MACD & & & & -0.14 & 0.11 & 0.32 \\
\hline
Baseline & Sharpe & & & & -0.11 & 0.02 & 1.00 \\
 & J-Gauss & & & & 0.14 & 0.19 & 1.25 \\
 & J-QRE & & & & 0.16 & 0.10 & 1.26 \\
\hline
X-Trend & Sharpe & C & 20 & 21 & 0.13 & 0.18 & 1.17 \\
 & J-Gauss & C & 20 & 21 & $\mathbf{0.47}$ & $\mathbf{0.47}$ & $\mathbf{1.44}$ \\
 & J-QRE & C & 10 & 63** & 0.12 & 0.18 & 1.27 \\
\hline
\end{tabular}
Table 2: Zero-shot learning strategy results, backtested on a portfolio of 20 liquid continuous futures contracts, averaged over ten seeds. We provide an ablation of the X-Trend architectural innovations and provide results for different hyperparameters. It should be noted that $l_c$ is the maximum context length for the CPD segmented version. **We use a higher CPD threshold, such that our context ‘regimes’ are longer.

5.1 Few-shot Setting

We plot the few-shot strategy returns over the past 10 years (2013 to 2023) in Fig. 5, which is where we start to see larger gains when backtesting our X-Trend strategy, in comparison to the baseline. In particular, we draw attention to the results over the past five years (2018 to 2023), which is an extremely interesting period for few-shot learning because it exhibits significant market turbulence, previously unseen market dynamics, and numerous regime shifts. It includes the bull market of 2018/2019, followed by the COVID-19 pandemic in 2020/2021 and then the beginning of the Russia-Ukraine war in 2022. During this 5-year period X-Trend improves upon the Sharpe of the baseline strategy by 16.9% and X-Trend-Q improves upon the baseline by 18.9%. It is evident that the cross-attention step is the primary driver of the improvement in risk-adjusted returns. We observe that X-Trend-Q and X-Trend outperform X-Trend-G (Table 1), which suggests that the model is able to learn and benefit from a more complex returns distribution, compared to assuming Gaussian returns. We argue that this is likely because we explicitly force the model to pay attention to large movements by performing QRE on quantiles in the left (and right) tail of the returns distribution.

The few-shot results in Table 1 show the impact of changing the context set size, context sequence (maximum) length, and different methodologies for selecting the context hidden state to attend to. Notably, for X-Trend, we demonstrate that by segmenting the context set with CPD, we improve Sharpe by a further 11.3%, showing the benefit of constructing context sets with regimes. In effect, we are constructing a context set with only the most informative observations.

In Fig. 5 we plot the strategy drawdown of the X-Trend-Q strategy compared to our baseline learner over the period 2018 to 2023, focusing on the COVID-19 drawdown, the most significant “momentum crash” or drawdown in the TSMOM strategies over the entire 33 years of historical data. Noting that the drawdown begins on 29 Jan 2020 for both strategies, we observe a maximum drawdown (MDD) of 26% for X-Trend-Q compared to an MDD of 34% for the baseline. Furthermore, the drawdown ends after 162 trading days for X-Trend-Q compared to 254 days for the baseline, which is an entire year and almost twice as long as for X-Trend-Q. This quick recovery demonstrates the ability of our agent to adapt to new regimes.
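
The two drawdown statistics quoted above can be computed from a series of daily strategy returns as in the following sketch. It assumes compounded returns and measures drawdown length as the number of consecutive days spent below the previous peak; the conventions behind Fig. 5 may differ, and the function name is illustrative.

```python
def drawdown_stats(returns):
    """Return (max_drawdown, longest_drawdown_days) for simple daily returns.

    max_drawdown is the largest fractional fall from a running peak of the
    compounded wealth curve; longest_drawdown_days counts the longest run of
    consecutive days on which wealth is below its previous peak.
    """
    wealth = 1.0
    peak = 1.0
    max_drawdown = 0.0
    days_under_water = 0
    longest = 0
    for r in returns:
        wealth *= 1.0 + r
        if wealth >= peak:
            # New high: any drawdown in progress has fully recovered.
            peak = wealth
            days_under_water = 0
        else:
            days_under_water += 1
            max_drawdown = max(max_drawdown, 1.0 - wealth / peak)
            longest = max(longest, days_under_water)
    return max_drawdown, longest
```

A strategy that adapts faster to a new regime shows up here as a smaller first value, a smaller second value, or both.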

5.2 Zero-shot Setting

For our zero-shot experiments in Table 2, the strategy improvement over the baseline is more pronounced than in the few-shot setting. We plot the strategy returns over the entire backtest (1995 to 2023) in Fig. 6, to demonstrate viability across time. We note a degradation of performance from 2020 to 2023 and, again, we plot drawdowns over the period 2018 to 2023, to draw attention to the COVID-19 “momentum crash”. Despite the fact that the strategy degrades, trading unseen assets during this period is an extremely challenging setting and not only is the strategy still profitable, it outperforms all benchmarks.

Unlike the few-shot setting, it is evident that the joint loss function is the key driver of the improvement of results in the zero-shot setting. Furthermore, in the zero-shot setting, X-Trend-G outperforms X-Trend-Q, indicating that the simpler assumption of Gaussian returns is favourable in a low-resource setting.

5.3 Dissecting the X-Trend Architecture

In Fig. 7 and Fig. 8 we explore the relationship between our predictive returns output distribution and the $\mathrm{PTP}$ trading signal. In both examples, we observe that the predictive mean and median only make small movements above and below zero. This indicates that the model has learnt that we are operating in a low signal-to-noise environment, aiming to capitalize on slight market inefficiencies. Despite observing that there are small spikes and dips in predictive volatility, the estimate typically remains close to 1, which suggests that our volatility targeting step, which targets 1 in this example, is functioning as expected.

For X-Trend-G (Fig. 7), we observe that there is an almost linear relationship between the predictive mean and trading signal, indicating that the strategy uses the predictive mean as a measure of conviction and sizes the position accordingly. The relationship between predictive volatility and position is less clear. This suggests that it is the predictive mean driving the strategy and the predictive volatility is being used for volatility scaling. The volatility scaling pre-processing step, which alters leverage based on the 60-day ex-ante volatility, appears to work well: the 95% confidence interval is fairly stable around $\pm 1.96$ when targeting a daily volatility of 1.
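
The $\pm 1.96$ band follows directly from the Gaussian predictive distribution: a central 95% interval is the mean plus or minus roughly 1.96 standard deviations, so a predictive mean near zero and a predictive volatility near 1 give an interval close to $(-1.96, 1.96)$. A minimal sketch using only the Python standard library; the function name is illustrative.

```python
from statistics import NormalDist


def gaussian_interval(mu, sigma, level=0.95):
    """Central interval of N(mu, sigma^2) containing `level` probability mass."""
    if sigma <= 0.0:
        raise ValueError("sigma must be positive")
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    # Standard normal quantile for the upper end of the central interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return mu - z * sigma, mu + z * sigma
```

If the predictive volatility drifted away from the target, the band would widen or narrow in proportion, which is why its stability is a useful diagnostic of the volatility scaling step.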

For X-Trend-Q (Fig. 8), we observe that the trading signal deviates more from the predictive median than X-Trend-G, indicating that it incorporates the full predictive distribution into the trading signal. This is likely the reason that the Sharpe for X-Trend-Q outperforms X-Trend-G by 7.1% in the few-shot setting over the turbulent period of 2018 to 2023. This supports our hypothesis that, for the best results, Gaussianity of returns should not be assumed in the few-shot setting. The assumption of Gaussian returns may be more appropriate in the low-resource setting.

Figure 7: The relationship between X-Trend-G predictive mean and volatility with the PTP trading signal for the Wheat continuous futures contract. Here, we use leverage to target a daily volatility of 1, for clarity. We have provided a 95% confidence interval in the top plot to illustrate our predictive standard deviation.
Figure 8: The relationship between X-Trend-Q predictive median, inter-quartile range, and 95% confidence interval with the PTP trading signal for S&P 500 Mini continuous futures contract. The bottom plot is a zoomed version of the top plot, with the trading signal superimposed. Here, we target daily volatility of 1, for clarity.

We provide an illustrative example of how we can interpret the cross-attention weights in Fig. 9 for the natural gas futures contract in 2022, a period exhibiting significant trends due to the Russia-Ukraine conflict. We examine 3 points in time, which correspond to 3 different regimes, and in all cases, the top 3 attention weights are highly intuitive. The target point at the beginning of an uptrend correctly identifies another commodity uptrend with the highest weight, with the other top 2 being a commodity mean-reversion and a large equity uptrend. The target point at the beginning of the large downtrend clearly identifies another commodity sequence with a large downtrend, with almost double the weighting of the next highest. The target point during the beginning of a reversal identifies a slight downtrend with reversion, an extremely short downtrend, and an uptrend with significant reversion for the top three weights. Interestingly, in this case, the model actually identifies similarities with equities sequences instead of commodities.

Figure 9: An illustrative example of the top 3 attention weights for the target natural gas futures contract in 2022, a period exhibiting significant trends due to the Russia-Ukraine conflict. The $+$ symbols align the query (target) with associated hidden states for keys in the context sets, using colours to match attention weights with the time of forecast. The points we focus on are: the beginning of a significant uptrend (top), the beginning of a significant downtrend (middle), and a reversion (bottom). This example uses X-Trend with change-point segmented context sequences, $|\mathcal{C}| = 20$ (meaning a uniform attention pattern would set all weights to 5%), and max length $l_{\max} = 20$. We list the ticker, hidden-state date, and attention weight for each context sequence. The context contracts plotted are Soybean Oil (ZL), Rough Rice (ZR), CAC 40 index (CA), Platinum (ZP), Cocoa (CC), Mexican Peso (MP), Nikkei index (NK) and FTSE 100 index (LX).
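
To make the comparison against a uniform attention pattern concrete, the following minimal sketch normalizes a vector of attention scores with a softmax and reports the top three weights relative to the uniform baseline $1/|\mathcal{C}|$. The scores are made-up illustrative values rather than model outputs, and the helper names `softmax` and `top_k_weights` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_weights(weights, k=3):
    """Return (index, weight) pairs for the k largest attention weights."""
    ranked = sorted(enumerate(weights), key=lambda iw: iw[1], reverse=True)
    return ranked[:k]

# Hypothetical raw scores for |C| = 20 context sequences (illustrative only).
scores = [0.0] * 20
scores[4], scores[11], scores[17] = 2.0, 1.5, 1.0

weights = softmax(scores)
uniform = 1.0 / len(weights)  # 5% when |C| = 20

for idx, w in top_k_weights(weights):
    print(f"context {idx}: weight {w:.1%} ({w / uniform:.1f}x uniform)")
```

A weight well above the uniform level indicates that the model is concentrating on that context sequence rather than averaging over the whole context set.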

6 Related Works

Transfer Learning. We can transfer features learned from one dataset to enable learning from new datasets more quickly. Transferring NN weights can also enable learning from new, smaller datasets. This can be done by fine-tuning the weights of a NN on a new task with SGD, or by linear probing, which freezes the lower layers of the network and only updates the top layers [39].

Few-shot learning. Few-shot learning is characterized by enabling NNs to learn to make predictions using little data. One key idea for enabling this is to train the few-shot learning agent in the same way that it is used for testing. This is the idea of episodic learning: if we wish to train a model to make predictions using only $k$ examples, or shots, then we need to train the model by performing $k$-shot learning [8, 13]. Non-parametric methods are a natural fit for few-shot learning [40]. Learning distances between a few context points and a new target point has been fruitful for few-shot image classification [10, 41, 42]. Neural processes learn to sample functions like Gaussian Processes [43, 37, 20]. Automatic context construction for few-shot learning using change-point detection methods has also been explored for image datasets [44]. Recently there have been works that challenge some of the common assumptions for few-shot image classification [45, 46, 47].
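
The episodic idea described above can be sketched as follows: each training step draws an episode of $k$ context examples plus one distinct target, mirroring how the model is used at test time. This is a generic illustration, not the sampling procedure of any cited work; the function name `sample_episode` and the toy pool of sequences are assumptions.

```python
import random

def sample_episode(sequences, k, rng):
    """Draw one k-shot episode: k context sequences and one distinct target."""
    if len(sequences) < k + 1:
        raise ValueError("need at least k + 1 sequences to form an episode")
    chosen = rng.sample(range(len(sequences)), k + 1)  # indices without replacement
    context = [sequences[i] for i in chosen[:k]]
    target = sequences[chosen[k]]
    return context, target

rng = random.Random(0)
# Toy pool of sequences; in practice these would be time-series segments.
pool = [[float(i + j) for j in range(5)] for i in range(10)]
context, target = sample_episode(pool, k=3, rng=rng)
```

For time-series, an episode sampler would additionally need to ensure that every context sequence is observed before the target; this sketch ignores that constraint.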

Few-shot learning for time-series. Neural Processes [37] which parameterize a distribution over functions have been employed for few-shot learning for time-series forecasting [48]. Neural processes algorithmically use a latent variable to enable drawing a new function for each point by sampling; the latent variable can employ autoregressive transition dynamics for time-series [49]. One can condition the target time-series predictions on a context of time-series with a cross-attention mechanism [38]. The cross-attention mechanism has been shown to be very effective for Neural Processes [20]. Gradient-based few-shot learning has also been successfully employed for time-series [50]. From the literature and from our own initial experiments on toy data (Appendix B) we found the cross-attention very effective for few-shot learning for time-series.

Deep-learning for financial time-series. Building on the vanilla DMNs [3], the Momentum Transformer [19], which is a variant of the Temporal Fusion Transformer [25], incorporates an attention mechanism to attend to prior LSTM hidden states from the same sequence. The attention pattern naturally segments the time-series into regimes, where significant importance is placed on the final hidden state of each regime, motivating the cross-attention step in this paper. Other works demonstrate that causal convolutional filters can be utilized to automatically generate features [51, 32], as an alternative to the momentum factor approach used in this paper. The work of [18] utilizes convolutions followed by an LSTM to generate features from limit order book (LOB) data. There is growing evidence to suggest that we can benefit from a cross-section of assets to assist our forecasting: the work by [32] implements convolutional filters across assets and [52] uses a graph learning model to reveal momentum spillover. Meta-learning has been used in finance to construct a partial index portfolio that tracks a benchmark index, where the asset allocation is meta-learned [53].

For a broad review of the financial machine learning literature, we direct readers to [54]. For a more general overview of deep-learning techniques for time-series forecasting we direct readers to [55].

7 Conclusions and Future Work

We introduce X-Trend: the Cross Attentive Time-Series Trend Network. It leverages few-shot learning and change-point detection to enable adaptation to new financial regimes. We show that it is able to recover from the COVID-19 drawdown almost twice as quickly as an equivalent neural time-series agent. Over the 5-year period 2018 to 2023, we are able to improve risk-adjusted returns by 18.9% compared to the baseline agent and around 10-fold compared to a conventional Time-series Momentum (TSMOM) strategy. This boost in performance is largely driven by our cross-attention step, which transfers trends from similar patterns in a context set. Furthermore, our model can generate profitable zero-shot trading signals in an extremely challenging low-resource setting: when trading a previously unseen asset, we achieve a Sharpe of 0.47, compared to loss-making time-series momentum baselines, both deep-learning based and conventional TSMOM. In this work we withheld assets from a standard dataset to test zero-shot performance; a future avenue of work could be applying this framework to an emerging asset class such as cryptocurrencies.

We illustrate the importance of constructing a good context set, where we improve Sharpe by 11.3% after segmenting the sequences with change-point detection. A future avenue of work would be to further investigate context set construction. One possibility could be a cross-sectional approach where we attend across a universe of assets at the same time, motivated by the works [32, 56], with the option of including lead-lag relationships [57]. We could consider generating synthetic data for the context set [58]. Another direction of work, inspired by the Neural Process literature, is to reconcile optimizing a Sharpe ratio with optimizing the evidence lower bound for variational inference required for latent variable Neural Process time-series models [38, 48]. Finally, this work could be combined with other innovations that expand upon the standard Deep Momentum Network framework, such as bringing transaction costs into the loss function [3] and including change-point features in the decoder [4]. We could also use self-attention in the temporal dimension [19, 25], introducing it in the decoder, and automatically generate features from our assets [32].

8 Acknowledgements

The authors would like to thank the Oxford-Man Institute of Quantitative Finance for its generous support. SR would like to thank the U.K. Royal Academy of Engineering.

References

  • [1] Kent Daniel and Tobias J. Moskowitz. Momentum crashes. Journal of Financial Economics, 122(2):221 – 247, 2016.
  • [2] Ashish Garg, Christian L Goulding, Campbell R Harvey, and Michele Mazzoleni. Momentum turning points. Available at SSRN 3489539, 2021.
  • [3] Bryan Lim, Stefan Zohren, and Stephen Roberts. Enhancing time-series momentum strategies using deep neural networks. The Journal of Financial Data Science, 1(4):19–38, 2019.
  • [4] Kieran Wood, Stephen Roberts, and Stefan Zohren. Slow momentum with fast reversion: A trading strategy using deep learning and changepoint detection. The Journal of Financial Data Science, 4(1):111–129, 2022.
  • [5] Tobias J Moskowitz, Yao Hua Ooi, and Lasse Heje Pedersen. Time series momentum. Journal of financial economics, 104(2):228–250, 2012.
  • [6] Nick Baltas. The impact of crowding in alternative risk premia investing. Financial Analysts Journal, 75(3):89–104, 2019.
  • [7] Gregory W Brown, Philip Howard, and Christian T Lundblad. Crowded trades and tail risk. The Review of Financial Studies, 35(7):3231–3271, 2022.
  • [8] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. Advances in neural information processing systems, 29, 2016.
  • [9] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
  • [10] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
  • [11] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016.
  • [12] Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.
  • [13] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126–1135. PMLR, 2017.
  • [14] James M Poterba and Lawrence H Summers. Mean reversion in stock prices: Evidence and implications. Journal of financial economics, 22(1):27–59, 1988.
  • [15] Dimitri Vayanos and Paul Woolley. An institutional theory of momentum and reversal. The Review of Financial Studies, 26(5):1087–1145, 2013.
  • [16] William F. Sharpe. The sharpe ratio. The Journal of Portfolio Management, 21(1):49–58, 1994.
  • [17] Justin Sirignano and Rama Cont. Universal features of price formation in financial markets: Perspectives from deep learning. SSRN, 2018.
  • [18] Zihao Zhang, Stefan Zohren, and Stephen Roberts. Deeplob: Deep convolutional neural networks for limit order books. IEEE Transactions on Signal Processing, 67(11):3001–3012, 2019.
  • [19] Kieran Wood, Sven Giegerich, Stephen Roberts, and Stefan Zohren. Trading with the momentum transformer: An intelligent and interpretable architecture. arXiv preprint arXiv:2112.08534, 2021.
  • [20] Hyunjik Kim, Andriy Mnih, Jonathan Schwarz, Marta Garnelo, Ali Eslami, Dan Rosenbaum, Oriol Vinyals, and Yee Whye Teh. Attentive neural processes. arXiv preprint arXiv:1901.05761, 2019.
  • [21] Carl Doersch, Ankush Gupta, and Andrew Zisserman. Crosstransformers: spatially-aware few-shot transfer. Advances in Neural Information Processing Systems, 33:21981–21993, 2020.
  • [22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [23] Roman Garnett, Michael A Osborne, Steven Reece, Alex Rogers, and Stephen J Roberts. Sequential bayesian prediction in the presence of changepoints and faults. The Computer Journal, 53(9):1430–1446, 2010.
  • [24] Yunus Saatçi, Ryan D Turner, and Carl E Rasmussen. Gaussian process change point models. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 927–934, 2010.
  • [25] B Lim, SO Arik, N Loeff, and T Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. arXiv preprint arXiv:1912.09363, 2019.
  • [26] Abby Y. Kim, Yiuman Tse, and John K. Wald. Time series momentum and volatility scaling. Journal of Financial Markets, 30:103 – 124, 2016.
  • [27] Campbell R. Harvey, Edward Hoyle, Russell Korgaonkar, Sandy Rattray, Matthew Sargaison, and Otto van Hemert. The impact of volatility targeting. SSRN, 2018.
  • [28] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  • [29] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.
  • [30] Brian Hurst, Yao Hua Ooi, and Lasse Heje Pedersen. A century of evidence on trend-following investing. The Journal of Portfolio Management, 44(1):15–29, 2017.
  • [31] Jamil Baz, Nicolas Granger, Campbell R. Harvey, Nicolas Le Roux, and Sandy Rattray. Dissecting investment strategies in the cross section and time series. SSRN, 2015.
  • [32] Tom Liu, Stephen Roberts, and Stefan Zohren. Deep inception networks: A general end-to-end framework for multi-asset quantitative strategies. arXiv preprint arXiv:2307.05522, 2023.
  • [33] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [34] Cheng Guo and Felix Berkhahn. Entity embeddings of categorical variables. arXiv preprint arXiv:1604.06737, 2016.
  • [35] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
  • [36] Ruofeng Wen, Kari Torkkola, Balakrishnan Narayanaswamy, and Dhruv Madeka. A multi-horizon quantile recurrent forecaster. arXiv preprint arXiv:1711.11053, 2017.
  • [37] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J Rezende, SM Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
  • [38] Shenghao Qin, Jiacheng Zhu, Jimmy Qin, Wenshuo Wang, and Ding Zhao. Recurrent attentive neural process for sequential data. arXiv preprint arXiv:1910.09323, 2019.
  • [39] Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022.
  • [40] Massimiliano Patacchiola, Jack Turner, Elliot J Crowley, Michael O’Boyle, and Amos J Storkey. Bayesian meta-learning for the few-shot setting via deep kernels. Advances in Neural Information Processing Systems, 33:16108–16118, 2020.
  • [41] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1199–1208, 2018.
  • [42] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
  • [43] Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, and SM Ali Eslami. Conditional neural processes. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2018.
  • [44] James Harrison, Apoorva Sharma, Chelsea Finn, and Marco Pavone. Continuous meta-learning without tasks. Advances in neural information processing systems, 33:17571–17581, 2020.
  • [45] Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B Tenenbaum, and Phillip Isola. Rethinking few-shot image classification: a good embedding is all you need? In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, pages 266–282. Springer, 2020.
  • [46] Guneet S Dhillon, Pratik Chaudhari, Avinash Ravichandran, and Stefano Soatto. A baseline for few-shot image classification. arXiv preprint arXiv:1909.02729, 2019.
  • [47] Steinar Laenen and Luca Bertinetto. On episodes, prototypical networks, and few-shot learning. Advances in Neural Information Processing Systems, 34:24581–24592, 2021.
  • [48] Timon Willi, Jonathan Masci, Jürgen Schmidhuber, and Christian Osendorfer. Recurrent neural processes. arXiv preprint arXiv:1906.05915, 2019.
  • [49] Gautam Singh, Jaesik Yoon, Youngsung Son, and Sungjin Ahn. Sequential neural processes. Advances in Neural Information Processing Systems, 32, 2019.
  • [50] Gerald Woo, Chenghao Liu, Doyen Sahoo, Akshat Kumar, and Steven Hoi. Deeptime: Deep time-index meta-learning for non-stationary time-series forecasting. arXiv preprint arXiv:2207.06046, 2022.
  • [51] Jingwen Jiang, Bryan T Kelly, and Dacheng Xiu. (Re-)imag(in)ing price trends. Chicago Booth Research Paper, (21-01), 2020.
  • [52] Xingyue Stacy Pu, Stephen Roberts, Xiaowen Dong, and Stefan Zohren. Network momentum across asset classes. August 7, 2023.
  • [53] Yongxin Yang and Timothy Hospedales. Partial index tracking: A meta-learning approach. In Conference on Lifelong Learning Agents, pages 415–436. PMLR, 2023.
  • [54] Bryan T Kelly and Dacheng Xiu. Financial machine learning. Technical report, National Bureau of Economic Research, 2023.
  • [55] Bryan Lim and Stefan Zohren. Time-series forecasting with deep learning: a survey. Philosophical Transactions of the Royal Society A, 379(2194):20200209, 2021.
  • [56] Wee Ling Tan, Stephen Roberts, and Stefan Zohren. Spatio-temporal momentum: Jointly learning time-series and cross-sectional strategies. arXiv preprint arXiv:2302.10175, 2023.
  • [57] Yichi Zhang, Mihai Cucuringu, Alexander Y Shestopaloff, and Stefan Zohren. Robust detection of lead-lag relationships in lagged multi-factor models. arXiv preprint arXiv:2305.06704, 2023.
  • [58] Magnus Wiese, Robert Knobloch, Ralf Korn, and Peter Kretschmer. Quant gans: deep generation of financial time series. Quantitative Finance, 20(9):1419–1440, 2020.
  • [59] Taesup Kim, Jaesik Yoon, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. arXiv preprint arXiv:1806.03836, 2018.
  • [60] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [61] Ronald J. Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks, 1989.
  • [62] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [63] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [64] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Autodiff Workshop – Advances in Neural Information Processing (NeurIPS), 2017.

Supplementary Material

Appendix A Gaussian Process Change Point Segmentation

Even after using returns to linearly de-trend the price series and employing volatility scaling to target a consistent volatility, we still encounter occasional and significant periods of disequilibrium or regime change. The work in [4] explores how a Gaussian Process (GP) based online change-point detection (CPD) [23, 24] module can be inserted to help address any transitions in regime, which motivates the episodic approach taken in this paper. The work in [4] assumes that there is a single change-point in some pre-specified lookback window (LBW), then calculates the change-point location and severity; an approximately stationary regime therefore corresponds to a very low severity. The severity is computed from the improvement in log marginal likelihood when using a change-point covariance kernel, in comparison to a simple Gaussian Process, $\mathcal{GP}(\cdot,\cdot)$. (Footnote 8: A typical choice could be an Ornstein–Uhlenbeck process, which corresponds to the Matérn 1/2 kernel. For consistency with the motivating work [4], we use the Matérn 3/2 kernel.) The change-point kernel assumes that there are instead two underlying Gaussian Processes of the same kernel, $\mathcal{GP}_1(\cdot,\cdot)$ and $\mathcal{GP}_2(\cdot,\cdot)$ (which we refer to together as $\mathcal{GP}_{\mathrm{C}}$), either side of the assumed change-point, with a soft transition from the first to the second. That is, we measure the benefit of using two separate kernels instead of one. We leave the location of the transition, $t_{\mathrm{CPD}}$, as a free variable, which we tune when we maximize the likelihood.
In this work we use an LBW of $l_{\mathrm{lbw}} = 21$ as a compromise between speed of detection and robustness to noise [4]. Note that regimes can be longer than $l_{\mathrm{lbw}}$, because we only identify a regime change when the severity threshold $\nu$ is reached. We detail our CPD algorithm in Algorithm 1. We use a change-point severity threshold of $\nu = 0.9$ for the experiments where we set the maximum regime length to $l_{\max} = 21$, and we set $\nu = 0.95$ for the experiments where we use $l_{\max} = 63$. We disregard regimes with lengths less than $l_{\min} = 5$.
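
Algorithm 1 tests the ratio $L_C / (L_M + L_C)$ against $\nu$. GP libraries usually return log marginal likelihoods, and this ratio is algebraically a logistic sigmoid of the log-likelihood improvement, $1 / (1 + e^{-(\log L_C - \log L_M)})$. The sketch below evaluates it in that form to avoid overflow; the function name `cpd_severity` is hypothetical and the inputs are assumed to be log marginal likelihoods from the two fits.

```python
import math

def cpd_severity(log_l_m, log_l_c):
    """Severity L_C / (L_M + L_C), computed from log marginal likelihoods.

    Algebraically equal to sigmoid(log L_C - log L_M); branching on the sign
    of the difference keeps exp() from overflowing.
    """
    diff = log_l_c - log_l_m
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

Under this form, a threshold of $\nu = 0.9$ corresponds to the change-point model improving the log marginal likelihood by at least $\ln 9 \approx 2.2$.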

Data: price series $p^{(i)}_{1:T}$, CPD LBW $l_{\mathrm{lbw}}$, CPD threshold $\nu$, min. segment length $l_{\min}$, max. segment length $l_{\max}$
Result: segmented price series $\{p^{(i)}_{t_0:t_1}\}_{(t_0,t_1)\in\mathcal{R}}$
Initialize: $t \leftarrow T$, $t_1 \leftarrow T$, regimes $\mathcal{R} \leftarrow \emptyset$;
while $t \geq 0$ do
    Fit $\mathcal{GP}$ with Matérn 3/2 kernel on $p_{t-l_{\mathrm{lbw}}:t}$ and calculate marginal likelihood, $L_M$;
    Fit $\mathcal{GP}_{\mathrm{C}}$ with Change-point kernel on $p_{t-l_{\mathrm{lbw}}:t}$ and calculate marginal likelihood, $L_C$, and change-point location hyperparameter, $t_{\mathrm{CPD}}$;
    if $\frac{L_C}{L_M + L_C} \geq \nu$ then
        $t_0 \leftarrow \lceil t_{\mathrm{CPD}} \rceil$;
        if $t_1 - t_0 \geq l_{\min}$ then
            $\mathcal{R} \leftarrow \mathcal{R} \cup \{(t_0, t_1)\}$;
        end if
        $t \leftarrow \lfloor t_{\mathrm{CPD}} \rfloor - 1$;  # so that ‘good’ representation isn’t corrupted
        $t_1 \leftarrow t$;
    else
        $t \leftarrow t - 1$;
        if $t_1 - t > l_{\max}$ then
            $t \leftarrow t_1 - l_{\max}$;
        end if
        if $t_1 - t = l_{\max}$ then
            $\mathcal{R} \leftarrow \mathcal{R} \cup \{(t, t_1)\}$;
            $t_1 \leftarrow t$;
        end if
    end if
end while
Algorithm 1: Time-series CPD segmentation
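
As a minimal sketch of the control flow in Algorithm 1, the function below walks backwards through a price series and collects regime boundaries. It is not the authors' implementation: the two GP fits are replaced by a hypothetical callback `detect`, the series is 0-indexed, the window is clamped at the start of the series, the loop stops once the start is reached, and a guard forces progress on every iteration. All of these are assumptions made for illustration.

```python
import math

def segment_regimes(prices, detect, l_lbw=21, nu=0.9, l_min=5, l_max=21):
    """Backward segmentation pass following the structure of Algorithm 1.

    `detect(prices, start, end)` is a hypothetical stand-in for the two GP
    fits. It returns (severity, t_cpd) for the window prices[start:end + 1],
    where t_cpd is an absolute, possibly fractional, index in that window.
    """
    t = len(prices) - 1   # right edge of the current lookback window
    t1 = t                # end of the regime currently being built
    regimes = []
    while t > 0:          # assumption: stop at the start of the series
        start = max(0, t - l_lbw)  # assumption: clamp the window at index 0
        severity, t_cpd = detect(prices, start, t)
        if severity >= nu:
            t0 = math.ceil(t_cpd)
            if t1 - t0 >= l_min:
                regimes.append((t0, t1))
            # Skip past the change-point; min() guarantees the loop advances.
            t = min(math.floor(t_cpd) - 1, t - 1)
            t1 = t
        else:
            t -= 1
            if t1 - t > l_max:
                t = t1 - l_max
            if t1 - t == l_max:
                regimes.append((t, t1))
                t1 = t
    return regimes

def toy_detect(prices, start, end):
    """Toy detector: flag a change-point where the level jumps by more than 5."""
    for i in range(start + 1, end + 1):
        if abs(prices[i] - prices[i - 1]) > 5:
            return 1.0, i - 0.5
    return 0.0, float(start)

# Toy series with a single level shift between indices 29 and 30.
prices = [0.0] * 30 + [10.0] * 30
regimes = segment_regimes(prices, toy_detect)
```

Consecutive maximum-length segments share an endpoint in this scheme, so a half-open slice such as `prices[t0:t1]` avoids duplicating observations across regimes.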
Figure 10: Recurrent attentive neural process. The keys and values are the context set hidden states. We have a self-attention mechanism over hidden states in the values. The LSTM encoders for the values and keys have separate parameters. The queries are encoded by a separate LSTM with parameters $\psi$. The cross-attention mechanism uses the similarity between the target $x^*$ and the context set $\mathcal{C}$ to output a representation of the context set, $r$. The latent variable $s$ is sampled to produce a summary of the context set. Both are used to condition an LSTM decoder that makes predictions on a target set $x^*$.

Appendix B Recurrent Attentive Neural Process Experiments

B.1 Model Architecture

We use the recurrent attentive neural process [38] as a benchmark to study the importance of certain model components. In this section, we will summarize the model and provide an overview in Fig. 10.

We use $|\mathcal{C}|$ contexts, which are sequences of length $l$. The targets $x^*_{1:k}$ have length $k$ and are required to produce a prediction $y^* := x^*_{k+1}$. Causally, the contexts are all observed at time-steps prior to the targets. The contexts are passed into an LSTM encoder with parameters $\theta$. The final hidden states $\mathbf{h}_i$ are passed into a self-attention module over contexts before the cross-attention branch.

We perform cross-attention between the context encodings $\mathbf{h}_i$ and the targets $x^*_{1:k}$. The targets are encoded by a separate LSTM network with parameters $\psi$, and the final hidden state $h$ is the query vector in the cross-attention model. The encoded contexts $\mathbf{h}_i, \forall i \in \mathcal{C}$, are the values. The keys are the contexts, which are encoded similarly to the values but with another LSTM encoder with parameters $\varphi$. The output of the cross-attention is $r$ and is used to condition the decoder [59, 38].
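
The cross-attention step above can be sketched for a single query as follows. This assumes single-head scaled dot-product attention [22] and omits the LSTM encoders and any learned projections, so the query, keys and values are taken as given vectors; the function name `cross_attention` is hypothetical.

```python
import math

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Returns the attended representation r and the attention weights over
    the context encodings.
    """
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim_v = len(values[0])
    r = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return r, weights

# Toy example: three context encodings, the first best aligned with the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
r, weights = cross_attention(query, keys, values)
```

Because the weights sum to one, $r$ is a convex combination of the values, dominated by the contexts whose keys are most similar to the query.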

The encodings $\mathbf{h}_i$ from the encoder are then aggregated with a sum operation, $\bigoplus$ in Fig. 10. The aggregated encoding is then passed into two separate linear layers, which parameterize the mean and variance of a latent variable $s$ that is Gaussian distributed and trained using the reparameterization trick [60]. The context encoding $r$ and the draw $Z$ from $s$ are stacked on top of the targets $x^*$ before producing predictions with an LSTM decoder network with parameters $\phi$ [38, 48].
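
A minimal sketch of the latent path described above: the context encodings are summed, two linear layers produce the parameters of a Gaussian, and a draw is formed as $Z = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. As an assumption, the second layer here outputs a log-variance so that the variance stays positive; the names `linear` and `sample_latent` and all weights are hypothetical.

```python
import math
import random

def linear(x, weight, bias):
    """Affine map: one output per (row of weight, bias entry)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def sample_latent(encodings, w_mu, b_mu, w_logvar, b_logvar, rng):
    """Sum-aggregate the context encodings, then draw z = mu + sigma * eps."""
    agg = [sum(col) for col in zip(*encodings)]   # sum over contexts
    mu = linear(agg, w_mu, b_mu)
    log_var = linear(agg, w_logvar, b_logvar)     # assumption: log-variance output
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]
    return z, mu, log_var
```

Since the randomness enters only through $\epsilon$, the draw is a deterministic function of $\mu$ and $\sigma$ given $\epsilon$, which is what allows gradients to flow to the two linear layers.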

B.2 Results

We generate a dataset of GP draws on which the recurrent attentive neural process [38] (Fig. 10) produces forecasts. The GP curves are generated using an RBF (Radial Basis Function) kernel with a length scale of 0.4 and noise variance of 1.0. Each context is drawn from a different GP draw. For each sequence in the context $\mathcal{C}$, the input $x_c$ and output $y_c$ are temporally connected sequences of length $l_C \sim \mathrm{U}[10, 30]$. Likewise, the target is taken from a separate GP draw, where the input $x_T$ and the output $y_T$ are temporally connected sequences of length $l_T \sim \mathrm{U}[10, 30]$. The target decoder is trained for 50k iterations using teacher forcing and, during testing, unrolls the forecast using the previous prediction as a new input [61]. We use a Gaussian likelihood and assess test performance using the MSE.
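
The toy data generation can be sketched as below. Several details are assumptions, since the text does not specify them: the inputs lie on an evenly spaced grid with spacing `step`, the signal variance is one, and "noise variance" is interpreted as observation noise added to the diagonal of the covariance. The split of each draw into input and output parts is left out, and all function names are hypothetical.

```python
import math
import random

def rbf_kernel(xs, length_scale=0.4, noise_var=1.0):
    """RBF covariance matrix with observation noise added on the diagonal."""
    n = len(xs)
    return [[math.exp(-0.5 * ((xs[i] - xs[j]) / length_scale) ** 2)
             + (noise_var if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_gp_sequence(rng, min_len=10, max_len=30, step=0.1):
    """Draw one noisy GP sequence whose length is uniform on [min_len, max_len]."""
    n = rng.randint(min_len, max_len)   # inclusive bounds, i.e. U[10, 30]
    xs = [i * step for i in range(n)]   # assumption: evenly spaced inputs
    chol = cholesky(rbf_kernel(xs))
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # y = L @ eps has covariance L @ L^T, i.e. the kernel matrix.
    ys = [sum(chol[i][k] * eps[k] for k in range(i + 1)) for i in range(n)]
    return xs, ys

rng = random.Random(0)
xs, ys = sample_gp_sequence(rng)
```

Calling `sample_gp_sequence` once per context sequence, and once more for the target, gives each sequence its own independent GP draw as described above.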

From Fig. 11 we can see that the baseline LSTM forecaster with no context $\mathcal{C}$ suffers from overfitting after training for a fixed number of iterations. The full sequential NP, with latent variable, self-attention over the context hidden states, and cross-attention between target and contexts, obtains good forecasting performance across all experimental settings and context set sizes.

When we ablate certain components, we see that performance remains the same when removing the latent variable and the self-attention. However, if we remove the cross-attention between the contexts and the target sequences, then the sequential NP underfits severely. This underfitting becomes more severe as more context sequences are used to condition the sequential NP (Fig. 11).

Figure 11: The final test MSE for different model choices for an increasing number of context sequences. The results are the mean $\pm 1$ standard error over $5$ seeds.

Appendix C Training Details

We calibrate our model using the training data by optimizing the Sharpe loss function via minibatch Stochastic Gradient Descent (SGD) [28], using the Adam optimizer [62]. We employ dropout [63], which helps to prevent the model from overfitting by randomly removing hidden nodes in the training phase. We list the fixed model parameters for each architecture in Table 4, including the early stopping patience. We keep the last 10% of the training data, for each asset, as a validation set. We implement 10 iterations of random grid search, as an outer optimization loop, to select the best hyperparameters based on the validation set. The hyperparameter search grid for each architecture is listed in Table 3. We perform 10 full repeats of the outer optimization loop and ensemble the resulting 10 models for our experiments to reduce noise.
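As a minimal sketch of the Sharpe loss referred to above, assuming the commonly used form (the negative annualised Sharpe ratio of the returns captured by the positions): the function name `sharpe_loss`, the 252-day annualisation factor, and the `eps` stabiliser are assumptions for illustration and are not specified in this appendix. The returns passed in are assumed to be already volatility-scaled.

```python
import math

def sharpe_loss(positions, next_returns, periods_per_year=252, eps=1e-9):
    """Negative annualised Sharpe ratio of the captured returns.

    positions    : positions in [-1, 1], one per time step
    next_returns : next-step returns (assumed volatility-scaled), same length
    """
    captured = [p * r for p, r in zip(positions, next_returns)]
    n = len(captured)
    mean = sum(captured) / n
    var = sum((c - mean) ** 2 for c in captured) / n
    # eps avoids division by zero when all captured returns are identical
    return -math.sqrt(periods_per_year) * mean / math.sqrt(var + eps)

# A long-only position on positive returns gives a negative (good) loss
loss = sharpe_loss([1.0, 1.0, 1.0], [0.01, 0.02, 0.03])
```

In a PyTorch implementation the same expression would be written with tensor operations, so that gradients flow back through the positions to the model parameters.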

Our model was implemented in the deep-learning framework PyTorch [64]. It was trained on an NVIDIA GeForce RTX 3090 GPU.

Table 3: Hyperparameter Search Range.

| Hyperparameters | Random Grid |
| --- | --- |
| Dropout Rate | 0.3, 0.4, 0.5 |
| Hidden Layer Size, $d_h$ | 64, 128 |
| Minibatch Size, $b$ | 64, 128 |
| Max Gradient Norm | $10^{-2}$, $10^{0}$, $10^{2}$ |

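The outer optimization loop over this search range can be sketched as below. This is an illustrative sketch only: the dictionary keys, the function `random_grid_search`, and the `evaluate` callback (which would train a model and return its validation loss) are hypothetical names, and sampling configurations independently with replacement is an assumption.

```python
import random

# Search range from the hyperparameter table; key names are illustrative
SEARCH_GRID = {
    "dropout_rate": [0.3, 0.4, 0.5],
    "hidden_size": [64, 128],
    "minibatch_size": [64, 128],
    "max_grad_norm": [1e-2, 1e0, 1e2],
}

def random_grid_search(evaluate, grid, num_iterations=10, seed=0):
    """Sample configurations at random and keep the lowest validation loss."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(num_iterations):
        # One value per hyperparameter, drawn uniformly from its grid
        config = {name: rng.choice(values) for name, values in grid.items()}
        score = evaluate(config)  # validation loss, lower is better
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

# Placeholder objective standing in for "train, then score on validation set"
best_config, best_score = random_grid_search(
    lambda cfg: cfg["dropout_rate"], SEARCH_GRID
)
```

Repeating this loop with 10 different seeds and averaging the outputs of the 10 selected models would give an ensemble of the kind described in the training details.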
Table 4: Training Fixed Parameters.

| Parameter | Value |
| --- | --- |
| Learning Rate | $10^{-3}$ |
| Target, training warm-up steps, $l_s$ | 63 |
| Target, total LSTM steps, $l_t$ | 126 |
| Early stopping patience | 10 |
| Maximum SGD iterations | 100 |
| Number attention heads | 4 |
| Attention dimension, $d_{\mathrm{att}}$ | $d_h$ |
| Joint loss weight, $\alpha_{\mathrm{X\text{-}Trend\text{-}G}}$ | 1 |
| Joint loss weight, $\alpha_{\mathrm{X\text{-}Trend\text{-}Q}}$ | 5 |
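
How the patience and the iteration cap listed above interact can be illustrated with a short sketch. The function and callback names are hypothetical, and treating one "SGD iteration" as one training pass followed by one validation check is an assumption, since the appendix does not define the unit.

```python
def train_with_early_stopping(run_iteration, validation_loss,
                              max_iterations=100, patience=10):
    """Stop once the validation loss has not improved for `patience` iterations."""
    best_loss = float("inf")
    best_iteration = -1
    for iteration in range(max_iterations):
        run_iteration(iteration)            # one training pass (assumed unit)
        loss = validation_loss(iteration)   # loss on the held-out validation set
        if loss < best_loss:
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration >= patience:
            break                           # patience exhausted
    return best_iteration, best_loss

# Toy validation curve: improves for three iterations, then plateaus
losses = [5.0, 4.0, 3.0] + [3.5] * 97
calls = []
best_iteration, best_loss = train_with_early_stopping(
    calls.append, lambda i: losses[i]
)
```

In practice the model weights from `best_iteration` would be restored before evaluation.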

Appendix D Datasets

We backtest our X-Trend variants on a portfolio of 50 liquid, continuous futures contracts over the period 1990–2023, extracted from the Pinnacle Data Corp CLC Database. The futures contracts are chained together using the backwards ratio-adjusted method. For our few-shot experiments, we use the same assets for both training and testing, using all assets in both Table 5 and Table 6. For our zero-shot experiments, we randomly selected 20 of the 50 Pinnacle assets as the target set $\mathcal{I}_{ts}$, which we detail in Table 6, leaving the other 30 for $\mathcal{I}_{tr}$, which we detail in Table 5. Note that we chose not to include any fixed income contracts in the zero-shot portfolio, because they are more correlated than the other futures contracts.
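
For readers unfamiliar with backwards ratio adjustment, the idea can be sketched generically as follows. This is not the data vendor's procedure, whose roll rules are not described here: the function name and the input layout (consecutive contract segments that overlap on the roll date) are assumptions made for illustration.

```python
def backwards_ratio_adjust(segments):
    """Chain per-contract price segments into one continuous series.

    segments : list of price lists, oldest contract first. Consecutive
               segments overlap on the roll date: the last price of one
               segment and the first price of the next share that date.
    """
    adjusted = list(segments[-1])   # the most recent contract is left unadjusted
    factor = 1.0
    # Walk backwards through the rolls, accumulating the price ratio at each one
    for older, newer in zip(reversed(segments[:-1]), reversed(segments[1:])):
        factor *= newer[0] / older[-1]   # new / old contract price on roll date
        adjusted = [price * factor for price in older[:-1]] + adjusted
    return adjusted

# Two contracts; on the roll date the old one trades at 104, the new one at 110
series = backwards_ratio_adjust([[100.0, 102.0, 104.0], [110.0, 112.0, 115.0]])
```

Because earlier prices are rescaled multiplicatively, percentage returns are preserved across each roll, which is the property that matters for return-based features.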

Table 5: Assets in $\mathcal{I}_{tr}$ for zero-shot experiments and part of $\mathcal{I}$ for few-shot experiments.
Identifier Description
Commodities (CM)
CC COCOA
DA MILK III, composite
LB LUMBER
SB SUGAR #11
ZA PALLADIUM, electronic
ZC CORN, electronic
ZF FEEDER CATTLE, electronic
ZI SILVER, electronic
ZO OATS, electronic
ZR ROUGH RICE, electronic
ZU CRUDE OIL, electronic
ZW WHEAT, electronic
ZZ LEAN HOGS, electronic
Equities (EQ)
EN NASDAQ, MINI
ES S&P 500, MINI
MD S&P 400 (Mini electronic)
SC S&P 500, composite
SP S&P 500, day session
XX DOW JONES STOXX 50
YM Mini Dow Jones ($5.00)
Fixed Income (FI)
DT EURO BOND (BUND)
FB T-NOTE, 5yr composite
TY T-NOTE, 10yr composite
UB EURO BOBL
US T-BONDS, composite
Foreign Exchange (FX)
AN AUSTRALIAN $$, composite
DX US DOLLAR INDEX
FN EURO, composite
JN JAPANESE YEN, composite
SN SWISS FRANC, composite
Table 6: Assets in $\mathcal{I}_{ts}$ for zero-shot experiments and additional assets for $\mathcal{I}$ in few-shot experiments. For zero-shot experiments $|\mathcal{I}_{ts}| = 20$ and for few-shot experiments $|\mathcal{I}| = 50$.
Identifier Description
Commodities (CM)
GI GOLDMAN SAKS C. I.
JO ORANGE JUICE
KC COFFEE
KW WHEAT, KC
NR ROUGH RICE
ZG GOLD, electronic
ZH HEATING OIL, electronic
ZK COPPER, electronic
ZL SOYBEAN OIL, electronic
ZN NATURAL GAS, electronic
ZP PLATINUM, electronic
ZT LIVE CATTLE, electronic
Equities (EQ)
CA CAC40 INDEX
ER RUSSELL 2000, MINI
LX FTSE 100 INDEX
NK NIKKEI INDEX
XU DOW JONES EUROSTOXX50
Foreign Exchange (FX)
BN BRITISH POUND, composite
CN CANADIAN $$, composite
MP MEXICAN PESO