(a) Concordia University, Department of Mathematics and Statistics, Montréal, Canada
(b) Quantact Laboratory, Centre de Recherches Mathématiques, Montréal, Canada
(c) Digital Insurance And Long term risk (DIALog) research chair, France

Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients

Acknowledgments: We thank the Natural Sciences and Engineering Research Council of Canada (Davar: FIN-ML NSERC CREATE program, Godin: RGPIN-2017-06837 and RGPIN-2024-04593, Garrido: RGPIN-2017-06643) for their financial support.

Parisa Davar, Frédéric Godin (corresponding author), Jose Garrido
Email addresses: [email protected] (Parisa Davar), [email protected] (Jose Garrido), [email protected] (Frédéric Godin).
June 28, 2024
Abstract

This paper tackles the problem of mitigating catastrophic risk (i.e. risk with very low frequency but very high severity) in the context of a sequential decision-making process. This problem is particularly challenging due to the scarcity of observations in the far tail of the distribution of cumulative costs (negative rewards). We develop a policy gradient algorithm, called POTPG, that is based on approximations of the tail risk derived from extreme value theory. Numerical experiments show that our method outperforms common benchmarks that rely on the empirical distribution. An application to financial risk management, more precisely to the dynamic hedging of a financial option, is presented.


Keywords: Risk-aware reinforcement learning, catastrophic risk, extreme value theory, peaks-over-threshold (POT), hedging.

1   Introduction

Reinforcement learning (RL) comprises a set of methods for optimizing sequential decision processes through interactions with an environment. In traditional RL (see Sutton and Barto, 2018), the primary objective is to maximize expected rewards. However, a subset of RL techniques, referred to as risk-aware reinforcement learning, also takes risk (i.e. departure from the expected case) into account; see for example Wu and Lin (1999), Borkar (2001), Tamar et al. (2012), La and Ghavamzadeh (2013), Chow et al. (2018), Greenberg et al. (2022) and Vijayan and Prashanth (2023). Integrating risk mitigation within the RL framework is of paramount importance in several areas, as policies producing high expected rewards together with a high risk might be unacceptable in certain circumstances. Financial risk management is a key area in which risk-aware RL methods are developed; see for instance Buehler et al. (2019), Carbonneau and Godin (2021), Cao et al. (2023) and Wu and Jaimungal (2023).

The present work is concerned with problems involving the minimization of catastrophic risk in a sequential decision process, i.e. the risk of outcomes that are very rare but of extreme magnitude. Since such extreme events can have very undesirable consequences depending on the area of application, such as financial ruin, health-impairing consequences or accidents, mitigating their impact is very important. Another example of application is the measurement of capital requirements in finance or insurance, which are based on the average of outcomes in the very worst-case scenarios; minimizing capital requirements is a key determinant of a financial institution's profitability, as capital is costly to hold.

Here, extreme risk is quantified through risk measures reflecting the far tail of the distribution of total costs incurred by the agent. In particular, we consider the special case of the conditional Value-at-Risk, $\text{CVaR}_{\alpha}$, which represents the average outcome among the worst possible set of scenarios with probability $1-\alpha$. The main motivation of the work is that CVaR with a very high confidence level $\alpha$ is very poorly approximated with the empirical distribution, due to the scarcity of observations in the far tail. Such paucity can be caused either by the lack of extreme observations, or by the inability to generate, in a reasonable time frame, a sufficient number of scenarios that include enough extreme data points. [Footnote 1: Importance sampling (IS) methods can sometimes help with this issue, if a scenario generator is used. However, suitably improving performance with IS requires knowing the direction in which to tilt risk driver (i.e. states) distributions to produce extreme outcomes. Such information is not necessarily known in the context of highly complex and non-linear dynamics (for instance the optimization of a large financial portfolio with non-linear instruments such as exotic options), and methods alternative to IS would be required in such cases.] In the most acute cases, extreme outcomes might even lie outside the data range, not having materialized yet. The scarcity of tail observations can be exacerbated if the cost outcomes from the problem have fat-tailed distributions.

Our main contribution is to develop a policy gradient method for risk-aware RL problems that is tailor-made for cases where catastrophic-level risk must be minimized. We refer to our algorithm as POTPG, as it relies on the peaks-over-threshold (POT) approach of extreme value theory (EVT), which extrapolates the far-tail behavior of a distribution through asymptotic approximations that leverage the distribution of large (but not extreme) outcomes. To the best of our knowledge, we are the first to incorporate EVT results within reinforcement learning algorithms to tackle general sequential decision problems; our work can be seen as an extension of Troop et al. (2022), who explore catastrophic risk minimization within the multi-armed bandits framework but do not tackle the more general Markov decision problem setting.

The paper is divided as follows. Section 2 describes the risk-aware sequential decision making problem considered here, and provides a conventional policy gradient algorithm to tackle the problem. Section 3 proposes our POT policy gradient (POTPG) approach, a modified policy gradient algorithm based on extreme value theory estimates of the tail risk of a distribution. This algorithm is tailor-made to tackle catastrophic-level risk minimization. Section 4 benchmarks the performance of POTPG against the conventional approach in a controlled environment, whereas Section 5 assesses its performance in a financial risk management application, namely option hedging optimization. The paper concludes with some final remarks in Section 6. The Python code to replicate the various numerical experiments of this paper is available at https://github.com/parisadavar/EVT-policy-gradient-RL.

2   A risk-aware reinforcement learning problem and policy gradients

We herein consider the framework of Markov decision processes [Footnote 2: This work generalizes to non-Markovian state transition dynamics in a straightforward way.] to represent sequential decision problems. Such problems are represented by a set of time steps $\mathcal{T}=\{0,\ldots,T\}$, a state space $\mathcal{S}$, an action space $\mathcal{A}$, a cost space $\mathcal{C}$ and a sequence of transition probabilities characterizing the joint distribution of the next-step cost and state, given the current state and action, namely $\mathbb{P}[S_{t+1}\leq s^{\prime},C_{t+1}\leq c\,|\,A_{t}=a,S_{t}=s]$ for $s,s^{\prime}\in\mathcal{S}$, $a\in\mathcal{A}$, $c\in\mathcal{C}$ and $t\in\mathcal{T}\backslash\{T\}$.

Without loss of generality, deterministic policies $\pi:\mathcal{S}\rightarrow\mathcal{A}$ are considered in this work. Such a framework gives rise to random state-action-cost sequences of the form $S_{0},A_{0},C_{1}$, $S_{1},A_{1},C_{2},\ldots$, $S_{T-1},A_{T-1},C_{T}$, where at any time point $t$ the agent takes action $A_{t}=\pi(S_{t})$ when encountering state $S_{t}$, after which the cost $C_{t+1}$ and the next-stage state $S_{t+1}$ are drawn randomly from the probability measure $\mathbb{P}$.

2.1   A risk-aware reinforcement learning problem

The risk-aware reinforcement learning problem considered here [Footnote 3: Other formulations of risk-aware RL problems exist, such as maximizing the expected rewards under some risk constraints (see for instance Prashanth et al., 2022), or using dynamic risk measures leading to time-consistent dynamic programs (see Saeed Marzban and Li, 2023; Coache et al., 2023).] is to find the optimal policy minimizing the risk associated with the cumulative discounted cost: denoting costs as $C^{(\pi)}_{t}$ to highlight their dependence on policy $\pi$, the problem considered can be written as

$$\underset{\pi}{\min}\,\rho\left(\sum^{T}_{t=1}\gamma^{t}C^{(\pi)}_{t}\right) \tag{2.1}$$

for some discount factor $\gamma\in(0,1]$ and a risk measure $\rho$ mapping random variables into perceived risk. In the classic non-risk-aware case (see for instance Sutton and Barto, 2018), the risk measure is the expectation operator, namely $\rho=\mathbb{E}$. However, more general risk measures can be used to depict preferences of risk-aware agents. Since this work is concerned with catastrophic risk mitigation, we consider the specific case of the CVaR risk measure (Rockafellar and Uryasev, 2002) depicting tail risk and defined as

$$\text{CVaR}_{\alpha}(X)=\frac{1}{1-\alpha}\int_{\alpha}^{1}q_{\gamma}(X)\,d\gamma,\quad\text{with }\,q_{\alpha}(X)=\inf\{x\in\mathbb{R}\,|\,F(x)\geq\alpha\},$$

where $F$ is the cumulative distribution function (CDF) of $X$ and $q_{\alpha}(X)$ is its quantile at level $\alpha$. In what follows we write $q_{\alpha}=q_{\alpha}(X)$. When $X$ is an absolutely continuous random variable, as in this work, CVaR has the alternative representation $\text{CVaR}_{\alpha}(X)=\mathbb{E}[X\,|\,X\geq q_{\alpha}]$, which can be interpreted as the average outcome among the set of the $100(1-\alpha)\%$ worst-case scenarios. This work considers catastrophic risk minimization, and as such, we consider very high levels for $\alpha$, i.e. $\alpha$ very close to one.

2.2   A policy gradient solution approach

A natural approach to solving (2.1) is to use policy gradient methods. Policies are first restricted to a set of parametric policies $\pi_{\theta}$ with parameter vector $\theta$. In that case, Problem (2.1) reduces to solving

$$\theta^{*}=\underset{\theta}{\arg\min}\ J(\theta)\quad\text{with }\,\,J(\theta)=\text{CVaR}_{\alpha}\left(\sum^{T}_{t=1}\gamma^{t}C^{(\pi_{\theta})}_{t}\right). \tag{2.2}$$

A common solution approach to the above problem is to use batch stochastic gradient descent, which leads to a sequence of parameter vectors $\{\theta^{(j)}\}_{j\geq 1}$ obtained through

$$\theta^{(j+1)}=\theta^{(j)}-\eta_{j}\widehat{\nabla J}(\theta^{(j)}),$$

with $\{\eta_{j}\}_{j\geq 1}$ representing the learning rate schedule and $\widehat{\nabla J}(\theta^{(j)})$ representing a suitable (stochastic) approximation of the gradient. Here we use the ADAM algorithm of Kingma and Ba (2014), with a step size parameter of $0.01$, to determine the learning rate sequence $\{\eta_{j}\}_{j\geq 1}$.

To approximate the gradient, a (forward) finite difference approach is used here: for some small $\epsilon>0$,

$$\widehat{\frac{\partial J}{\partial\theta_{i}}}(\theta^{(j)})\approx\frac{\widehat{J}(\theta^{(j)}+\epsilon\mathds{1}_{i})-\widehat{J}(\theta^{(j)})}{\epsilon}, \tag{2.3}$$

$$\widehat{\nabla J}(\theta^{(j)})=\left[\widehat{\frac{\partial J}{\partial\theta_{1}}}(\theta^{(j)})\ \ldots\ \widehat{\frac{\partial J}{\partial\theta_{p}}}(\theta^{(j)})\right]^{\top}, \tag{2.4}$$

with $p$ being the dimension of the parameter vector and $\mathds{1}_{i}$ being the dummy vector containing zeroes, except for its $i^{th}$ element that is equal to one. The objective function $J(\theta)$ is approximated by $\widehat{J}(\theta)$, which is computed from $n$ independent copies $X_{1},\ldots,X_{n}$ of the cumulative cost $X=\sum^{T}_{t=1}\gamma^{t}C^{(\pi_{\theta})}_{t}$. These copies are obtained from a Monte-Carlo simulation if a simulator of the environment is available, or alternatively through the application of the policy $\pi_{\theta}$ to real data, either in an online or offline fashion.
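As a minimal sketch of the forward finite-difference scheme in (2.3)-(2.4), the snippet below approximates the gradient of a generic objective. The function name `forward_difference_gradient` and the deterministic `toy_objective` are illustrative assumptions; in the setting of this section the objective would be a Monte-Carlo estimate $\widehat{J}(\theta)$, and whether the base and perturbed evaluations share the same random draws is an implementation choice that is not specified here.

```python
from typing import Callable, List, Sequence


def forward_difference_gradient(
    objective: Callable[[Sequence[float]], float],
    theta: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Forward finite-difference approximation of the gradient, as in (2.3)-(2.4)."""
    base = objective(theta)  # J_hat(theta), evaluated once and reused for every coordinate
    grad = []
    for i in range(len(theta)):
        shifted = list(theta)
        shifted[i] += eps  # theta + eps * 1_i
        grad.append((objective(shifted) - base) / eps)
    return grad


def toy_objective(theta: Sequence[float]) -> float:
    """Hypothetical smooth objective standing in for J_hat (illustration only)."""
    return theta[0] ** 2 + 3.0 * theta[1]


# Exact gradient at (1, 2) is (2, 3); the forward scheme has an O(eps) bias.
grad = forward_difference_gradient(toy_objective, [1.0, 2.0])
```

Note that each gradient evaluation requires $p+1$ evaluations of the objective, one at $\theta^{(j)}$ and one per perturbed coordinate.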

The most natural way to estimate the objective function is to assume that the empirical distribution of cumulative costs obtained through the mini-batch is close to the true distribution. This method is referred to as the sample averaging method and relies on

$$\widehat{J}=\widehat{\text{CVaR}}_{\alpha}(X)=\frac{\sum_{i=1}^{n}X_{i}\mathds{1}_{\{X_{i}\geq\hat{q}_{\alpha}\}}}{\sum_{j=1}^{n}\mathds{1}_{\{X_{j}\geq\hat{q}_{\alpha}\}}}, \tag{2.5}$$

where, denoting by $X_{(1)},\ldots,X_{(n)}$ the order statistics of the sample (i.e. the sample sorted in increasing order), $\hat{q}_{\alpha}$ is the empirical quantile given by:

$$\hat{q}_{\alpha}=\widehat{\text{VaR}}_{\alpha}(X)=\inf\{x\in\mathbb{R}\,|\,\hat{F}(x)\geq\alpha\}=X_{(\lceil\alpha n\rceil)}, \tag{2.6}$$

with $\widehat{F}$ being the empirical CDF of $X$ given by $\widehat{F}(x)=\frac{1}{n}\sum_{s=1}^{n}\mathds{1}_{\{X_{s}\leq x\}}$.
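The sample averaging estimator in (2.5)-(2.6) can be sketched in a few lines of plain Python. The function names `empirical_var` and `empirical_cvar` and the small tolerance guarding against floating-point rounding in $\lceil\alpha n\rceil$ are assumptions made for this illustration.

```python
import math
from typing import Sequence


def empirical_var(sample: Sequence[float], alpha: float) -> float:
    """Empirical quantile (2.6): the ceil(alpha * n)-th order statistic."""
    ordered = sorted(sample)
    n = len(ordered)
    # Small tolerance so that e.g. alpha * n = 8.000000000000002 still maps to rank 8.
    rank = math.ceil(alpha * n - 1e-12)
    rank = min(max(rank, 1), n)  # keep the 1-based rank inside [1, n]
    return ordered[rank - 1]


def empirical_cvar(sample: Sequence[float], alpha: float) -> float:
    """Sample averaging estimator (2.5): mean of observations at or above the quantile."""
    q_hat = empirical_var(sample, alpha)
    tail = [x for x in sample if x >= q_hat]
    return sum(tail) / len(tail)


# Toy sample 1, 2, ..., 10 with alpha = 0.8: the quantile is the 8th order statistic.
sample = [float(v) for v in range(1, 11)]
q_hat = empirical_var(sample, 0.8)
cvar_hat = empirical_cvar(sample, 0.8)
```

With $n=1000$ and $\alpha=0.999$, (2.6) selects $X_{(999)}$, so that (barring ties) only the two largest observations enter the average in (2.5).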

Unfortunately, when $\alpha$ is very close to one, the scarcity of tail observations can make the sample averaging estimate of the objective function very unstable, a problem that is exacerbated if the distribution of $X$ is heavy-tailed. This justifies the development of the EVT-based estimator described in the next section.

3   Integrating extreme value theory estimates into policy gradients

This section first discusses the construction of CVaR estimates based on the peaks-over-threshold (POT) approach rooted in extreme value theory (EVT). [Footnote 4: Alternative methods also based on EVT, such as that of Bairakdar et al. (2024), could also have been contemplated.] The POT approach is discussed in more depth in Coles et al. (2001) and McNeil et al. (2015). The procedure integrating such estimates into policy gradient approaches is subsequently detailed.

3.1   Estimation of CVaR with the peaks-over-threshold approach

A wide set of distributions satisfy the following condition.

Definition 3.1.

A CDF $F$ is said to be in the maximum domain of attraction of the generalized extreme value distribution (GEVD) $H_{\xi}$ with parameter $\xi$ [Footnote 5: The CDF of the GEVD is given by $H_{\xi}(x)=\exp\left(-(1+\xi x)^{-\frac{1}{\xi}}\right)$ if $\xi\neq 0$, or $H_{\xi}(x)=\exp(-e^{-x})$ if $\xi=0$, with support $\{x:1+\xi x>0\}$.], denoted $F\in MDA(H_{\xi})$, if there exist a sequence of positive numbers $\{a_{n}\}_{n\in\mathbb{N}}$ and a sequence of real numbers $\{b_{n}\}_{n\in\mathbb{N}}$ such that

$$\lim_{n\to\infty}F^{n}(a_{n}x+b_{n})=H_{\xi}(x). \tag{3.1}$$

Note that $F^{n}$ is the CDF of the maximum of $n$ i.i.d. copies of a random variable with CDF $F$.

The $F\in MDA(H_{\xi})$ property characterizes the asymptotic behavior of distribution $F$. Indeed, define $F_{u}$, the distribution of excesses above threshold $u$, as

$$F_{u}(y)=P(X-u\leq y\,|\,X>u)=P(X\leq y+u\,|\,X>u)=\frac{F(y+u)-F(u)}{1-F(u)}. \tag{3.2}$$

Define also the generalized Pareto distribution (GPD) as follows.

Definition 3.2.

The GPD with scale parameter $\sigma$ and shape parameter $\xi$ has a CDF

$$G_{\xi,\sigma}(x)=\begin{cases}1-\left(1+\frac{\xi x}{\sigma}\right)^{-\frac{1}{\xi}},&\mbox{if }\xi\neq 0,\\ 1-e^{-\frac{x}{\sigma}},&\mbox{if }\xi=0,\end{cases} \tag{3.3}$$

where the support is $x\geq 0$ for $\xi\geq 0$, and $0\leq x\leq-\frac{\sigma}{\xi}$ for $\xi<0$, and a probability density function (PDF)

$$g_{\xi,\sigma}(x)=\begin{cases}\frac{1}{\sigma}\left(1+\frac{\xi x}{\sigma}\right)^{-\frac{1}{\xi}-1},&\mbox{if }\xi\neq 0,\\ \frac{1}{\sigma}e^{-\frac{x}{\sigma}},&\mbox{if }\xi=0.\end{cases} \tag{3.4}$$

Then the following result from Balkema and De Haan (1974) and Pickands III (1975) states that when $F\in MDA(H_{\xi})$, the excess distribution $F_{u}$ is well approximated by a GPD when $u$ is close to the essential supremum of distribution $F$.

Theorem 3.1 (Pickands–Balkema–de Haan).

If $F\in MDA(H_{\xi})$, there exists a positive measurable function $\sigma(u)$ such that

$$\lim_{u\to y_{0}}\ \sup_{y\in[0,\,y_{0}-u]}\left|F_{u}(y)-G_{\xi,\sigma(u)}(y)\right|=0, \tag{3.5}$$

where $y_{0}=\sup\{y\in\mathbb{R}:F(y)<1\}\leq\infty$ is the right endpoint of $F$ and $G_{\xi,\sigma(u)}$ is the GPD CDF defined in (3.3) with scale parameter $\sigma(u)$.

As described in Section 7.2 of McNeil et al. (2015), such a result allows defining the following approximation for the CVaR of the variable $X$ with CDF $F$, which is based on the assumption that $F_{u}(y)\approx G_{\xi,\sigma(u)}(y)$ for $y\geq 0$, i.e. that $u$ is large enough. [Footnote 6: Note that the condition $\xi<1$ is required for the CVaR to exist, otherwise the GPD has an infinite expectation.]

Corollary 3.1.

Assume that $F\in MDA(H_{\xi})$ for some $\xi\in[0,1)$, and that $q_{\alpha}>u$. Let $\sigma=\sigma(u)$ satisfy the conditions of Theorem 3.1. Then for $s_{u,\alpha}=\frac{1-F(u)}{1-\alpha}$,

$$\text{CVaR}_{\alpha}(X)\approx c_{u,\alpha}=\begin{cases}u+\frac{\sigma}{1-\xi}\left(1+\frac{s^{\xi}_{u,\alpha}-1}{\xi}\right),&\mbox{if }\xi\neq 0,\\ u+\sigma(\log s_{u,\alpha}+1),&\mbox{if }\xi=0.\end{cases} \tag{3.6}$$
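For intuition, the approximation (3.6) can be recovered from the GPD tail approximation in a few lines. The following is only a sketch of the standard argument for $\xi\in(0,1)$, treating the GPD approximation of $F_{u}$ as exact above $u$; the case $\xi=0$ follows by letting $\xi\to 0$.

```latex
\begin{align*}
1-F(x) &= \bigl(1-F(u)\bigr)\bigl(1-F_{u}(x-u)\bigr)
        \approx \bigl(1-F(u)\bigr)\Bigl(1+\tfrac{\xi(x-u)}{\sigma}\Bigr)^{-1/\xi},
        \qquad x>u, \\
q_{\gamma} &\approx u+\frac{\sigma}{\xi}\left[\left(\frac{1-F(u)}{1-\gamma}\right)^{\xi}-1\right],
        \qquad \gamma\geq\alpha, \\
\text{CVaR}_{\alpha}(X) &= \frac{1}{1-\alpha}\int_{\alpha}^{1}q_{\gamma}\,d\gamma
        \approx u-\frac{\sigma}{\xi}
        +\frac{\sigma}{\xi}\,\frac{(1-F(u))^{\xi}}{1-\alpha}\int_{\alpha}^{1}(1-\gamma)^{-\xi}\,d\gamma \\
      &= u-\frac{\sigma}{\xi}+\frac{\sigma}{\xi}\,\frac{s_{u,\alpha}^{\xi}}{1-\xi}
       = u+\frac{\sigma}{1-\xi}\left(1+\frac{s_{u,\alpha}^{\xi}-1}{\xi}\right).
\end{align*}
```

The first line uses (3.2), the second inverts the approximate tail at level $1-\gamma$, and the third uses $\int_{\alpha}^{1}(1-\gamma)^{-\xi}\,d\gamma=\frac{(1-\alpha)^{1-\xi}}{1-\xi}$, which is finite only when $\xi<1$.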

This points toward the following procedure, called the peaks-over-threshold approach, to estimate ${CVaR}_{\alpha}(X)$ based on a sample of i.i.d. copies $X_1, \ldots, X_n$ of $X$:

  1. Select a proper threshold $u$.

  2. Calculate sample values of excesses over threshold $u$, denoted $Y_1, \ldots, Y_k$ and defined as $Y_i = X_{(n+1-i)} - u$, where $k$ is the number of sample observations $X_i$ above $u$.

  3. Fit a GPD distribution to the sample $Y_1, \ldots, Y_k$ to get estimates $(\hat{\xi}, \hat{\sigma})$.

  4. Replace $(\xi, \sigma)$ and $F(u)$ with respective estimates $(\hat{\xi}, \hat{\sigma})$ and $\hat{F}(u)$ in (3.6) to get an approximation for ${CVaR}_{\alpha}(X)$.

Since, for any fixed $u$, the excesses $Y_1, \ldots, Y_k$ are independent, Step 3 can be performed through maximum likelihood [Footnote 7: De-biasing procedures could additionally be applied to adjust maximum likelihood estimates, such as in Troop et al. (2021).] by solving numerically

$$(\hat{\xi}, \hat{\sigma}) = \underset{\xi,\sigma}{\arg\max} \sum_{i=1}^{k} \ln g_{\xi,\sigma}(Y_i). \tag{3.7}$$
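As a minimal sketch of the objective maximized in (3.7), the function below evaluates the GPD log-likelihood of a set of excesses. It assumes the standard GPD density $g_{\xi,\sigma}(y) = \frac{1}{\sigma}(1 + \xi y/\sigma)^{-1/\xi - 1}$ for $\xi \neq 0$ and $\frac{1}{\sigma}e^{-y/\sigma}$ for $\xi = 0$. The name `gpd_log_likelihood` is illustrative and not taken from the paper; any numerical optimizer could then be used to maximize it over $(\xi, \sigma)$.

```python
import math
from typing import Sequence


def gpd_log_likelihood(xi: float, sigma: float, excesses: Sequence[float]) -> float:
    """Sum of log GPD densities over the excesses (hypothetical helper).

    Returns -inf when the parameters are inadmissible for the data,
    so that an optimizer maximizing this function rejects them.
    """
    if sigma <= 0.0:
        return -math.inf
    total = 0.0
    for y in excesses:
        if abs(xi) < 1e-12:
            # Exponential limit of the GPD when xi = 0.
            total += -math.log(sigma) - y / sigma
        else:
            z = 1.0 + xi * y / sigma
            if z <= 0.0:
                # y lies outside the support implied by (xi, sigma).
                return -math.inf
            total += -math.log(sigma) - (1.0 / xi + 1.0) * math.log(z)
    return total
```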

Alternatively, a method-of-moments (MOM) estimator matching the first two moments [Footnote 8: Here the MOM estimator requires that $\xi < 1/2$ to ensure that the variance of the GPD be finite.] of the GPD distribution with those of the empirical distribution of excesses would lead to [Footnote 9: This is because if $Y \sim \text{GPD}(\xi, \sigma)$, then $\mathbb{E}(Y) = \frac{\sigma}{1-\xi}$ if $\xi < 1$ and $\text{Var}(Y) = \frac{\sigma^2}{(1-\xi)^2(1-2\xi)}$ if $\xi < 1/2$. Estimators in (3.8) are obtained by equating $\mathbb{E}(Y)$ and $\text{Var}(Y)$ with $\bar{Y}$ and $S^2$, respectively.]

$$\hat{\xi} = \frac{S^2 - \bar{Y}^2}{2S^2}, \quad \hat{\sigma} = \bar{Y}\left(\frac{S^2 + \bar{Y}^2}{2S^2}\right), \tag{3.8}$$

with $\bar{Y} = \frac{1}{k}\sum_{j=1}^{k} Y_j$ and $S^2 = \frac{1}{k}\sum_{j=1}^{k} (Y_j - \bar{Y})^2$.
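Steps 2 to 4 of the procedure, combined with the MOM estimator (3.8), can be sketched as follows. This is only an illustration under stated assumptions: the threshold $u$ is supplied by the caller, $\hat{F}(u)$ is taken to be the empirical CDF at $u$, and the function names are hypothetical. The sketch also presumes that at least two distinct excesses lie above $u$ and that $q_{\alpha} > u$.

```python
import math
from typing import Sequence, Tuple


def gpd_cvar_approx(u: float, sigma: float, xi: float, s: float) -> float:
    """CVaR approximation c_{u,alpha} of equation (3.6)."""
    if abs(xi) < 1e-12:
        return u + sigma * (math.log(s) + 1.0)
    return u + sigma / (1.0 - xi) * (1.0 + (s ** xi - 1.0) / xi)


def mom_gpd(excesses: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments estimates (xi_hat, sigma_hat) of equation (3.8)."""
    k = len(excesses)
    mean = sum(excesses) / k
    var = sum((y - mean) ** 2 for y in excesses) / k  # 1/k normalization, as in the text
    xi_hat = (var - mean ** 2) / (2.0 * var)
    sigma_hat = mean * (var + mean ** 2) / (2.0 * var)
    return xi_hat, sigma_hat


def pot_cvar(sample: Sequence[float], u: float, alpha: float) -> float:
    """Peaks-over-threshold CVaR estimate for a given threshold u."""
    excesses = [x - u for x in sample if x > u]       # Step 2
    xi_hat, sigma_hat = mom_gpd(excesses)             # Step 3 (MOM variant)
    tail_prob = len(excesses) / len(sample)           # 1 - F_hat(u), empirical (assumption)
    s = tail_prob / (1.0 - alpha)
    return gpd_cvar_approx(u, sigma_hat, xi_hat, s)   # Step 4
```

Swapping `mom_gpd` for a maximum likelihood fit of (3.7) would leave the rest of the pipeline unchanged.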

The task in Step 1, namely the selection of a suitable threshold $u$, is challenging, as it entails seeking a proper bias-variance trade-off. Indeed, if $u$ is too low, the distribution tail behavior might not be well-approximated by its asymptotic GPD distribution, leading to high bias. Conversely, choosing a $u$ that is too large will imply a low number of excesses, which will lead to high variance for the GPD parameter estimators. A common approach in the literature is to manually select $u$ through visual inspection of the so-called Hill plot (see McNeil et al., 2015). However, such a method is not appropriate in our setup since the choice of threshold $u$ needs to be repeated a very large number of times through the learning phase. As such, we rely on the Bader et al. (2018) algorithm that performs automated selection of the threshold based on a sequence of Anderson-Darling goodness-of-fit tests. Such a procedure tests a set of candidate values $u^{(1)} < \ldots < u^{(\ell)}$ and selects as threshold the smallest among these that leads to a proper fit of the GPD to excesses over $u$. The implementation from Troop et al. (2021) of the procedure is considered here and is detailed in Appendix A. This modified implementation stabilizes estimates, for instance by not allowing estimated values of $\xi$ too close to one (to avoid the CVaR estimate exploding) and by using the sample averaging estimator as a fallback when none of the thresholds leads to a satisfactory fit of the GPD.

3.2   Our proposed EVT policy gradient algorithm

The POT-based CVaR estimation method from Section 3.1 is now integrated into the policy gradient estimation formula in (2.3) to obtain a complete policy gradient learning procedure for the policy parameters $\theta$. This procedure, which we call the POTPG algorithm (standing for peaks-over-threshold policy gradient), is summarized in the Algorithm 1 box below.

Algorithm 1 POTPG
Input: $\epsilon$ (finite difference step), $n$ (number of episodes), $M$ (number of iterations)
Initialize $\theta^{(0)}$ through random sampling
for $j = 0, \ldots, M-1$ do
     Sample $n$ episodes of the MDP with policy $\pi_{\theta^{(j)}}$, and denote by $X_i = \sum_{t=1}^{T} \gamma^t C^{(\pi_{\theta^{(j)}})}_t$ the total discounted costs for the $i^{th}$ episode,
     Based on sample $X_1, \ldots, X_n$, obtain the estimates $\hat{\xi}$, $\hat{\sigma}$ and $\hat{F}(u)$, where the automated threshold selection method of Appendix A is applied to determine $u$,
     Obtain the EVT-based estimate $\widehat{J}(\theta^{(j)})$ through (3.6),
     for $i = 1, \ldots, p$ do
          Sample $n$ episodes of the MDP with policy $\pi_{\theta^{(j)} + \epsilon\mathds{1}_i}$, and denote by $X_i = \sum_{t=1}^{T} \gamma^t C^{(\pi_{\theta^{(j)} + \epsilon\mathds{1}_i})}_t$ the total discounted costs for the $i^{th}$ episode,
          Based on sample $X_1, \ldots, X_n$ and threshold $u$, obtain the estimates $\hat{\xi}$, $\hat{\sigma}$ and $\hat{F}(u)$,
          Obtain the EVT-based estimate $\widehat{J}(\theta^{(j)} + \epsilon\mathds{1}_i)$ through (3.6),
     Estimate the gradient $\widehat{\nabla J}(\theta^{(j)})$ through the finite difference scheme (2.3)-(2.4),
     $\theta^{(j+1)} \leftarrow \theta^{(j)} + \eta_j \widehat{\nabla J}(\theta^{(j)})$, with $\eta_j$ as determined by the ADAM algorithm.
Return $\theta^{(M)}$

If a simulator of the environment is available, it can be desirable, within a given iteration $j$, to use the same random seed to perform the simulation of episodes under policy $\pi_{\theta^{(j)}}$ and those under policies $\pi_{\theta^{(j)} + \epsilon\mathds{1}_i}$, $i = 1, \ldots, p$. This approach offers the advantage of isolating the impact of the policy alteration (from $\pi_{\theta^{(j)}}$ to $\pi_{\theta^{(j)} + \epsilon\mathds{1}_i}$) from the randomness associated with the generation of episodes; the latter can add noise to the gradient estimate. The same seed is used throughout all the experiments presented here to simulate episodes under the original and shocked policies.

Note also that we propose to use the same threshold $u$ to estimate $\widehat{J}(\theta^{(j)})$ and all $\widehat{J}(\theta^{(j)} + \epsilon\mathds{1}_i)$ in the POTPG algorithm to enhance the stability in the gradient estimation.
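The two stabilization devices just described (a common random seed and a shared threshold) can be illustrated with the sketch below. It assumes that the scheme (2.3)-(2.4), which is not reproduced in this section, is a forward finite difference, and it treats the simulator, the threshold selector and the CVaR estimator as user-supplied callables. All names here are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Simulator = Callable[[Sequence[float], random.Random], List[float]]
ThresholdSelector = Callable[[Sequence[float]], float]
Estimator = Callable[[Sequence[float], float], float]


def fd_gradient(
    theta: Sequence[float],
    simulate: Simulator,
    select_threshold: ThresholdSelector,
    estimate: Estimator,
    eps: float,
    seed: int,
) -> List[float]:
    """Forward finite-difference gradient estimate (assumed form of (2.3)-(2.4))."""
    base_sample = simulate(theta, random.Random(seed))
    u = select_threshold(base_sample)   # threshold chosen once, on the unshocked sample
    base_value = estimate(base_sample, u)
    grad = []
    for i in range(len(theta)):
        shocked = list(theta)
        shocked[i] += eps
        # Re-seeding reproduces the same underlying randomness for the shocked policy.
        shocked_sample = simulate(shocked, random.Random(seed))
        grad.append((estimate(shocked_sample, u) - base_value) / eps)
    return grad
```

With a common seed, the difference between the shocked and unshocked estimates reflects only the change in policy parameters, rather than a fresh draw of episodes.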

4   Simulation experiments in a controlled environment

Several simulation experiments in a controlled environment are first conducted to assess the performance of the POTPG algorithm from Section 3.2 and compare it to the conventional sample averaging (SA) benchmark based on (2.5). A simple simulation setting is considered to establish a proof of concept and highlight the potential usefulness of the POTPG algorithm. In this setting, we consider a one-dimensional policy vector $\theta$ (i.e. $p = 1$), and we assume the cumulative discounted cost is distributed according to a given family of distributions whose parameters depend on $\theta$.

More precisely, assume that under policy $\pi_{\theta}$, $X \sim \text{GPD}(\xi, \varsigma = (\theta - \vartheta)^2 + b)$ for some $\xi \in (0,1)$, $b > 0$ and $\vartheta \in \mathbb{R}$. [Footnote 10: Here, to avoid confusion, we use $\varsigma$ instead of $\sigma$ to represent the scale parameter of the whole distribution instead of that of the tail.] Fix $\vartheta = 0.4$ and $b = 2$ and consider values $\xi = 0.4$, $0.6$ or $0.8$ in subsequent experiments. Note that if $X \sim \text{GPD}(\xi, \varsigma)$, then the conditional exceedance $X - u \mid X > u \sim \text{GPD}(\xi, \varsigma + \xi u)$, meaning that the excess distribution of a GPD random variable is a GPD with the same shape parameter, and a scale parameter that grows linearly with the threshold $u$. In that case, representing the tail distribution with a GPD is exact and not merely an asymptotic approximation. Such a setting is used to test the POTPG algorithm in an ideal case with no misspecification of the tail distribution. The CVaR of the distribution with policy $\pi_{\theta}$ is then

$$\text{CVaR}_{\alpha}(X) = q_{\alpha} + \frac{\varsigma + \xi q_{\alpha}}{1-\xi} = \frac{\varsigma}{1-\xi}\left(1 + \frac{(1-\alpha)^{-\xi} - 1}{\xi}\right),$$

since $q_{\alpha} = \frac{\varsigma}{\xi}\left((1-\alpha)^{-\xi} - 1\right)$. Therefore,

$$\theta^{*} = \underset{\theta}{\arg\min}\, \text{CVaR}_{\alpha}(X) = \vartheta, \quad \underset{\theta}{\min}\, \text{CVaR}_{\alpha}(X) = \frac{b}{1-\xi}\left(1 + \frac{(1-\alpha)^{-\xi} - 1}{\xi}\right),$$

i.e. the optimal policy is to set $\theta = \vartheta$ to minimize the scale parameter of the cumulative discounted costs.
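The controlled environment above is simple enough to sketch directly. The snippet below draws cumulative costs by inverse-transform sampling from the GPD (using the quantile expression for $q_{\alpha}$ given above) and evaluates the closed-form CVaR. Function names and default arguments are illustrative choices rather than part of the original experiments.

```python
import math
import random
from typing import List


def gpd_quantile(p: float, xi: float, scale: float) -> float:
    """Quantile of GPD(xi, scale) at level p, for xi != 0."""
    return scale / xi * ((1.0 - p) ** (-xi) - 1.0)


def sample_costs(theta: float, xi: float, n: int, rng: random.Random,
                 vartheta: float = 0.4, b: float = 2.0) -> List[float]:
    """Draw n cumulative costs under policy parameter theta by inverse transform."""
    scale = (theta - vartheta) ** 2 + b
    return [gpd_quantile(rng.random(), xi, scale) for _ in range(n)]


def true_cvar(theta: float, xi: float, alpha: float,
              vartheta: float = 0.4, b: float = 2.0) -> float:
    """Closed-form CVaR of the cost distribution under policy parameter theta."""
    scale = (theta - vartheta) ** 2 + b
    return scale / (1.0 - xi) * (1.0 + ((1.0 - alpha) ** (-xi) - 1.0) / xi)
```

Since the CVaR is proportional to the scale $(\theta - \vartheta)^2 + b$, evaluating `true_cvar` on a grid of $\theta$ values reproduces the minimum at $\theta = \vartheta$.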

In each simulation run, we consider $M = 500$ iterations, in each of which $n = 2{,}000$ cumulative discounted cost realizations are generated. A total of $R = 50$ runs are performed. The finite difference step size for the gradient computation is $\epsilon = 0.01$. We set the initial policy parameter to $\theta^{(0)} = 1$. Define $\theta^{(j,r)}$ and $\widehat{J}^{(j,r)}$ respectively as the estimates of the policy parameter and of the objective function (CVaR) on the $j^{th}$ iteration of run $r$. For each iteration, we report the root-mean-square error (RMSE) across runs of the policy parameter and of the associated objective function:

$$\text{RMSE}_{\theta} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\theta^{(j,r)} - \theta^{*}\right)^2}, \quad \text{RMSE}_{\widehat{J}} = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\widehat{J}^{(j,r)} - J(\theta^{*})\right)^2}.$$

The CVaR level $\alpha = 0.998$ is chosen to depict catastrophic risk levels.
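The RMSE metrics defined above can be computed per iteration with a short helper such as the following sketch, where `values_by_run[r][j]` is assumed to hold the estimate ($\theta^{(j,r)}$ or $\widehat{J}^{(j,r)}$) from run $r$ at iteration $j$. The data layout and the function name are illustrative.

```python
import math
from typing import List, Sequence


def rmse_across_runs(values_by_run: Sequence[Sequence[float]], target: float) -> List[float]:
    """RMSE over runs, computed separately for each iteration j."""
    n_runs = len(values_by_run)
    n_iters = len(values_by_run[0])
    return [
        math.sqrt(sum((run[j] - target) ** 2 for run in values_by_run) / n_runs)
        for j in range(n_iters)
    ]
```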

Figure 1 reports metrics $\text{RMSE}_{\theta}$ and $\text{RMSE}_{\widehat{J}}$ with respect to iteration $j = 1, \ldots, M$ for the three different values of tail parameter $\xi$. The POTPG outperforms the SA benchmark in all experiments, as the RMSE on the optimal policy parameter decreases faster for the former approach. The extent of outperformance increases when the tail thickness (i.e. parameter $\xi$) increases. This is because sample averaging relies on only four observations, i.e. $n(1-\alpha) = 4$, coming from the tail of the distribution, which is increasingly unstable as tail thickness increases. The POT approach alleviates this issue by using many more observations from the body of the distribution to extrapolate tail behavior. Moreover, even after convergence to the optimal policy, both methods (POTPG and SA) exhibit residual estimation error in the estimate of the objective function (i.e. the CVaR of cumulative discounted costs). While $\text{RMSE}_{\widehat{J}}$ is generally smaller for the POTPG approach, this method outperforms SA significantly in the thickest-tail case $\xi = 0.8$. In conclusion, the thicker the tail of the cost distribution is, the more useful the POTPG approach is.
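To make the count of four tail observations concrete, the sketch below implements one common form of the sample averaging CVaR estimator, namely the average of the $n(1-\alpha)$ largest observations. The benchmark (2.5) is not reproduced in this section, so this form is an assumption and may differ in detail from the estimator actually used.

```python
from typing import Sequence


def tail_count(n: int, alpha: float) -> int:
    """Number of observations beyond the alpha-quantile, rounded to the nearest integer."""
    return max(1, round(n * (1.0 - alpha)))


def sample_average_cvar(sample: Sequence[float], alpha: float) -> float:
    """Average of the largest n(1 - alpha) observations (assumed form of the SA benchmark)."""
    m = tail_count(len(sample), alpha)
    largest = sorted(sample)[-m:]
    return sum(largest) / m
```

With $n = 2{,}000$ and $\alpha = 0.998$, `tail_count` returns 4, so each SA estimate is an average of only four heavy-tailed draws.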

[Figure 1 panels: (a) $\text{RMSE}_{\theta}$, $\xi = 0.4$; (b) $\text{RMSE}_{\widehat{J}}$, $\xi = 0.4$; (c) $\text{RMSE}_{\theta}$, $\xi = 0.6$; (d) $\text{RMSE}_{\widehat{J}}$, $\xi = 0.6$; (e) $\text{RMSE}_{\theta}$, $\xi = 0.8$; (f) $\text{RMSE}_{\widehat{J}}$, $\xi = 0.8$]
Figure 1: Training performance for the POTPG algorithm and the sample averaging (SA) benchmark. Left column: RMSE of policy parameter estimate $\text{RMSE}_{\theta}$. Right column: RMSE of the objective function (the CVaR) $\text{RMSE}_{\widehat{J}}$. RMSE metrics are computed over $R = 50$ runs.

5   Application to financial hedging

We present the application of the POTPG algorithm to a financial risk management problem, namely the dynamic Delta-Gamma hedging of an option. The problem of finding the optimal proportion of the Gamma to neutralize when options are very expensive is discussed.

5.1   The hedging framework

The time elapsed between consecutive time points is assumed to be one week (a period of length $1/52$ year). The periodic continuously compounded interest rate is $r = 0.02/52$. With $S_0 = 1{,}000$, let $S_t$ denote the time-$t$ price of a non-dividend-paying stock, whose dynamics are assumed to be a discrete-time version of an exponential normal-inverse Gaussian (NIG) Lévy process: $S_t = S_0 e^{\sum_{m=1}^{t} Z_m}$, with $\{Z_t\}_{t=1}^{T}$ being, under the physical measure $\mathbb{P}$, i.i.d. random variables with a NIG$(\mathtt{a}^{\mathbb{P}}, \beta^{\mathbb{P}}, \delta^{\mathbb{P}}, \mu^{\mathbb{P}})$ distribution whose PDF is given by

$$\phi^{NIG}(x; \mathtt{a}, \beta, \delta, \mu) = \frac{\mathtt{a}\delta e^{\delta\gamma}}{\pi} \frac{K_1\left(\mathtt{a}\sqrt{\delta^2 + (x-\mu)^2}\right)}{\sqrt{\delta^2 + (x-\mu)^2}} e^{\beta(x-\mu)}, \quad x \in \mathbb{R}, \tag{5.1}$$

where $\gamma = \sqrt{\mathtt{a}^2 - \beta^2}$ (not to be confused with the discount factor) and $K_{\lambda}(x)$ represents the modified Bessel function of the second kind with index $\lambda$, defined as:

$$K_{\lambda}(x) = \frac{1}{2}\int_{0}^{\infty} u^{\lambda-1} e^{-\frac{1}{2}x(u^{-1}+u)}\, du, \quad x > 0. \tag{5.2}$$

Such distribution is known to exhibit fat tails and is therefore well-suited to study the extreme risk minimization framework of this study.

Parameters considered are taken from Godin (2016), namely $\mathtt{a}^{\mathbb{P}} = 35.7$, $\beta^{\mathbb{P}} = -10.8$, $\delta^{\mathbb{P}} = 2.04 \times 10^{-2}$ and $\mu^{\mathbb{P}} = 6.7 \times 10^{-3}$.
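As a numerical sanity check on (5.1)-(5.2), the density can be evaluated in pure Python. The sketch below uses the substitution $u = e^{t}$ in (5.2), which gives $K_{\lambda}(x) = \int_{0}^{\infty} e^{-x\cosh t}\cosh(\lambda t)\, dt$, and approximates this integral with the trapezoid rule on a truncated range. It assumes the standard NIG convention $\gamma = \sqrt{\mathtt{a}^2 - \beta^2}$; the function names, the truncation rule and the number of steps are illustrative choices.

```python
import math


def bessel_k(lam: float, x: float, steps: int = 400) -> float:
    """Modified Bessel function of the second kind via (5.2) with u = exp(t)."""
    # Beyond this point the factor exp(-x*cosh(t)) is below exp(-x - 50).
    upper = math.acosh(1.0 + 50.0 / x)
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * math.exp(-x * math.cosh(t)) * math.cosh(lam * t)
    return total * h


def nig_pdf(x: float, a: float, beta: float, delta: float, mu: float) -> float:
    """NIG density (5.1), with gamma = sqrt(a^2 - beta^2) (assumed convention)."""
    gamma = math.sqrt(a * a - beta * beta)
    r = math.sqrt(delta * delta + (x - mu) ** 2)
    return (a * delta * math.exp(delta * gamma) / math.pi
            * bessel_k(1.0, a * r) / r * math.exp(beta * (x - mu)))
```

Integrating `nig_pdf` numerically over a wide enough range of weekly log-returns with the physical parameters above should give a total mass close to one, which is a quick way to catch transcription errors in the density.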

We consider a market with a high volatility risk premium where options are costly; as such, we assume that risk-neutral parameters are identical to the physical ones, except for the $\delta$ parameter driving the variance of returns, which is inflated by a factor of 4: $\mathtt{a}^{\mathbb{Q}} = \mathtt{a}^{\mathbb{P}}$, $\beta^{\mathbb{Q}} = \beta^{\mathbb{P}}$, $\delta^{\mathbb{Q}} = 4\delta^{\mathbb{P}}$ and $\mu^{\mathbb{Q}} = \mu^{\mathbb{P}}$. In such a market, fully neutralizing the gamma of the option being hedged is most likely sub-optimal due to the high option cost. Determining the hedge ratio that yields the best trade-off between cost and risk reduction is thus a non-trivial endeavor, and it is the problem considered in this section.

We assume that any European call option on such stock is priced according to the formula provided in Godin et al. (2012), which is based on the mean-correcting martingale measure described in Schoutens (2003). The time-$t$ price of a European call option with strike $E$ providing the time-$t'$ payoff $\max(0, S_{t'} - E)$ is

$$\begin{aligned} \Pi(t,\tau,E) &= S_t\left(1 - \Phi^{NIG}\left(\ln\left(\frac{E}{S_t}\right);\ \mathtt{a}^{\mathbb{Q}},\ \beta^{\mathbb{Q}}+1,\ \delta^{\mathbb{Q}}\tau,\ [\mu^{\mathbb{Q}}+\zeta^{\mathbb{Q}}]\tau\right)\right) \\ &\quad - Ee^{-r\tau}\left(1 - \Phi^{NIG}\left(\ln\left(\frac{E}{S_t}\right);\ \mathtt{a}^{\mathbb{Q}},\ \beta^{\mathbb{Q}},\ \delta^{\mathbb{Q}}\tau,\ [\mu^{\mathbb{Q}}+\zeta^{\mathbb{Q}}]\tau\right)\right), \end{aligned} \tag{5.3}$$

with $\tau=t'-t$ weeks, $\zeta^{\mathbb{Q}}=r-\mu^{\mathbb{Q}}+\delta^{\mathbb{Q}}\left(\sqrt{(\mathtt{a}^{\mathbb{Q}})^{2}-(\beta^{\mathbb{Q}}+1)^{2}}-\sqrt{(\mathtt{a}^{\mathbb{Q}})^{2}-(\beta^{\mathbb{Q}})^{2}}\right)$ and $\Phi^{NIG}$ denoting the CDF of the NIG distribution. It is straightforward to compute the Delta and the Gamma of such options:

$$
\begin{aligned}
\Delta(t,\tau,E) &= \frac{\partial\Pi(t,\tau,E)}{\partial S_{t}}=1-\Phi^{NIG}\left(\ln\left(\frac{E}{S_{t}}\right);\ \mathtt{a}^{\mathbb{Q}},\ \beta^{\mathbb{Q}}+1,\ \delta^{\mathbb{Q}}\tau,\ [\mu^{\mathbb{Q}}+\zeta^{\mathbb{Q}}]\tau\right),\\
\Gamma(t,\tau,E) &= \frac{\partial^{2}\Pi(t,\tau,E)}{\partial(S_{t})^{2}}=\frac{1}{S_{t}}\phi^{NIG}\left(\ln\left(\frac{E}{S_{t}}\right);\ \mathtt{a}^{\mathbb{Q}},\ \beta^{\mathbb{Q}}+1,\ \delta^{\mathbb{Q}}\tau,\ [\mu^{\mathbb{Q}}+\zeta^{\mathbb{Q}}]\tau\right).
\end{aligned}
$$
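As a sanity check on the drift correction $\zeta^{\mathbb{Q}}$, the short sketch below verifies numerically that a log-return distributed as NIG with the parameters appearing in (5.3) has expected gross return $e^{r\tau}$, which is the martingale property the mean-correcting measure is meant to enforce. It relies on the standard closed form of the NIG moment generating function. All parameter values are hypothetical weekly figures chosen for illustration only; they are not the calibrated values used in the experiments.

```python
import math

def nig_log_mgf(u, a, beta, delta, mu):
    """Log moment generating function of NIG(a, beta, delta, mu) evaluated at u."""
    return mu * u + delta * (math.sqrt(a**2 - beta**2)
                             - math.sqrt(a**2 - (beta + u)**2))

def zeta_q(r, a, beta, delta, mu):
    """Drift correction zeta^Q, transcribed from the definition given with (5.3)."""
    return (r - mu
            + delta * (math.sqrt(a**2 - (beta + 1)**2)
                       - math.sqrt(a**2 - beta**2)))

# Hypothetical weekly parameters (illustration only).
r, a, beta, mu = 0.0005, 50.0, -2.0, 0.003
delta_p = 0.005
delta_q = 4 * delta_p      # variance parameter inflated by a factor of 4 under Q
tau = 5.2                  # horizon in weeks

zeta = zeta_q(r, a, beta, delta_q, mu)
# Under Q, the tau-week log-return is NIG(a, beta, delta_q * tau, (mu + zeta) * tau).
log_growth = nig_log_mgf(1.0, a, beta, delta_q * tau, (mu + zeta) * tau)
print(log_growth, r * tau)  # the two numbers should agree up to rounding error
```

The check only involves the moment generating function, so it does not require evaluating $\Phi^{NIG}$ itself.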

We consider a financial institution (the hedging agent) which holds a short position in a call option with strike price $E=S_{0}$ and maturity $T=0.5\times 52=26$ weeks. This option is referred to as the target option. To mitigate the risk associated with the uncertainty of its payoff, a self-financing hedging portfolio is used. At any point in time, the portfolio is invested in three hedging assets, namely a risk-free account, the stock and an option on the stock. The time-$t$ value of the hedging portfolio is denoted $V^{\theta}_{t}$ (the superscript $\theta$ refers to its dependence on the policy) and evolves according to

$$
V^{\theta}_{t+1}=\underbrace{(V^{\theta}_{t}-\psi^{(S)}_{t}S_{t}-\psi^{(O)}_{t}H^{beg}_{t})}_{\text{cash investment}}e^{r}+\psi^{(S)}_{t}(S_{t+1}-S_{t})+\psi^{(O)}_{t}(H^{end}_{t+1}-H^{beg}_{t}),
$$

with $(\psi^{(S)}_{t},\psi^{(O)}_{t})$ being the respective portfolio positions on the time interval $[t,t+1)$ in the stock and in the option used for hedging, and $H^{beg}_{t}$ and $H^{end}_{t+1}$ being the respective time-$t$ and time-$(t+1)$ prices of the hedging option purchased at $t$. The positions are thus rebalanced at each period, and option positions are rolled over: the hedging options currently in the portfolio are liquidated at the end of the period while new ones are purchased. At the start of any period $[t,t+1)$, the option considered for purchase is at-the-money (its strike is $S_{t}$) and its maturity is $\tau=0.1\times 52=5.2$ weeks, i.e. $10\%$ of a year.
As such, $H^{beg}_{t}=\Pi(t,0.1\times 52,S_{t})$ and $H^{end}_{t+1}=\Pi(t+1,0.1\times 52-1,S_{t})$. Note that $H^{end}_{t+1}\neq H^{beg}_{t+1}$ in general, since the hedging option changes from one period to the next: the option liquidated at $t+1$ has strike $S_{t}$ and remaining maturity $0.1\times 52-1$ weeks, whereas the one purchased at $t+1$ has strike $S_{t+1}$ and maturity $0.1\times 52$ weeks. Moreover, $V^{\theta}_{0}=\Pi(0,T,S_{0})$ is the option premium that is initially invested in the hedging portfolio.

The optimal policy should characterize the selection of the positions $\psi^{(S)}_{t},\psi^{(O)}_{t}$, $t=0,\ldots,T$, to be included in the hedging portfolio. Assume that the agent wants to be fully Delta-neutral, which is obtained with

$$
\psi^{(S)}_{t}=\underbrace{\Delta(t,T-t,S_{0})}_{\text{target option }\Delta}-\psi^{(O)}_{t}\underbrace{\Delta(t,0.1\times 52,S_{t})}_{\text{hedging option }\Delta}.
$$

However, we assume that the agent might prefer not to fully neutralize the Gamma of the target option, since purchases of hedging options are costly in a market with a large volatility risk premium. The agent therefore only neutralizes a portion $\theta\in(0,1)$, called the hedge ratio, of the target option Gamma. This leads to $\psi^{(O)}_{t}\Gamma(t,0.1\times 52,S_{t})=\theta\Gamma(t,T-t,S_{0})$, and thus to $\psi^{(O)}_{t}=\theta\frac{\Gamma(t,T-t,S_{0})}{\Gamma(t,0.1\times 52,S_{t})}$.
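The two position formulas can be illustrated with a minimal sketch that takes the Greeks of the target and hedging options as given inputs and returns $(\psi^{(S)}_{t},\psi^{(O)}_{t})$. The function name and the numerical Greeks are hypothetical and serve only to show that the resulting portfolio has zero residual Delta and a residual Gamma equal to $(1-\theta)$ times the target option Gamma.

```python
def hedge_positions(theta, delta_target, gamma_target, delta_hedge, gamma_hedge):
    """Stock and option positions neutralizing all of the Delta and a
    fraction theta of the Gamma of the target option."""
    psi_o = theta * gamma_target / gamma_hedge
    psi_s = delta_target - psi_o * delta_hedge
    return psi_s, psi_o

# Hypothetical Greeks at some time t (illustration only).
theta = 0.6
delta_target, gamma_target = 0.55, 0.020
delta_hedge, gamma_hedge = 0.52, 0.045

psi_s, psi_o = hedge_positions(theta, delta_target, gamma_target,
                               delta_hedge, gamma_hedge)

# Exposure left unhedged once the positions are in place.
residual_delta = delta_target - psi_s - psi_o * delta_hedge
residual_gamma = gamma_target - psi_o * gamma_hedge
print(psi_s, psi_o, residual_delta, residual_gamma)
```

Setting `theta = 1` in this sketch would recover a fully Delta-Gamma-neutral hedge, while `theta = 0` reduces to plain Delta hedging with the stock only.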

The objective of the hedging agent is therefore to find the optimal hedge ratio, which is the optimal policy parameter $\theta$. A single terminal cost is considered for the agent: $C_{t}=\mathds{1}_{\{t=T\}}\left(\max(0,S_{T}-E)-V^{\theta}_{T}\right)$, and no discounting is applied, i.e. $\gamma=1$. The agent attempts to minimize the risk associated with catastrophic hedging shortfalls at maturity; hence we consider $\alpha=0.999$.

Before applying the reinforcement learning procedure, we approximate the objective function $J(\theta)$, the $\text{CVaR}_{0.999}$ of the hedging shortfall, for various hedge ratios $\theta$. Such approximations are produced with brute force Monte-Carlo simulations: for several values of $\theta$, $1,\!000,\!000$ realizations of the hedging shortfall $\max(0,S_{T}-E)-V^{\theta}_{T}$ are produced and sample averaging is applied, i.e. $J$ is estimated by the average of the $1,\!000$ largest realizations. Figure 2 reports such estimates, with the optimal hedge ratio being estimated to be $\theta^{*}=0.5991$ and the corresponding objective function being $\hat{J}(\theta^{*})=40.37$.
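The sample averaging estimate described above can be sketched as follows, under the assumption that the number of tail observations is $n(1-\alpha)$ rounded to the nearest integer (which gives $1,\!000$ observations for $n=1,\!000,\!000$ and $\alpha=0.999$). The exponential costs below are a toy stand-in for simulated hedging shortfalls, not the model of this section.

```python
import random

def cvar_sample_average(costs, alpha):
    """Average of the round(n * (1 - alpha)) largest costs (at least one)."""
    n = len(costs)
    k = max(1, round(n * (1.0 - alpha)))  # rounding avoids floating-point overshoot
    tail = sorted(costs, reverse=True)[:k]
    return sum(tail) / k

# Toy stand-in for simulated hedging shortfalls (illustration only).
random.seed(0)
costs = [random.expovariate(1.0) for _ in range(100_000)]
print(cvar_sample_average(costs, 0.999))  # average of the 100 largest draws
```

With only $n=1,\!000$ paths, the same estimator would rest on a single observation at $\alpha=0.999$, which illustrates why the empirical benchmark becomes unreliable when tail data are scarce.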

Figure 2: Objective function ($\text{CVaR}_{0.999}$ of the hedging shortfall) versus the hedge ratio $\theta$, representing the percentage of the target option Gamma being neutralized. Estimates are obtained by brute force calculations, i.e. through sample averaging over $1,\!000,\!000$ simulated paths. Red point: optimal value.

We now apply the POTPG algorithm to the policy optimization problem and compare its performance to that of the sample averaging (SA) method. Both methods are applied with either $n=1,\!000$ or $n=10,\!000$ simulated paths of weekly stock returns. $R=100$ independent runs are conducted, each comprising $M=500$ iterations for the case $n=1,\!000$, or $M=150$ iterations when $n=10,\!000$. In each run, the initial policy is set to $\theta^{(0)}=0$. The finite difference shock is $\epsilon=0.05$. The method of moments is used to estimate the tail parameters $\xi,\sigma$ in the POTPG algorithm, since this method exhibited (in unreported tests) greater stability than maximum likelihood estimation in the presented framework.
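To make the ingredients of one iteration concrete, the sketch below combines (i) method-of-moments estimates of the GPD parameters, obtained from the relations mean $=\sigma/(1-\xi)$ and mean$^{2}/$variance $=1-2\xi$, (ii) the standard peaks-over-threshold CVaR formula, and (iii) a finite-difference gradient with shock $\epsilon=0.05$. Several choices are assumptions made for illustration and are not taken from the experiments: the threshold is fixed at the empirical 95% quantile instead of being selected automatically, the finite difference is central, the update is a plain gradient step with a hypothetical step size, and the cost model is a toy one.

```python
import math
import random
import statistics

def gpd_method_of_moments(excesses):
    """Method-of-moments estimates (xi, sigma) of a GPD fitted to excesses."""
    m = statistics.fmean(excesses)
    v = statistics.variance(excesses)
    ratio = m * m / v
    xi = 0.5 * (1.0 - ratio)         # from mean^2 / variance = 1 - 2 * xi
    sigma = 0.5 * m * (1.0 + ratio)  # from mean = sigma / (1 - xi)
    return xi, sigma

def pot_cvar(costs, alpha, q_u=0.95):
    """POT estimate of CVaR_alpha; threshold u is the empirical q_u-quantile."""
    xs = sorted(costs)
    n = len(xs)
    u = xs[int(q_u * n)]
    excesses = [x - u for x in xs if x > u]
    k = len(excesses)
    xi, sigma = gpd_method_of_moments(excesses)
    y = (1.0 - alpha) * n / k        # (1 - alpha) / empirical exceedance probability
    if abs(xi) < 1e-9:               # exponential-tail limit
        var = u - sigma * math.log(y)
    else:
        var = u + sigma / xi * (y ** (-xi) - 1.0)
    return (var + sigma - xi * u) / (1.0 - xi)

def fd_gradient(objective, theta, eps=0.05):
    """Central finite-difference approximation of dJ/dtheta (assumed form)."""
    return (objective(theta + eps) - objective(theta - eps)) / (2.0 * eps)

# Toy costs: heavy-tailed noise scaled by a factor minimized at theta = 0.6.
random.seed(1)
noise = [5.0 * ((1.0 - random.random()) ** -0.2 - 1.0) for _ in range(10_000)]

def toy_objective(theta):
    scale = 1.0 + (theta - 0.6) ** 2
    return pot_cvar([scale * z for z in noise], alpha=0.999)

theta0 = 0.0
grad = fd_gradient(toy_objective, theta0)
theta1 = theta0 - 0.01 * grad        # hypothetical step size
print(grad, theta1)
```

Reusing the same noise draws for both shocked policies (common random numbers) keeps the finite difference from being dominated by simulation noise; whether the experiments above proceed this way is not stated here.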

Figure 3 reports the performance of the POTPG and SA policy gradient algorithms for the hedging problem, by displaying the evolution of the RMSE (across runs) of the estimate of the optimal policy parameter ($\text{RMSE}_{\theta}$) and of the corresponding objective function ($\text{RMSE}_{\hat{J}}$) versus the number of iterations conducted. The POTPG algorithm exhibits materially superior performance, with much lower errors on the estimates of the optimal policy parameter and objective function. The performance gap between POTPG and the benchmark is greater for the lower sample size $n=1,\!000$, which highlights that our method has more added value when data in the tail of the distribution are scarcer. Note that for neither method does the estimated policy parameter converge to the true optimal value (i.e. $\text{RMSE}_{\theta}$ does not converge to zero), which can be explained by the fact that both methods are biased for a finite sample size $n$. Nevertheless, a higher sample size $n$ increases precision, with lower RMSEs for the estimates of the policy parameter $\theta^{*}$ and of the objective function $J$.
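For concreteness, the across-run RMSE at a given iteration can be computed as in the sketch below, assuming that the error is measured against the brute-force benchmark value (here $\theta^{*}=0.5991$). The list of run-level estimates is hypothetical.

```python
import math

def rmse_across_runs(estimates, reference):
    """Root-mean-square error of run-level estimates around a reference value."""
    return math.sqrt(sum((x - reference) ** 2 for x in estimates) / len(estimates))

theta_star = 0.5991                    # brute-force benchmark reported above
theta_runs = [0.55, 0.62, 0.60, 0.58]  # hypothetical estimates from R = 4 runs
print(rmse_across_runs(theta_runs, theta_star))
```

Because the RMSE aggregates both bias and variance across runs, it does not vanish when the estimator is biased, even if every run converges.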

(a) $\text{RMSE}_{\theta}$, $n=1,\!000$
(b) $\text{RMSE}_{\hat{J}}$, $n=1,\!000$
(c) $\text{RMSE}_{\theta}$, $n=10,\!000$
(d) $\text{RMSE}_{\hat{J}}$, $n=10,\!000$
Figure 3: Evolution of the RMSE of the estimate of the optimal policy parameter ($\text{RMSE}_{\theta}$) and of the corresponding objective function ($\text{RMSE}_{\hat{J}}$) over iterations of the POTPG algorithm and the sample averaging (SA) benchmark. Top row: sample size $n=1,\!000$. Bottom row: $n=10,\!000$. Left panels: $\text{RMSE}_{\theta}$. Right panels: $\text{RMSE}_{\hat{J}}$.

6   Conclusion

We propose a policy gradient algorithm based on estimators of tail risk borrowed from extreme value theory to tackle the difficult task of catastrophic risk minimization within a sequential decision making framework. The peaks-over-threshold procedure is used to estimate the CVaR of cumulative costs by leveraging the asymptotic convergence of the tail distribution to a generalized Pareto distribution. We have shown in several simulation experiments, including an application to financial options hedging, that our method can outperform conventional benchmarks relying on the empirical distribution of the cumulative costs. Indeed, such benchmarks can perform quite poorly at mitigating extreme risk when observations in the tail are scarce.

Our method relies on finite difference approximations of the gradient, and as such it works for low-dimensional policies relying on a small number of parameters. An extension of our approach could consist in developing a high-dimensional EVT-based policy gradient framework to tackle more complex problems. This would for instance allow using policies represented by deep neural networks and combining the EVT-based policy gradient with deep reinforcement learning.

Appendix A     The automated threshold selection procedure

The automated threshold selection procedure involves testing several thresholds $u_{i}=\hat{F}^{-1}(q_{i})$, $i=1,\ldots,l$, which we choose to be quantiles of pre-determined levels $q_{1},\ldots,q_{l}$ of the empirical distribution of the sample $X_{1},\ldots,X_{n}$. For each $i$, denote by $k_{i}$ the number of threshold excesses. Assuming that a threshold $u_{i}$ leads to GPD parameter estimates $(\hat{\xi}_{u_{i}},\hat{\sigma}_{u_{i}})$ for the distribution of the excesses $\mathcal{Y}_{u_{i}}=\{X_{m}-u_{i}:X_{m}>u_{i},\ m=1,\ldots,n\}$, the Anderson-Darling test statistic for such threshold is

$$
A_{i}^{2}=-k_{i}-\frac{1}{k_{i}}\sum_{j=1}^{k_{i}}(2j-1)\left[\log\left(Z_{(j,i)}\right)+\log\left(1-Z_{(k_{i}+1-j,i)}\right)\right],
\tag{6.1}
$$

where $Z_{(j,i)}=G_{\hat{\xi}_{u_{i}},\hat{\sigma}_{u_{i}}}(Y_{(j,i)})$, with $Y_{(j,i)}$ being the $j^{th}$ smallest excess value, i.e. among values in $\mathcal{Y}_{u_{i}}$. The automated selection procedure attempts to use the smallest possible threshold for which no threshold above would be deemed inadequate. In the application, we choose $l=20$, $q_{1}=0.79$, $q_{2}=0.80,\ldots,q_{20}=0.98$ and $\xi_{max}=0.9$.
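The statistic (6.1) can be evaluated with a few lines of code once the GPD parameters have been estimated. The sketch below is a direct transcription for a single threshold, using the usual GPD distribution function $G_{\xi,\sigma}(y)=1-(1+\xi y/\sigma)^{-1/\xi}$ (with the exponential limit when $\xi=0$). The function names are hypothetical, and the sketch assumes that every excess lies strictly inside the support of the fitted GPD, so that both logarithms are finite.

```python
import math

def gpd_cdf(y, xi, sigma):
    """Distribution function of the GPD with shape xi and scale sigma."""
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-y / sigma)
    return 1.0 - (1.0 + xi * y / sigma) ** (-1.0 / xi)

def anderson_darling(excesses, xi, sigma):
    """Anderson-Darling statistic (6.1) for excesses over a single threshold."""
    ys = sorted(excesses)
    k = len(ys)
    z = [gpd_cdf(y, xi, sigma) for y in ys]
    # z[j - 1] is Z_(j) and z[k - j] is Z_(k + 1 - j) in 1-based notation.
    total = sum((2 * j - 1) * (math.log(z[j - 1]) + math.log(1.0 - z[k - j]))
                for j in range(1, k + 1))
    return -k - total / k

# Quantile levels used in the application: 0.79, 0.80, ..., 0.98.
levels = [round(0.79 + 0.01 * i, 2) for i in range(20)]

# Hypothetical excesses and parameter estimates (illustration only).
print(anderson_darling([0.1, 0.4, 0.9, 1.7, 3.2], xi=0.2, sigma=1.0))
```

Converting the statistic into a $p$-value requires tabulated critical values, which are not reproduced in this sketch.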

Remark 6.1.

If the automated threshold selection procedure is unsuccessful, i.e. $I=\emptyset$, the sample averaging estimate in (2.5) is used for CVaR as a fallback estimate.

Algorithm 2 Automated threshold selection procedure (from Troop et al., (2021))
Input: Significance parameter $\gamma$, cutoff $\xi_{\text{max}}<1$, i.i.d. sample $X_{1},\ldots,X_{n}$, threshold percentiles $0<q_{1},\ldots,q_{l}<1$.
$I\leftarrow\emptyset$
for $i=1,\ldots,l$ do
     Set $u_{i}=\hat{F}^{-1}(q_{i})$
     Compute $(\hat{\xi}_{u_{i}},\hat{\sigma}_{u_{i}})$ from the $k_{i}$ threshold excesses over $u_{i}$
     if $\hat{\xi}_{u_{i}}\leq\xi_{\text{max}}$ then
          Compute $A_{i}^{2}$ using (6.1)
          Set $p_{i}$ to $p$-value for $A_{i}^{2}$ using a lookup table
          $I\leftarrow I\cup\{i\}$
if $I\neq\emptyset$ then
     Set $W=\{w\in I\,|\,-\frac{1}{w}\sum_{i=1}^{w}\log(1-p_{i})\leq\gamma\}$
     if $W\neq\emptyset$ then
          Compute $\hat{w}_{F}=\max W$
          if $\hat{w}_{F}=\max(I)$ then
               $v\leftarrow\max(I)$
          else
               $v\leftarrow\min\{w\in I\,|\,w>\hat{w}_{F}\}$
     else
          $v\leftarrow\min(I)$
     $u\leftarrow u_{v}$
     Return $u_{v}$, $(\hat{\xi}_{u_{v}},\hat{\sigma}_{u_{v}})$
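The selection step of Algorithm 2 (the part executed once the set $I$ and the $p$-values are available) can be sketched as below. The $p$-values are taken as inputs because the lookup table is not reproduced here. One point is an assumption of this sketch: the running average in the definition of $W$ is computed over the admissible indices in increasing order, which coincides with the formula of Algorithm 2 when $I=\{1,\ldots,l\}$. The function name and the numerical $p$-values are hypothetical.

```python
import math

def select_threshold_index(p_values, gamma):
    """Return the selected index v, or None when no threshold is admissible.

    p_values maps each admissible index i (the set I) to its p-value p_i.
    """
    admissible = sorted(p_values)
    if not admissible:
        return None                      # caller falls back to sample averaging
    w_set = []
    running = 0.0
    for rank, i in enumerate(admissible, start=1):
        p = min(p_values[i], 1.0 - 1e-12)  # guard against log(0) when p_i = 1
        running += -math.log1p(-p)         # adds -log(1 - p_i)
        if running / rank <= gamma:
            w_set.append(i)
    if not w_set:
        return admissible[0]             # v = min(I)
    w_f = max(w_set)
    if w_f == admissible[-1]:
        return admissible[-1]            # v = max(I)
    return min(i for i in admissible if i > w_f)

# Hypothetical p-values for three admissible thresholds (illustration only).
print(select_threshold_index({1: 0.01, 2: 0.02, 3: 0.9}, gamma=0.1))
```

Returning `None` for an empty set of admissible thresholds mirrors Remark 6.1, where the sample averaging estimate serves as the fallback.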

References

  • Bader et al., (2018) Bader, B., Yan, J., and Zhang, X. (2018). Automated threshold selection for extreme value analysis via ordered goodness-of-fit tests with adjustment for false discovery rate. The Annals of Applied Statistics, 12(1):310–329.
  • Bairakdar et al., (2024) Bairakdar, R., Godin, F., Mailhot, M., and Yang, F. (2024). Estimation of generalized tail distortion risk measures with applications in reinsurance. available on SSRN.
  • Balkema and De Haan, (1974) Balkema, A. A. and De Haan, L. (1974). Residual life time at great age. The Annals of Probability, 2(5):792–804.
  • Borkar, (2001) Borkar, V. S. (2001). A sensitivity formula for risk-sensitive cost and the actor–critic algorithm. Systems & Control Letters, 44(5):339–346.
  • Buehler et al., (2019) Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8):1271–1291.
  • Cao et al., (2023) Cao, J., Chen, J., Farghadani, S., Hull, J., Poulos, Z., Wang, Z., and Yuan, J. (2023). Gamma and vega hedging using deep distributional reinforcement learning. Frontiers in Artificial Intelligence, 6:1129370.
  • Carbonneau and Godin, (2021) Carbonneau, A. and Godin, F. (2021). Equal risk pricing of derivatives with deep hedging. Quantitative Finance, 21(4):593–608.
  • Chow et al., (2018) Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2018). Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–51.
  • Coache et al., (2023) Coache, A., Jaimungal, S., and Cartea, Á. (2023). Conditionally elicitable dynamic risk measures for deep reinforcement learning. SIAM Journal on Financial Mathematics, 14(4):1249–1289.
  • Coles et al., (2001) Coles, S., Bawa, J., Trenner, L., and Dorazio, P. (2001). An introduction to statistical modeling of extreme values, volume 208. Springer.
  • Godin, (2016) Godin, F. (2016). Minimizing CVaR in global dynamic hedging with transaction costs. Quantitative Finance, 16(3):461–475.
  • Godin et al., (2012) Godin, F., Mayoral, S., and Morales, M. (2012). Contingent claim pricing using a normal inverse Gaussian probability distortion operator. Journal of Risk and Insurance, 79(3):841–866.
  • Greenberg et al., (2022) Greenberg, I., Chow, Y., Ghavamzadeh, M., and Mannor, S. (2022). Efficient risk-averse reinforcement learning. Advances in Neural Information Processing Systems, 35:32639–32652.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • La and Ghavamzadeh, (2013) La, P. and Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. Advances in Neural Information Processing Systems, 26.
  • McNeil et al., (2015) McNeil, A. J., Frey, R., and Embrechts, P. (2015). Quantitative risk management: Concepts, techniques and tools. Revised edition. Princeton University Press.
  • Pickands III, (1975) Pickands III, J. (1975). Statistical inference using extreme order statistics. The Annals of Statistics, 3(1):119–131.
  • Prashanth et al., (2022) Prashanth, L. A. and Fu, M. C. (2022). Risk-sensitive reinforcement learning via policy gradient search. Foundations and Trends® in Machine Learning, 15(5):537–693.
  • Rockafellar and Uryasev, (2002) Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471.
  • Saeed Marzban and Li, (2023) Marzban, S., Delage, E., and Li, J. Y.-M. (2023). Deep reinforcement learning for option pricing and hedging under dynamic expectile risk measures. Quantitative Finance, 23(10):1411–1430.
  • Schoutens, (2003) Schoutens, W. (2003). Lévy processes in finance: Pricing financial derivatives. Wiley.
  • Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
  • Tamar et al., (2012) Tamar, A., Di Castro, D., and Mannor, S. (2012). Policy gradients with variance related risk criteria. In Proceedings of the 29th International Conference on Machine Learning, pages 387–396.
  • Troop et al., (2021) Troop, D., Godin, F., and Yu, J. Y. (2021). Bias-corrected peaks-over-threshold estimation of the CVaR. In Uncertainty in Artificial Intelligence, pages 1809–1818. PMLR.
  • Troop et al., (2022) Troop, D., Godin, F., and Yu, J. Y. (2022). Best-arm identification using extreme value theory estimates of the CVaR. Journal of Risk and Financial Management, 15(4):172.
  • Vijayan and Prashanth, (2023) Vijayan, N. and Prashanth, L. (2023). A policy gradient approach for optimization of smooth risk measures. In Uncertainty in Artificial Intelligence, pages 2168–2178. PMLR.
  • Wu and Lin, (1999) Wu, C. and Lin, Y. (1999). Minimizing risk models in Markov decision processes with policies depending on target values. Journal of Mathematical Analysis and Applications, 231(1):47–67.
  • Wu and Jaimungal, (2023) Wu, D. and Jaimungal, S. (2023). Robust risk-aware option hedging. Applied Mathematical Finance, 30(3):153–174.