Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients
Acknowledgements: We thank the Natural Sciences and Engineering Research Council of Canada (Davar: FIN-ML NSERC CREATE program, Godin: RGPIN-2017-06837 and RGPIN-2024-04593, Garrido: RGPIN-2017-06643) for their financial support.
Abstract
This paper tackles the problem of mitigating catastrophic risk (i.e. risk with very low frequency but very high severity) in the context of a sequential decision making process. This problem is particularly challenging due to the scarcity of observations in the far tail of the distribution of cumulative costs (negative rewards). We develop a policy gradient algorithm, called POTPG, which is based on approximations of the tail risk derived from extreme value theory. Numerical experiments highlight the outperformance of our method over common benchmarks that rely on the empirical distribution. An application to financial risk management, more precisely to the dynamic hedging of a financial option, is presented.
Keywords: Risk-aware reinforcement learning, catastrophic risk, extreme value theory, peaks-over-threshold (POT), hedging.
1 Introduction
Reinforcement learning (RL) is a set of methods for optimizing sequential decision processes through interactions with an environment. In traditional RL (see Sutton and Barto, 2018), the primary objective is to maximize expected rewards. However, a subset of RL techniques, referred to as risk-aware reinforcement learning, also aims to take risk into account (i.e. departures from the expected case); see for example Wu and Lin (1999), Borkar (2001), Tamar et al. (2012), La and Ghavamzadeh (2013), Chow et al. (2018), Greenberg et al. (2022) and Vijayan and Prashanth (2023). Integrating risk mitigation within the RL framework is of paramount importance in several areas, as policies producing high expected rewards together with high risk might be unacceptable in certain circumstances. Financial risk management is a key area where risk-aware RL methods are developed; see for instance Buehler et al. (2019), Carbonneau and Godin (2021), Cao et al. (2023) and Wu and Jaimungal (2023).
The present work is concerned with problems involving the minimization of catastrophic risk in a sequential decision process, i.e. the risk associated with outcomes that are very rare but of extreme magnitude. Since such extreme events can cause very undesirable outcomes depending on the area of application, such as financial ruin, health-impeding consequences or accidents, mitigating their impact is very important. Another example of application is the measurement of capital requirements in finance or insurance, which are based on the average of outcomes in the very worst-case scenarios; minimizing capital requirements for a financial institution is a key determinant of its profitability, as capital is costly to hold.
Here, extreme risk is quantified through risk measures reflecting the far tail of the distribution of total costs incurred by the agent. In particular, we consider the special case of the conditional Value-at-Risk, $\mathrm{CVaR}_\alpha$, which represents the average outcome among the worst possible set of scenarios with probability $1-\alpha$. The main motivation of the work is that the CVaR with a very high confidence level is very poorly approximated with the empirical distribution, due to the scarcity of observations in the far tail. Such paucity can be caused either by the lack of extreme observations, or by the inability to generate a sufficient number of scenarios that include enough extreme data points in a reasonable time frame. (Importance sampling (IS) methods can sometimes help with this issue if a scenario generator is used. However, suitably improving performance with IS requires knowing the direction in which to tilt risk driver (i.e. state) distributions to produce extreme outcomes. Such information is not necessarily known in the context of highly complex and non-linear dynamics, for instance the optimization of a large financial portfolio with non-linear instruments such as exotic options, and methods alternative to IS would be required in such cases.) In the most acute cases, extreme outcomes might even be outside the data range, by not having materialized yet. The scarcity of tail observations can be exacerbated if the cost outcomes from the problem have fat-tailed distributions.
Our main contribution is to develop a policy gradient method for risk-aware RL problems that is tailor-made for cases where catastrophic-level risk must be minimized. We refer to our algorithm as POTPG, as it relies on the peaks-over-threshold (POT) approach of extreme value theory (EVT), which allows extrapolating the far tail behavior of a distribution through asymptotic approximations leveraging the distribution of large (but not extreme) outcomes. To the best of our knowledge, we are the first to incorporate EVT results within reinforcement learning algorithms to tackle general sequential decision problems; our work can be seen as an extension of Troop et al. (2022), which explores catastrophic risk minimization within the multi-armed bandits framework but does not tackle the more general Markov decision problem setting.
The paper is divided as follows. Section 2 describes the risk-aware sequential decision making problem considered here, and provides a conventional policy gradient algorithm to tackle the problem. Section 3 proposes our POT policy gradient (POTPG) approach, a modified policy gradient algorithm based on extreme value theory estimates of the tail risk of a distribution. This algorithm is tailor-made to tackle catastrophic-level risk minimization. Section 4 benchmarks the performance of POTPG against the conventional approach in a controlled environment, whereas Section 5 assesses its performance in a financial risk management application, namely option hedging optimization. The paper concludes with some final remarks in Section 6. The Python code to replicate the various numerical experiments of this paper is available at https://github.com/parisadavar/EVT-policy-gradient-RL.
2 A risk-aware reinforcement learning problem and policy gradients
We herein consider the framework of Markov decision processes to represent sequential decision problems (this work generalizes to non-Markovian state transition dynamics in a straightforward way). Such problems are represented by a set of time steps $\mathcal{T} = \{0, \ldots, T\}$, a state space $\mathcal{S}$, an action space $\mathcal{A}$, a cost space $\mathcal{C}$ and a sequence of transition probabilities characterizing the joint distribution of the next-step cost and state, given the current state and action, namely $\mathbb{P}[S_{t+1} = s', C_{t+1} = c \mid S_t = s, A_t = a]$ for $t \in \{0, \ldots, T-1\}$, $s, s' \in \mathcal{S}$, $a \in \mathcal{A}$ and $c \in \mathcal{C}$.
Without loss of generality, deterministic policies are considered in this work. Such a framework gives rise to random state-action-cost sequences of the form $\{S_t\}_{t=0}^{T}$, $\{A_t\}_{t=0}^{T-1}$, $\{C_t\}_{t=1}^{T}$, where at any time point $t$ the agent takes action $A_t$ when encountering state $S_t$, at a cost of $C_{t+1}$, then the next-stage state $S_{t+1}$ is drawn randomly from the probability measure $\mathbb{P}[\,\cdot \mid S_t, A_t]$.
2.1 A risk-aware reinforcement learning problem
The risk-aware reinforcement learning problem considered here is to find the optimal policy minimizing the risk associated with the cumulative discounted cost. (Other formulations of risk-aware RL problems exist, such as maximizing the expected rewards under some risk constraints (see for instance Prashanth et al., 2022), or using dynamic risk measures leading to time-consistent dynamic programs (see Saeed Marzban and Li, 2023; Coache et al., 2023).) Denoting costs as $C_t^{\pi}$ to highlight their dependence on policy $\pi$, the problem considered can be written as

$$\min_{\pi} \; \rho\left(\sum_{t=1}^{T} \gamma^{t} C_t^{\pi}\right) \tag{2.1}$$

for some discount factor $\gamma \in (0, 1]$ and a risk measure $\rho$ mapping random variables into perceived risk. In the classic non-risk-aware case (see for instance Sutton and Barto, 2018), the risk measure is the expectation operator, namely $\rho(X) = \mathbb{E}[X]$. However, more general risk measures can be used to depict preferences of risk-aware agents. Since this work is concerned with catastrophic risk mitigation, we consider the specific case of the CVaR risk measure (Rockafellar and Uryasev, 2002) depicting tail risk and defined as

$$\mathrm{CVaR}_\alpha(X) = \frac{1}{1-\alpha} \int_\alpha^1 \mathrm{VaR}_s(X) \, \mathrm{d}s, \qquad \mathrm{VaR}_s(X) = \inf\{x \in \mathbb{R} : F_X(x) \geq s\},$$

where $F_X$ is the cumulative distribution function (CDF) of $X$ and $\mathrm{VaR}_s(X)$ is its quantile at level $s$. In what follows we write $q_\alpha = \mathrm{VaR}_\alpha(X)$. When $X$ is an absolutely continuous random variable, as in this work, then CVaR has the alternative representation $\mathrm{CVaR}_\alpha(X) = \mathbb{E}[X \mid X \geq q_\alpha]$, which can be interpreted as the average outcome among the set of the worst-case scenarios having total probability $1-\alpha$. This work considers catastrophic risk minimization, and as such, we consider very high levels for $\alpha$, i.e. very close to one.
2.2 A policy gradient solution approach
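A natural approach to solve (2.1) is policy gradient methods. Policies are first restricted to a set of parametric policies $\{\pi_\theta\}$ with parameter vector $\theta$. In that case, Problem (2.1) reduces to solving

$$\min_{\theta} \; J(\theta), \qquad J(\theta) = \mathrm{CVaR}_\alpha\left(\sum_{t=1}^{T} \gamma^{t} C_t^{\pi_\theta}\right). \tag{2.2}$$

A common solution approach to the above problem is to use batch stochastic gradient descent, which leads to a sequence of parameter vectors $\{\theta_j\}_{j \geq 0}$ obtained through

$$\theta_{j+1} = \theta_j - \eta_j \, \widehat{\nabla J}(\theta_j),$$

with $\{\eta_j\}_{j \geq 0}$ representing the learning schedule and $\widehat{\nabla J}(\theta_j)$ representing a suitable (stochastic) approximation of the gradient of the objective function $J$. Here we use the celebrated ADAM algorithm of Kingma and Ba (2014), whose step size parameter determines the learning rate sequences $\{\eta_j\}_{j \geq 0}$.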
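To approximate the gradient, a (forward) finite difference approach is used here: for some small $h > 0$,

$$\widehat{\nabla J}(\theta) = \left[\hat{G}_1(\theta), \ldots, \hat{G}_d(\theta)\right]^\top, \tag{2.3}$$

$$\hat{G}_i(\theta) = \frac{\hat{J}(\theta + h e_i) - \hat{J}(\theta)}{h}, \quad i = 1, \ldots, d, \tag{2.4}$$

with $d$ being the dimension of the parameter vector $\theta$, $e_i$ being the dummy vector containing zeroes, except for its $i$-th element that is equal to one, and $\hat{J}(\theta)$ being an estimate of the objective function in (2.2) under parameter $\theta$. The objective function is approximated by sampling $N$ independent copies $X_1, \ldots, X_N$ of the cumulative costs obtained from a Monte-Carlo simulation, if a simulator of the environment is available, or alternatively through the application of the policy to real data, either in an online or offline fashion.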
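The forward finite-difference scheme can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `forward_difference_gradient`, the toy objective and the default value of `h` are assumptions introduced here, and any estimator of the objective (sample averaging or POT-based) could be passed as `objective`.

```python
from typing import Callable, List, Sequence


def forward_difference_gradient(
    objective: Callable[[Sequence[float]], float],
    theta: Sequence[float],
    h: float = 1e-2,  # illustrative default, not a value from the paper
) -> List[float]:
    """Forward finite-difference approximation of the gradient at theta."""
    base = objective(theta)  # estimate of the objective at the current parameter
    grad = []
    for i in range(len(theta)):
        shocked = list(theta)
        shocked[i] += h  # theta + h * e_i
        grad.append((objective(shocked) - base) / h)
    return grad


# Usage on a deterministic toy objective (hypothetical, for illustration only).
def toy_objective(th: Sequence[float]) -> float:
    return (th[0] - 1.0) ** 2 + 3.0 * th[1]


toy_gradient = forward_difference_gradient(toy_objective, [0.0, 0.0], h=1e-4)
```

The scheme needs $d + 1$ objective evaluations per iteration, which is why it is practical mainly for low-dimensional parameter vectors.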
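The most natural approach to obtain the estimate of the objective function consists in assuming that the empirical distribution of cumulative costs obtained through the mini-batch is close to the true distribution. Such a method is referred to as the sample averaging method and relies on

$$\widehat{\mathrm{CVaR}}^{\mathrm{SA}}_\alpha = \hat{q}_\alpha + \frac{1}{N(1-\alpha)} \sum_{i=1}^{N} (X_i - \hat{q}_\alpha)^+, \tag{2.5}$$

where $(x)^+ = \max(x, 0)$ and, denoting by $X_{(1)} \leq \ldots \leq X_{(N)}$ the order statistics of the sample $X_1, \ldots, X_N$ (i.e. the sample sorted in increasing order), $\hat{q}_\alpha$ is the empirical quantile given by:

$$\hat{q}_\alpha = \min\left\{X_{(i)} : \hat{F}_N(X_{(i)}) \geq \alpha\right\}, \tag{2.6}$$

with $\hat{F}_N$ being the empirical CDF of $X$ given by $\hat{F}_N(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\{X_i \leq x\}}$.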
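A minimal sketch of the sample averaging estimator is given below. It assumes the standard empirical form in which the CVaR estimate is the empirical quantile plus the sum of excesses above it divided by $N(1-\alpha)$; the function names and the small floating-point tolerance are assumptions made for this illustration.

```python
import math
from typing import Sequence


def empirical_quantile(sample: Sequence[float], alpha: float) -> float:
    """Smallest order statistic whose empirical CDF value is at least alpha."""
    xs = sorted(sample)
    n = len(xs)
    # Index i such that i / n >= alpha; the tolerance guards against rounding.
    i = max(1, math.ceil(alpha * n - 1e-9))
    return xs[i - 1]


def cvar_sample_average(sample: Sequence[float], alpha: float) -> float:
    """Sample averaging estimate of CVaR at level alpha."""
    q = empirical_quantile(sample, alpha)
    n = len(sample)
    total_excess = sum(max(x - q, 0.0) for x in sample)
    return q + total_excess / (n * (1.0 - alpha))


# Usage: with ten observations and alpha = 0.8, only the two largest
# observations (9 and 10) drive the estimate.
demo_sample = [float(k) for k in range(1, 11)]
demo_quantile = empirical_quantile(demo_sample, 0.8)
demo_cvar = cvar_sample_average(demo_sample, 0.8)
```

When $N(1-\alpha)$ is an integer, the estimate reduces to the average of the $N(1-\alpha)$ largest observations, which makes its dependence on very few data points explicit.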
Unfortunately, when $\alpha$ is very close to one, the scarcity of tail observations can make the sample averaging approach very unstable in estimating the objective function, a problem which is exacerbated if the distribution of cumulative costs is heavy-tailed. This justifies the development of the EVT-based estimator described in the next section.
3 Integrating extreme value theory estimates into policy gradients
This section first discusses the construction of CVaR estimates based on the peaks-over-threshold (POT) approach rooted in extreme value theory (EVT). (Alternative methods also based on EVT, such as that of Bairakdar et al. (2024), could also have been contemplated.) The POT approach is discussed more in-depth in Coles et al. (2001) or McNeil et al. (2015). The procedure integrating such estimates into policy gradient approaches is subsequently detailed.
3.1 Estimation of CVaR with the peaks-over-threshold approach
A wide set of distributions satisfy the following condition.
Definition 3.1.
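A CDF $F$ is said to be in the maximum domain of attraction of the generalized extreme value distribution (GEVD) $H_\xi$ with parameter $\xi \in \mathbb{R}$, denoted $F \in \mathrm{MDA}(H_\xi)$, if there exist a sequence of positive numbers $\{a_n\}_{n \geq 1}$ and a sequence of real numbers $\{b_n\}_{n \geq 1}$ such that

$$\lim_{n \to \infty} F^n(a_n x + b_n) = H_\xi(x), \quad x \in \mathbb{R}. \tag{3.1}$$

Here the CDF of the GEVD is given by $H_\xi(x) = \exp\left(-(1 + \xi x)^{-1/\xi}\right)$ if $\xi \neq 0$, or $H_0(x) = \exp\left(-e^{-x}\right)$ if $\xi = 0$, with support $\{x : 1 + \xi x > 0\}$.
Note that $F^n$ is the CDF of the maximum of $n$ i.i.d. copies of a random variable with CDF $F$.
The property $F \in \mathrm{MDA}(H_\xi)$ characterizes the asymptotic behavior of distribution $F$. Indeed, define $F_u$, the distribution of excesses above threshold $u$ of a random variable $X$ with CDF $F$, as

$$F_u(y) = \mathbb{P}[X - u \leq y \mid X > u] = \frac{F(u + y) - F(u)}{1 - F(u)}, \quad y \geq 0. \tag{3.2}$$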
Define also the generalized Pareto distribution (GPD) as follows.
Definition 3.2.
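The GPD with scale parameter $\sigma > 0$ and shape parameter $\xi \in \mathbb{R}$ has a CDF

$$G_{\xi,\sigma}(y) = \begin{cases} 1 - \left(1 + \dfrac{\xi y}{\sigma}\right)^{-1/\xi}, & \xi \neq 0, \\[2mm] 1 - e^{-y/\sigma}, & \xi = 0, \end{cases} \tag{3.3}$$

where the support is $y \geq 0$, for $\xi \geq 0$, and $0 \leq y \leq -\sigma/\xi$, for $\xi < 0$, and a probability density function (PDF)

$$g_{\xi,\sigma}(y) = \begin{cases} \dfrac{1}{\sigma}\left(1 + \dfrac{\xi y}{\sigma}\right)^{-1/\xi - 1}, & \xi \neq 0, \\[2mm] \dfrac{1}{\sigma} e^{-y/\sigma}, & \xi = 0. \end{cases} \tag{3.4}$$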
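Then the following result from Balkema and De Haan (1974) or Pickands III (1975) states that when $F \in \mathrm{MDA}(H_\xi)$, the excess distribution $F_u$ is well-approximated by a GPD when the threshold $u$ is close to the essential supremum of distribution $F$.
Theorem 3.1 (Pickands–Balkema–de Haan).
If $F \in \mathrm{MDA}(H_\xi)$, there exists a positive measurable function $\sigma(u)$ such that

$$\lim_{u \to x_F} \; \sup_{0 \leq y < x_F - u} \left| F_u(y) - G_{\xi,\sigma(u)}(y) \right| = 0, \tag{3.5}$$

where $x_F = \sup\{x \in \mathbb{R} : F(x) < 1\}$ is the essential supremum of $F$ and $G_{\xi,\sigma(u)}$ is the GPD CDF in (3.3).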
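As described in Section 7.2 of McNeil et al. (2015), such a result allows defining the following approximation for the CVaR of the variable $X$ with CDF $F$, which is based on the assumption that $F_u(y) = G_{\xi,\sigma}(y)$ for $y \geq 0$, i.e. if $u$ is large enough. (Note that the condition $\xi < 1$ is required for the CVaR to exist, otherwise the GPD has an infinite expectation.)
Corollary 3.1.
Assume that $F_u = G_{\xi,\sigma}$ for some $u$, and that $\xi < 1$. Let $F$ satisfy conditions of Theorem 3.1. Then for $\alpha \geq F(u)$ and $\xi \neq 0$,

$$\mathrm{CVaR}_\alpha(X) = u + \frac{\sigma}{1 - \xi}\left[1 + \frac{1}{\xi}\left(\left(\frac{1 - \alpha}{1 - F(u)}\right)^{-\xi} - 1\right)\right]. \tag{3.6}$$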
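The GPD tail approximation of the quantile and of the CVaR can be evaluated directly once the threshold, the GPD parameters and the exceedance probability are known. The sketch below uses the standard POT expressions from the risk management literature, with the quantile $q_\alpha = u + \frac{\sigma}{\xi}\left[\left(\frac{1-\alpha}{1-F(u)}\right)^{-\xi} - 1\right]$ and $\mathrm{CVaR}_\alpha = (q_\alpha + \sigma - \xi u)/(1 - \xi)$; the function name `pot_var_cvar` and its argument names are assumptions made for this illustration.

```python
import math
from typing import Tuple


def pot_var_cvar(
    u: float, xi: float, sigma: float, tail_prob: float, alpha: float
) -> Tuple[float, float]:
    """VaR and CVaR at level alpha under a GPD tail above threshold u.

    tail_prob is the exceedance probability 1 - F(u).
    """
    if not 0.0 < tail_prob <= 1.0:
        raise ValueError("tail_prob must lie in (0, 1]")
    if 1.0 - alpha > tail_prob:
        raise ValueError("alpha must be at least F(u)")
    if xi >= 1.0:
        raise ValueError("CVaR is infinite when xi >= 1")
    s = (1.0 - alpha) / tail_prob
    if abs(xi) < 1e-12:
        var_alpha = u - sigma * math.log(s)  # exponential tail (xi = 0 limit)
    else:
        var_alpha = u + sigma / xi * (s ** (-xi) - 1.0)
    cvar_alpha = (var_alpha + sigma - xi * u) / (1.0 - xi)
    return var_alpha, cvar_alpha


# Usage: a GPD with xi = 0.5 and sigma = 1 observed from threshold u = 0.
var_99, cvar_99 = pot_var_cvar(u=0.0, xi=0.5, sigma=1.0, tail_prob=1.0, alpha=0.99)
```

The factor $1/(1-\xi)$ shows why estimates of $\xi$ close to one make the CVaR estimate explode.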
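This points toward the following procedure, called the peaks-over-threshold approach, to estimate $\mathrm{CVaR}_\alpha(X)$ based on a sample $X_1, \ldots, X_N$ of i.i.d. copies of $X$:
1. Select a proper threshold $u$.
2. Calculate sample values of excesses over threshold $u$, denoted $Y_1, \ldots, Y_{N_u}$ and defined as the values $X_i - u$ for observations such that $X_i > u$, where $N_u$ is the number of sample observations above $u$.
3. Fit a GPD to the sample $Y_1, \ldots, Y_{N_u}$ to get estimates $(\hat{\xi}, \hat{\sigma})$.
4. Replace $(\xi, \sigma)$ and $F(u)$ with respective estimates $(\hat{\xi}, \hat{\sigma})$ and $\hat{F}_N(u) = 1 - N_u/N$ into (3.6) to get an approximation for $\mathrm{CVaR}_\alpha(X)$.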
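Since for any fixed $u$, excesses $Y_1, \ldots, Y_{N_u}$ are independent, Step 3 can be performed through maximum likelihood by solving numerically

$$(\hat{\xi}, \hat{\sigma}) = \arg\max_{(\xi, \sigma)} \sum_{i=1}^{N_u} \log g_{\xi,\sigma}(Y_i), \tag{3.7}$$

with $g_{\xi,\sigma}$ the GPD PDF in (3.4). De-biasing procedures could additionally be applied to adjust maximum likelihood estimates, such as in Troop et al. (2021). Alternatively, a method-of-moments (MOM) estimator matching the first two moments of the GPD with those of the empirical distribution of excesses would lead to

$$\hat{\xi} = \frac{1}{2}\left(1 - \frac{\bar{Y}^2}{S^2}\right), \qquad \hat{\sigma} = \frac{1}{2}\,\bar{Y}\left(\frac{\bar{Y}^2}{S^2} + 1\right), \tag{3.8}$$

with $\bar{Y} = \frac{1}{N_u} \sum_{i=1}^{N_u} Y_i$ and $S^2 = \frac{1}{N_u - 1} \sum_{i=1}^{N_u} (Y_i - \bar{Y})^2$. Here the MOM estimator requires that $\xi < 1/2$ to ensure that the variance of the GPD be finite. Estimators in (3.8) follow because if $Y \sim \mathrm{GPD}(\xi, \sigma)$, then $\mathbb{E}[Y] = \frac{\sigma}{1 - \xi}$ if $\xi < 1$ and $\mathrm{Var}[Y] = \frac{\sigma^2}{(1 - \xi)^2 (1 - 2\xi)}$ if $\xi < 1/2$; they are obtained by equating $\mathbb{E}[Y]$ and $\mathrm{Var}[Y]$ with $\bar{Y}$ and $S^2$, respectively.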
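Steps 2 to 4 of the peaks-over-threshold procedure, combined with the method-of-moments fit, can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the threshold is simply fixed at the 90% empirical quantile instead of being selected automatically, the sample variance uses the $N_u - 1$ normalization, and the GPD sampler, seed and parameter values in the usage part are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def gpd_mom_fit(excesses: Sequence[float]) -> Tuple[float, float]:
    """Method-of-moments GPD fit; returns (xi_hat, sigma_hat)."""
    mean = statistics.fmean(excesses)
    var = statistics.variance(excesses)  # unbiased sample variance
    if var == 0.0:
        raise ValueError("excesses must not all be equal")
    ratio = mean * mean / var
    return 0.5 * (1.0 - ratio), 0.5 * mean * (1.0 + ratio)


def cvar_pot(sample: Sequence[float], alpha: float, u: float) -> float:
    """POT estimate of CVaR at level alpha for a fixed threshold u."""
    excesses = [x - u for x in sample if x > u]  # Step 2
    if len(excesses) < 2:
        raise ValueError("need at least two excesses above the threshold")
    xi, sigma = gpd_mom_fit(excesses)  # Step 3
    tail_prob = len(excesses) / len(sample)  # estimate of 1 - F(u)
    if 1.0 - alpha > tail_prob:
        raise ValueError("alpha must be at least the estimated F(u)")
    s = (1.0 - alpha) / tail_prob
    if abs(xi) < 1e-12:
        q = u - sigma * math.log(s)
    else:
        q = u + sigma / xi * (s ** (-xi) - 1.0)
    return (q + sigma - xi * u) / (1.0 - xi)  # Step 4


# Usage on simulated GPD data (hypothetical parameters, inverse-transform sampling).
rng = random.Random(0)
xi_true, sigma_true, n_obs, level = 0.1, 1.0, 20000, 0.99
sample = [
    sigma_true / xi_true * ((1.0 - rng.random()) ** (-xi_true) - 1.0)
    for _ in range(n_obs)
]
threshold = sorted(sample)[int(0.9 * n_obs)]  # Step 1, fixed here by assumption
cvar_estimate = cvar_pot(sample, level, threshold)
cvar_true = sigma_true / (1.0 - xi_true) * (
    1.0 + ((1.0 - level) ** (-xi_true) - 1.0) / xi_true
)
```

Because the MOM shape estimate is always below $1/2$, the division by $1 - \hat{\xi}$ stays well behaved, which is one plausible reason for its stability relative to maximum likelihood.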
The task in Step 1, namely the selection of a suitable threshold $u$, is challenging, as it entails seeking a proper bias-variance trade-off. Indeed, if $u$ is too low, the distribution tail behavior might not be well-approximated by its asymptotic GPD, leading to high bias. Conversely, choosing a $u$ that is too large will imply a low number of excesses, which will lead to high variance for the GPD parameter estimators. A common approach in the literature is to manually select $u$ through visual inspection of the so-called Hill plot (see McNeil et al., 2015). However, such a method is not appropriate in our setup since the choice of threshold needs to be repeated a very large number of times through the learning phase. As such, we rely on the Bader et al. (2018) algorithm that performs automated selection of the threshold based on a sequence of Anderson-Darling goodness-of-fit tests. Such a procedure tests a set of candidate values $u_1, \ldots, u_k$ and selects as the threshold the smallest among these that leads to a proper fit of the GPD to excesses over $u$. The implementation from Troop et al. (2021) of the procedure is considered here and is detailed in Appendix A. This implementation includes modifications that stabilize estimates, for instance by not allowing estimated values of $\xi$ too close to one (to avoid the CVaR estimate exploding) and by using the sample averaging estimator as a fallback when none of the thresholds leads to a satisfactory fit of the GPD.
3.2 Our proposed EVT policy gradient algorithm
The POT-based CVaR estimation method from Section 3.1 is now integrated into the policy gradient estimation formula in (2.3) to obtain a complete policy gradient learning procedure for the policy parameters $\theta$. This procedure, which we call the POTPG algorithm (standing for peaks-over-threshold policy gradient), is summarized in the Algorithm 1 box below.
If a simulator of the environment is available, it can be desirable, within a given iteration $j$, to use the same random seed to perform the simulation of episodes under policy $\pi_{\theta_j}$ and those under the shocked policies $\pi_{\theta_j + h e_i}$, $i = 1, \ldots, d$. This approach offers the advantage of isolating the impact of the policy alteration (from $\theta_j$ to $\theta_j + h e_i$) from the randomness associated with the generation of episodes; the latter can add noise to the gradient estimate. The same seed is used throughout all the experiments presented here to simulate episodes under the original and shocked policies.
Note also that we propose to use the same threshold $u$ to estimate the objective function at $\theta_j$ and at all shocked parameters $\theta_j + h e_i$ in the POTPG algorithm to enhance the stability in the gradient estimation.
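The two stabilization devices just described (a common random seed and a shared threshold) can be illustrated on a one-parameter toy problem. The sketch below is not the authors' algorithm box: the simulator `simulate_costs`, its scale function $1 + (\theta - 1)^2$, the fixed 90% threshold level, the plain gradient step (used here in place of ADAM) and all numerical values are assumptions chosen for illustration.

```python
import random
import statistics
from typing import List, Sequence


def simulate_costs(theta: float, n: int, seed: int, xi: float = 0.1) -> List[float]:
    """Hypothetical simulator: GPD costs with scale 1 + (theta - 1)^2."""
    rng = random.Random(seed)
    scale = 1.0 + (theta - 1.0) ** 2
    return [scale / xi * ((1.0 - rng.random()) ** (-xi) - 1.0) for _ in range(n)]


def cvar_pot(sample: Sequence[float], alpha: float, u: float) -> float:
    """POT estimate of CVaR with a method-of-moments GPD fit above u."""
    excesses = [x - u for x in sample if x > u]
    mean = statistics.fmean(excesses)
    ratio = mean * mean / statistics.variance(excesses)
    xi, sigma = 0.5 * (1.0 - ratio), 0.5 * mean * (1.0 + ratio)
    s = (1.0 - alpha) * len(sample) / len(excesses)
    q = u + sigma / xi * (s ** (-xi) - 1.0)
    return (q + sigma - xi * u) / (1.0 - xi)


def potpg_step(
    theta: float, lr: float, h: float, n: int, alpha: float, seed: int
) -> float:
    """One policy gradient iteration with a forward finite difference."""
    base = simulate_costs(theta, n, seed)
    shocked = simulate_costs(theta + h, n, seed)  # same seed: common random numbers
    u = sorted(base)[int(0.9 * n)]  # one threshold shared by both estimates
    grad = (cvar_pot(shocked, alpha, u) - cvar_pot(base, alpha, u)) / h
    return theta - lr * grad  # plain gradient step, an assumption replacing ADAM


theta_hat = 3.0
for j in range(60):
    theta_hat = potpg_step(theta_hat, lr=0.01, h=0.1, n=5000, alpha=0.99, seed=j)
```

In this toy problem the scale is smallest at $\theta = 1$, so the iterates should settle near that value, up to the bias of the forward difference and residual estimation noise.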
4 Simulation experiments in a controlled environment
Several simulation experiments in a controlled environment are first conducted to assess the performance of the POTPG algorithm from Section 3.2 and compare it to the conventional sample averaging (SA) benchmark based on (2.5). A simple simulation setting is considered to establish a proof of concept and highlight the potential usefulness of the POTPG algorithm. In such a setting, we consider a single-dimension policy vector (i.e. $d = 1$), and we assume the cumulative discounted cost is distributed according to a given family of distributions whose parameters depend on $\theta$.
More precisely, assume that under policy , GPD for some , and .101010Here, to avoid confusion, we use instead of to represent the scale parameter of the whole distribution instead of that of the tail. Fix and and consider values , or in subsequent experiments. Note that if , then the conditional exceedance , meaning that the excess distribution of a GPD random variable is a GPD with the same shape parameter, and a scaling parameter that grows linearly with the threshold . In that case, representing the tail distribution with a GPD is exact and not merely an asymptotic approximation. Such setting is used to test the POTPG algorithm in an ideal case with no misspecification of the tail distribution. The CVaR of the distribution with policy is then
since . Therefore,
i.e. the optimal policy is to set to minimize the scale parameter of the cumulative discounted costs.
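Two standard GPD facts underlie this setting: if $X \sim \mathrm{GPD}(\xi, \sigma)$, the excess $X - u$ given $X > u$ is $\mathrm{GPD}(\xi, \sigma + \xi u)$, and the CVaR of a GPD variable is proportional to its scale parameter, so minimizing the CVaR amounts to minimizing the scale. The sketch below checks both numerically; the function names and the parameter values are assumptions made for this illustration.

```python
import math


def gpd_cdf(y: float, xi: float, sigma: float) -> float:
    """CDF of the GPD with shape xi and scale sigma."""
    if y <= 0.0:
        return 0.0
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-y / sigma)
    base = 1.0 + xi * y / sigma
    if base <= 0.0:  # beyond the upper endpoint when xi < 0
        return 1.0
    return 1.0 - base ** (-1.0 / xi)


def excess_cdf(y: float, u: float, xi: float, sigma: float) -> float:
    """CDF of X - u given X > u when X follows GPD(xi, sigma)."""
    f_u = gpd_cdf(u, xi, sigma)
    return (gpd_cdf(u + y, xi, sigma) - f_u) / (1.0 - f_u)


def gpd_cvar(alpha: float, xi: float, sigma: float) -> float:
    """Closed-form CVaR of GPD(xi, sigma) for xi < 1 and xi != 0."""
    return sigma / (1.0 - xi) * (1.0 + ((1.0 - alpha) ** (-xi) - 1.0) / xi)


# Threshold stability: the excess law is GPD(xi, sigma + xi * u).
lhs = excess_cdf(1.3, u=2.0, xi=0.25, sigma=1.5)
rhs = gpd_cdf(1.3, xi=0.25, sigma=1.5 + 0.25 * 2.0)

# Linearity in the scale: tripling sigma triples the CVaR.
cvar_unit = gpd_cvar(0.99, xi=0.5, sigma=1.0)
cvar_triple = gpd_cvar(0.99, xi=0.5, sigma=3.0)
```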
In each simulation run, we consider iterations, in each of which cumulative discounted cost realizations are generated. A total of runs are performed. The finite difference step size for the gradient computation is . We set the initial policy parameter to . Define and respectively as estimates of the policy parameter and the objective function (CVaR) estimate on the iteration of run . We report the root-mean-square-error (RMSE) across the various runs for each iteration of policy parameters and the objective function associated as:
The CVaR level is chosen to depict catastrophic risk levels.
Figure 1 reports the RMSE of the policy parameter and of the objective function estimate with respect to the iteration number for the three different values of the tail parameter $\xi$. The POTPG outperforms the SA benchmark in all experiments, as the RMSE on the optimal policy parameter decreases faster for the former approach. The extent of outperformance increases when the tail thickness (i.e. parameter $\xi$) increases. This is because sample averaging relies on only four observations, i.e. $N(1-\alpha)$, coming from the tail of the distribution, which is increasingly unstable as tail thickness increases. The POT approach alleviates this issue by using many more observations from the body of the distribution to extrapolate tail behavior. Moreover, even when having converged to the optimal policy, both methods (POTPG and SA) exhibit residual estimation error in the estimate of the objective function (i.e. the CVaR of cumulative discounted costs). Though the objective function RMSE is generally smaller for the POTPG approach, this method outperforms SA significantly in the thicker tail case. In conclusion, the thicker the tail of the costs distribution is, the more useful the POTPG approach is.
[Figure 1: RMSE of the policy parameter and of the objective function estimate versus iteration, for the POTPG and SA methods and for the three values of the tail parameter $\xi$.]
5 Application to financial hedging
We present the application of the POTPG algorithm to a financial risk management problem, namely the dynamic Delta-Gamma hedging of an option. The problem of finding the optimal proportion of the Gamma to neutralize when options are very expensive is discussed.
5.1 The hedging framework
Time elapsed between consecutive time points are assumed to be weeks (period of length year). The periodic continuously compounded interest rate is . With , let denote the time- price of a non-dividend-paying stock, whose dynamics is assumed to be a discrete-time version of an exponential normal-inverse Gaussian (NIG) Lévy process: , with being, under the physical measure , i.i.d. random variables with a NIG distribution whose PDF is given by
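$$f_{\mathrm{NIG}}(x; \alpha, \beta, \delta, \mu) = \frac{\alpha \delta}{\pi} \, \frac{K_1\left(\alpha \sqrt{\delta^2 + (x - \mu)^2}\right)}{\sqrt{\delta^2 + (x - \mu)^2}} \, \exp\left(\delta \sqrt{\alpha^2 - \beta^2} + \beta (x - \mu)\right), \tag{5.1}$$

where $K_\lambda$ represents the modified Bessel function of the second kind with index $\lambda$, defined as:

$$K_\lambda(z) = \frac{1}{2} \int_0^\infty y^{\lambda - 1} \exp\left(-\frac{z}{2}\left(y + y^{-1}\right)\right) \mathrm{d}y. \tag{5.2}$$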
Such a distribution is known to exhibit fat tails and is therefore well-suited to study the extreme risk minimization framework of this study.
Parameters considered are taken from Godin, (2016), namely , , and .
We consider a market with high volatility risk premium where options are costly; as such we assume risk-neutral parameters and identical to the physical ones, except for the delta parameter driving the returns variance, which is inflated by a factor of 4: , , and . In such a market, fully neutralizing the gamma of the option being hedged is most likely sub-optimal, due to high option cost, and thus determining the best hedge ratio yielding the optimal cost versus risk reduction tradeoff is a non-trivial endeavor which is the problem considered in this section.
We assume that any European call option on such stock is priced according to the formula provided in Godin et al. (2012), which is based on the mean-correcting martingale measure described in Schoutens (2003). The time-$t$ price of a European call option with strike $K$ providing the time-$T$ payoff $\max(S_T - K, 0)$ is
(5.3) |
with weeks, and denoting the CDF of the NIG distribution. It is straightforward to compute the Delta and the Gamma of such options:
We consider a financial institution (the hedging agent) which holds a short position in a call option with a strike price and maturity weeks. Such option is referred to as the target option. To mitigate the risk associated with the uncertainty related to its payoff, a self-financing hedging portfolio is used. At any time point, the portfolio is invested in three hedging assets, namely a risk-free account, the stock and an option on the stock. The time- value of the hedging portfolio is denoted (the superscript refers to its dependence on the policy) and evolves according to
with being the respective portfolio positions on time interval in the stock and an option used for hedging, and and being the respective time- and time- price of the hedging option purchased at . The positions are thus rebalanced at each period, and option positions are rolled-over, with the hedging options currently in the portfolio being liquidated at the end of the period while new ones are being purchased. At the start of any period , the option considered for purchase is at-the-money (its strike is and its maturity is , meaning % of a year). As such, and . Note that unless , since options being included in the hedging portfolio change on the various periods. Moreover, is the option premium that is initially invested in the hedging portfolio.
The optimal policy should characterize the selection of positions , to be included in the hedging portfolio. Assume that the agent wants to be fully Delta-neutral, which is obtained with
However, we assume that the agent might prefer not fully neutralizing the Gamma of the target option due to purchases of hedging options being too costly in a market with large volatility risk premium. The agent shall therefore only neutralize a portion , called the hedge ratio, of the target option Gamma. This leads to , and thus to .
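The position rule described above can be sketched as follows: the hedging option position offsets a fraction $\theta$ of the target option Gamma, and the stock position then restores full Delta neutrality. The function and argument names are assumptions, the Greeks are taken as given inputs rather than computed from the NIG pricing formula, and the sign convention (positions replicating the Greeks of the target option) is also an assumption of this illustration.

```python
from typing import Tuple


def delta_gamma_positions(
    theta: float,
    delta_target: float,
    gamma_target: float,
    delta_hedge: float,
    gamma_hedge: float,
) -> Tuple[float, float]:
    """Return (stock_units, option_units) for a partial Gamma hedge.

    theta is the hedge ratio, i.e. the fraction of the target Gamma neutralized.
    """
    if gamma_hedge == 0.0:
        raise ValueError("the hedging option must have a non-zero Gamma")
    option_units = theta * gamma_target / gamma_hedge
    # The stock position absorbs the Delta left after the option position.
    stock_units = delta_target - option_units * delta_hedge
    return stock_units, option_units


# Usage with hypothetical Greeks: neutralize half of the target Gamma.
stock_units, option_units = delta_gamma_positions(
    theta=0.5, delta_target=0.6, gamma_target=0.04, delta_hedge=0.5, gamma_hedge=0.08
)
portfolio_delta = stock_units + option_units * 0.5
portfolio_gamma = option_units * 0.08
```

With $\theta = 0$ the rule reduces to a pure Delta hedge, and with $\theta = 1$ it is a full Delta-Gamma hedge.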
The objective of the hedging agent is therefore to find the optimal hedge ratio, which is the optimal policy parameter . A single terminal cost is considered for the agent: and no discount factor is considered . The agent thus attempts minimizing risk associated with catastrophic hedging shortfalls at maturity: hence consider .
Before applying the reinforcement learning procedure, we want to approximate the objective function , the CVaR0.999 of the hedging shortfall, for various hedge ratios . Such approximations are produced with brute force Monte-Carlo simulations, where for several values of , realizations of the hedging shortfall are produced and sample averaging is applied, i.e. is estimated by the largest realizations. Figure 2 reports such estimates, with the optimal hedge ratio being estimated to be and the corresponding objective function being .
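[Figure 2: Brute-force Monte-Carlo estimates of the $\mathrm{CVaR}_{0.999}$ of the hedging shortfall for various hedge ratios $\theta$.]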
Now apply the POTPG algorithm to the policy optimization problem, and compare its performance to that of the sample averaging method (SA). Such methods are applied with either or simulated paths of weekly stock returns. independent runs are conducted, each comprised of iterations for the case , or iterations when . In each run, the initial policy is set to . The finite difference shock is . The method of moments is used to estimate tail parameters in the POTPG algorithms since such method exhibited (in unreported tests) greater stability than maximum likelihood estimates in the presented framework.
Figure 3 reports the performance of the POTPG and SA policy gradient algorithms for the hedging problem, by displaying the evolution of the RMSE (across runs) of the estimate of the optimal policy parameter ($\mathrm{RMSE}_\theta$) and of the corresponding objective function versus the number of iterations conducted. The POTPG algorithm exhibits materially superior performance, with much lower errors on estimates of the optimal policy parameter and objective function. The gap in performance between the POTPG and the benchmark is greater for the lower sample size, which highlights that our method has more added value in the context of more severe distribution tail data scarcity. Note that for neither method does the estimated policy parameter converge to the true optimal value (i.e. $\mathrm{RMSE}_\theta$ does not converge to zero), which can be explained by the fact that both methods are biased in finite samples. Nevertheless, we see that a higher sample size increases the precision, with lower RMSEs for the estimates of the policy parameter and of the objective function.
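[Figure 3: RMSE of the optimal policy parameter estimate and of the objective function estimate versus the number of iterations, for the POTPG and SA algorithms in the hedging problem and for the two sample sizes.]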
6 Conclusion
We propose a policy gradient algorithm based on estimators of tail risk borrowed from extreme value theory to tackle the difficult task of catastrophic risk minimization within a sequential decision making framework. The peaks-over-threshold procedure is used to estimate the CVaR of cumulative costs by leveraging the asymptotic convergence of the tail distribution to a generalized Pareto distribution. We have shown in several simulation experiments, including an application to financial options hedging, that our method can outperform conventional benchmarks relying on the empirical distribution of the cumulative costs. Indeed, such benchmarks can perform quite poorly to mitigate extreme risk when observations in the tail are scarce.
Our method relies on finite difference approximations for the gradient, and as such it works for low-dimensional policies relying on a small number of parameters. An extension of our approach could consist in developing a high-dimensional EVT-based policy gradient framework to tackle more complex problems. This would for instance allow using policies represented by deep neural networks and combining the EVT-based policy gradient with deep reinforcement learning.
Appendix A The automated threshold selection procedure
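The automated threshold selection procedure involves testing several thresholds $u_1 < \ldots < u_k$, which we choose to be quantiles of pre-determined levels of the empirical distribution of the sample $X_1, \ldots, X_N$. For each $u_j$, denote by $N_{u_j}$ the number of threshold excesses. Assuming that a threshold $u_j$ leads to GPD parameter estimates $(\hat{\xi}_j, \hat{\sigma}_j)$ for the distribution of the excesses, the Anderson-Darling test statistic for such threshold is

$$A_j^2 = -N_{u_j} - \frac{1}{N_{u_j}} \sum_{i=1}^{N_{u_j}} (2i - 1) \left[\log z_{(i)} + \log\left(1 - z_{(N_{u_j} + 1 - i)}\right)\right], \tag{6.1}$$

where $z_{(i)} = G_{\hat{\xi}_j, \hat{\sigma}_j}(y_{(i)})$ with $y_{(i)}$ being the $i$-th smallest excess value, i.e. among values in $\{X_l - u_j : X_l > u_j\}$. The automated selection procedure attempts using the smallest possible threshold for which no threshold above would be deemed inadequate. In the application, we choose , , and .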
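The test statistic can be computed as sketched below, assuming the standard form of the Anderson-Darling statistic applied to the fitted GPD CDF evaluated at the sorted excesses. The function names are assumptions, and the computation of p-values and the sequential stopping rule of the selection procedure are not reproduced here.

```python
import math
from typing import Sequence


def gpd_cdf(y: float, xi: float, sigma: float) -> float:
    """CDF of the GPD with shape xi and scale sigma, for y > 0."""
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-y / sigma)
    return 1.0 - (1.0 + xi * y / sigma) ** (-1.0 / xi)


def anderson_darling_gpd(excesses: Sequence[float], xi: float, sigma: float) -> float:
    """Anderson-Darling statistic of the excesses against a fitted GPD.

    Excesses must be strictly positive and inside the GPD support.
    """
    ys = sorted(excesses)
    n = len(ys)
    z = [gpd_cdf(y, xi, sigma) for y in ys]
    total = sum(
        (2 * i - 1) * (math.log(z[i - 1]) + math.log(1.0 - z[n - i]))
        for i in range(1, n + 1)
    )
    return -n - total / n


# Usage: three excesses placed at the 25%, 50% and 75% quantiles of a unit
# exponential distribution (the GPD with xi = 0 and sigma = 1).
demo_excesses = [-math.log(1.0 - p) for p in (0.25, 0.5, 0.75)]
a2 = anderson_darling_gpd(demo_excesses, xi=0.0, sigma=1.0)
```

Larger values of the statistic indicate a poorer fit of the GPD to the excesses over the candidate threshold.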
Remark 6.1.
If the automated threshold selection procedure is unsuccessful, i.e. , the sample averaging estimate in (2.5) is used for CVaR, as a fallback estimate.
References
- Bader et al., (2018) Bader, B., Yan, J., and Zhang, X. (2018). Automated threshold selection for extreme value analysis via ordered goodness-of-fit tests with adjustment for false discovery rate. The Annals of Applied Statistics, 12(1):310–329.
- Bairakdar et al., (2024) Bairakdar, R., Godin, F., Mailhot, M., and Yang, F. (2024). Estimation of generalized tail distortion risk measures with applications in reinsurance. available on SSRN.
- Balkema and De Haan, (1974) Balkema, A. A. and De Haan, L. (1974). Residual life time at great age. The Annals of Probability, 2(5):792–804.
- Borkar, (2001) Borkar, V. S. (2001). A sensitivity formula for risk-sensitive cost and the actor–critic algorithm. Systems & Control Letters, 44(5):339–346.
- Buehler et al., (2019) Buehler, H., Gonon, L., Teichmann, J., and Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8):1271–1291.
- Cao et al., (2023) Cao, J., Chen, J., Farghadani, S., Hull, J., Poulos, Z., Wang, Z., and Yuan, J. (2023). Gamma and vega hedging using deep distributional reinforcement learning. Frontiers in Artificial Intelligence, 6:1129370.
- Carbonneau and Godin, (2021) Carbonneau, A. and Godin, F. (2021). Equal risk pricing of derivatives with deep hedging. Quantitative Finance, 21(4):593–608.
- Chow et al., (2018) Chow, Y., Ghavamzadeh, M., Janson, L., and Pavone, M. (2018). Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(167):1–51.
- Coache et al., (2023) Coache, A., Jaimungal, S., and Cartea, Á. (2023). Conditionally elicitable dynamic risk measures for deep reinforcement learning. SIAM Journal on Financial Mathematics, 14(4):1249–1289.
- Coles et al., (2001) Coles, S., Bawa, J., Trenner, L., and Dorazio, P. (2001). An introduction to statistical modeling of extreme values, volume 208. Springer.
- Godin, (2016) Godin, F. (2016). Minimizing CVaR in global dynamic hedging with transaction costs. Quantitative Finance, 16(3):461–475.
- Godin et al., (2012) Godin, F., Mayoral, S., and Morales, M. (2012). Contingent claim pricing using a normal inverse Gaussian probability distortion operator. Journal of Risk and Insurance, 79(3):841–866.
- Greenberg et al., (2022) Greenberg, I., Chow, Y., Ghavamzadeh, M., and Mannor, S. (2022). Efficient risk-averse reinforcement learning. Advances in Neural Information Processing Systems, 35:32639–32652.
- Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). ADAM: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- La and Ghavamzadeh, (2013) La, P. and Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. Advances in neural information processing systems, 26.
- McNeil et al., (2015) McNeil, A. J., Frey, R., and Embrechts, P. (2015). Quantitative risk management: concepts, techniques and tools-revised edition. Princeton university press.
- Pickands III, (1975) Pickands III, J. (1975). Statistical inference using extreme order statistics. The Annals of Statistics, pages 119–131.
- Prashanth et al., (2022) Prashanth, L., Fu, M. C., et al. (2022). Risk-sensitive reinforcement learning via policy gradient search. Foundations and Trends® in Machine Learning, 15(5):537–693.
- Rockafellar and Uryasev, (2002) Rockafellar, R. T. and Uryasev, S. (2002). Conditional value-at-risk for general loss distributions. Journal of Banking & Finance, 26(7):1443–1471.
- Saeed Marzban and Li, (2023) Saeed Marzban, E. D. and Li, J. Y.-M. (2023). Deep reinforcement learning for option pricing and hedging under dynamic expectile risk measures. Quantitative Finance, 23(10):1411–1430.
- Schoutens, (2003) Schoutens, W. (2003). Lévy processes in finance: pricing financial derivatives. Wiley Online Library.
- Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Tamar et al., (2012) Tamar, A., Di Castro, D., and Mannor, S. (2012). Policy gradients with variance related risk criteria. In Proceedings of the twenty-ninth international conference on machine learning, pages 387–396.
- Troop et al., (2021) Troop, D., Godin, F., and Yu, J. Y. (2021). Bias-corrected peaks-over-threshold estimation of the cvar. In Uncertainty in Artificial Intelligence, pages 1809–1818. PMLR.
- Troop et al., (2022) Troop, D., Godin, F., and Yu, J. Y. (2022). Best-arm identification using extreme value theory estimates of the CVaR. Journal of Risk and Financial Management, 15(4):172.
- Vijayan and Prashanth, (2023) Vijayan, N. and Prashanth, L. (2023). A policy gradient approach for optimization of smooth risk measures. In Uncertainty in Artificial Intelligence, pages 2168–2178. PMLR.
- Wu and Lin, (1999) Wu, C. and Lin, Y. (1999). Minimizing risk models in Markov decision processes with policies depending on target values. Journal of Mathematical Analysis and Applications, 231(1):47–67.
- Wu and Jaimungal, (2023) Wu, D. and Jaimungal, S. (2023). Robust risk-aware option hedging. Applied Mathematical Finance, 30(3):153–174.