
Deep Learning of Multivariate Extremes
via a Geometric Representation

Callum J. R. Murphy-Barltrop¹,²,∗, Reetam Majumder³, Jordan Richards⁴

¹ Technische Universität Dresden, Institut für Mathematische Stochastik, Dresden, Germany

² Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI), Dresden/Leipzig, Germany

³ Southeast Climate Adaptation Science Center, North Carolina State University, USA

⁴ School of Mathematics, University of Edinburgh, UK

∗ Corresponding author: [email protected]

June 28, 2024

Abstract

The study of geometric extremes, where extremal dependence properties are inferred from the deterministic limiting shapes of scaled sample clouds, provides an exciting approach to modelling the extremes of multivariate data. These shapes, termed limit sets, link together several popular extremal dependence modelling frameworks. Although the geometric approach is becoming an increasingly popular modelling tool, current inference techniques are limited to a low dimensional setting ( $d\leq 3$ ), and generally require rigid modelling assumptions. In this work, we propose a range of novel theoretical results to aid with the implementation of the geometric extremes framework and introduce the first approach to modelling limit sets using deep learning. By leveraging neural networks, we construct asymptotically-justified yet flexible semi-parametric models for extremal dependence of high-dimensional data. We showcase the efficacy of our deep approach by modelling the complex extremal dependencies between meteorological and oceanographic variables in the North Sea off the coast of the UK.

Keywords: extremal dependence; geometric extremes; limit sets; neural networks

1 Introduction

Multivariate extreme value theory is a branch of statistics that deals with understanding relationships between the extremes of multiple variables. A wide variety of modelling approaches for multivariate extremes (or, equivalently, extremal dependence) have been proposed; classical approaches use the framework of regular variation (see, e.g., Tawn, 1990; Rootzén and Tajvidi, 2006; Einmahl and Segers, 2009), but these models are restrictive in the forms of extremal dependence that they can capture (Huser et al., 2024). In particular, they can only capture asymptotic dependence (Coles et al., 1999), where extremes of a random vector occur when all of its components are jointly large. Assuming this form of extremal dependence for data is unrealistic in many applications, and numerous works have advocated against the use of regular variation models for environmental studies (Opitz, 2016; Dawkins and Stephenson, 2018; Huser et al., 2024).

The first approach to modelling non-asymptotically dependent data was provided by Ledford and Tawn (1996, 1997); see also hidden regular variation (Resnick, 2002). Letting $\bm{X}_{E}:=(X_{E,1},\dots,X_{E,d})$ denote a random vector with standard exponential margins, Ledford and Tawn (1996) assume that, as $u\to\infty$,

\Pr\left(\min_{i\in\mathcal{D}}\{X_{E,i}\}>u\right)=L(e^{u})\exp\{-u/\eta\},

(1)

where $\mathcal{D}:=\{1,\ldots,d\}$, $L(\cdot)$ is a slowly varying function, i.e., $\lim_{u\to\infty}L(cu)/L(u)=1$ for any constant $c>0$, and $\eta\in(0,1]$ is termed the coefficient of tail dependence. Under asymptotic dependence, we have $\eta=1$ and $\lim_{u\to\infty}L(u)>0$, with other extremal dependence structures arising when these conditions are not satisfied. In a practical setting, model (1) is only applicable when all components of $\bm{X}_{E}$ are large simultaneously. To overcome this limitation, Wadsworth and Tawn (2013) introduced the angular dependence function (ADF), which generalises the coefficient $\eta$. Consider any angle $\bm{w}:=(w_{1},\dots,w_{d})^{T}\in\mathcal{S}^{d-1}_{+}$, where $\mathcal{S}_{+}^{d-1}:=\{\bm{x}\in\mathbb{R}_{+}^{d}:||\bm{x}||=1\}$ is the strictly positive part of the unit $(d-1)$-sphere with $\|\cdot\|$ the Euclidean norm; Wadsworth and Tawn (2013) assume that

\Pr\left(\min_{i\in\mathcal{D}}\{X_{E,i}/w_{i}\}>u\right)=L(e^{u};\bm{w})e^{-\lambda(\bm{w})u},\;\;\lambda(\bm{w})\geq\max(\bm{w}),

(2)

as $u\to\infty$, where $L(\cdot\;;\bm{w})$ is a slowly varying function and $\lambda(\cdot)$ denotes the ADF; the latter provides information about the joint tail of $\bm{X}_{E}$, and we have $\eta=\{\sqrt{d}\lambda(d^{-1/2},\ldots,d^{-1/2})\}^{-1}$. Model (2) can capture both extremal dependence regimes, with asymptotic dependence implying the lower bound, $\lambda(\bm{w})=\max(\bm{w})$ for all $\bm{w}\in\mathcal{S}^{d-1}_{+}$. Loosely speaking, the angle $\bm{w}$ is the direction in $\mathbb{R}_{+}^{d}$ along which the joint tail region in (2) is defined. This model has been successfully applied in environmental applications by, e.g., Murphy-Barltrop et al. (2023), Murphy-Barltrop and Wadsworth (2024), and Murphy-Barltrop et al. (2024b).
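The link between $\eta$ and the ADF can be checked by evaluating model (2) along the diagonal angle. The short derivation below is our own filling-in of this step; it compares exponential rates only and ignores the slowly varying terms, and the symbol $\bm{w}^{*}$ is introduced here purely for illustration.

```latex
% Diagonal angle (notation w^* is ours):
\bm{w}^{*} := (d^{-1/2},\ldots,d^{-1/2})^{T}, \qquad
\Pr\left(\min_{i\in\mathcal{D}}\{X_{E,i}/w^{*}_{i}\}>u\right)
  = \Pr\left(\min_{i\in\mathcal{D}}\{X_{E,i}\}>u/\sqrt{d}\right).
% Model (1) at level u/\sqrt{d} has exponential rate u/(\sqrt{d}\,\eta);
% model (2) at angle w^* has exponential rate \lambda(w^*) u. Equating the two:
\frac{u}{\sqrt{d}\,\eta} = \lambda(\bm{w}^{*})\,u
\quad\Longrightarrow\quad
\eta = \{\sqrt{d}\,\lambda(\bm{w}^{*})\}^{-1}.
```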

Several of the aforementioned models are defined for random vectors exhibiting standard margins with finite lower bounds, e.g., Pareto, Fréchet, or exponential, which limits the study of extremal dependence. In particular, when applied to random vectors with double-tailed margins, such as Laplace, these modelling approaches reduce the study of extremal dependence to data observed only in the positive orthant, $\mathbb{R}^{d}_{+}$, of $\mathbb{R}^{d}$; see Section 2 for further discussion. This can be restrictive in practical applications where different regions of low joint probability mass may be of interest. Consequently, many recent works have introduced modelling approaches for data on standard Laplace margins. For example, Keef et al. (2013) extended the model of Heffernan and Tawn (2004) to Laplace margins, with the resulting framework providing greater flexibility and interpretability. Moreover, Mackay and Jonathan (2023), Papastathopoulos et al. (2024) and Murphy-Barltrop et al. (2024a) demonstrate that models on Laplace margins permit evaluation of the joint tail behaviour of random vectors in all $2^{d}$ orthants of $\mathbb{R}^{d}$. To demonstrate this, Figure 1 illustrates extreme density contours for a bivariate Gaussian copula with correlation parameter $\rho=-0.5$ on both standard exponential and standard Laplace margins. Hereafter, we use $\bm{X}:=(X_{1},\dots,X_{d})^{T}$ to denote a $d$-dimensional random vector with standard Laplace margins, with distribution function $F_{\bm{X}}(\cdot)$ and continuous density function $f_{\bm{X}}(\cdot)$.

Figure 1: Density contours for a bivariate random vector $\bm{X}$ with a Gaussian copula (with negative correlation) and with standard exponential (left) or standard Laplace (right) margins. The density levels for each colour are given in the legend in the left panel.

Recent theoretical developments for multivariate extremes have focused on geometric extremes, whereby extremal dependence properties of $\bm{X}$ can be inferred directly from the deterministic limiting shapes of scaled sample clouds. Let $C_{n}:={\{\bm{X}_{i}/r_{n}\}^{n}_{i=1}}$ denote $n$ independent copies of $\bm{X}$, scaled by a suitably chosen positive sequence $(r_{n})_{n\in\mathbb{N}}$ satisfying $r_{n}\to\infty$ as $n\to\infty$. Under mild conditions, $C_{n}$ converges in probability, with respect to the Hausdorff metric, onto the compact, star-shaped limit set $\mathcal{G}:=\{\bm{x}:g(\bm{x})\leq 1\}\subset[-1,1]^{d}$, where $g:\mathbb{R}^{d}\mapsto\mathbb{R}_{+}$ is the gauge function of $\mathcal{G}$ (Fisher, 1969; Davis et al., 1988; Kinoshita and Resnick, 1991). A sufficient condition for convergence onto $\mathcal{G}$ is that

-\log f_{\bm{X}}(t\bm{x})\sim tg(\bm{x}),\;\;\bm{x}\in\mathbb{R}^{d},\;\;t\to\infty,

(3)

or, equivalently, $g(\bm{x})=\lim_{t\to\infty}[-\log f_{\bm{X}}(t\bm{x})]/t$ (Balkema and Nolde, 2010). Following Nolde (2014), we also define the unit-level, or boundary, set $\partial\mathcal{G}:=\{\bm{x}:g(\bm{x})=1\}\subset\mathcal{G}$. For standard Laplace margins, it suffices to set $r_{n}=\log(n/2)$ to achieve the required convergence (Papastathopoulos et al., 2024). To demonstrate this, Figure 2 illustrates the limit set $\mathcal{G}$ and unit-level set $\partial\mathcal{G}$ for three copulas, alongside $n=10{,}000$ samples $\{\bm{x}_{i}/\log(n/2)\}_{i=1}^{n}$ from each copula; formal definitions of these copulas are given in Section 4. One can observe that the (finite) observed sample clouds lie approximately within the theoretical limit set. Hereafter, we implicitly assume that any $\bm{X}$ satisfies the conditions for convergence onto $\mathcal{G}$.
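As a small numerical illustration of the limit in (3), consider the simplest case of independent standard Laplace components, for which $f_{\bm{X}}(\bm{x})=2^{-d}\exp(-\|\bm{x}\|_{1})$ and the gauge function is therefore the $L^{1}$ norm. The sketch below is ours and is not part of the DeepGauge implementation; the function names and the choice of a large finite $t$ are illustrative assumptions.

```python
import numpy as np

def neg_log_density_indep_laplace(x):
    """-log f_X(x) for independent standard Laplace margins: f(x) = 2^{-d} exp(-||x||_1)."""
    x = np.asarray(x, dtype=float)
    return x.size * np.log(2.0) + np.abs(x).sum()

def gauge_from_density(neg_log_f, x, t=1e6):
    """Approximate g(x) = lim_{t -> inf} -log f(t x) / t using a large but finite t."""
    x = np.asarray(x, dtype=float)
    return neg_log_f(t * x) / t

x = np.array([0.3, -0.5, 0.2])
g_approx = gauge_from_density(neg_log_density_indep_laplace, x)
l1_norm = np.abs(x).sum()          # the exact gauge in this example
print(g_approx, l1_norm)
```

The discrepancy between the two printed values is $d\log(2)/t$, which vanishes as $t\to\infty$.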

Nolde (2014) and Nolde and Wadsworth (2022) illustrate that the shape of $\partial\mathcal{G}$, or equivalently $\mathcal{G}$, is directly related to extremal dependence of $\bm{X}$. Specifically, $\partial\mathcal{G}$ links together several representations for multivariate extremes: from $\partial\mathcal{G}$, we can determine all parameters associated with the models described in equations (1) and (2), as well as those proposed by Heffernan and Tawn (2004) and Simpson et al. (2020). Taking, e.g., the ADF, we have that $\lambda(\bm{w})=||\bm{w}||_{\infty}\times\mathfrak{r}_{\bm{w}}^{-1}$ for all $\bm{w}\in\mathcal{S}^{d-1}_{+}$, where $\mathfrak{r}_{\bm{w}}:=\max\{\mathfrak{r}\in[0,1]:\mathfrak{r}\mathcal{R}_{\bm{w}}\cap\partial\mathcal{G}\neq\emptyset\}$ is a $\bm{w}$-dependent coefficient used to scale back the set $\mathcal{R}_{\bm{w}}:=\bigotimes_{i=1,\ldots,d}[w_{i}/||\bm{w}||_{\infty},\infty]$ to intersect with $\partial\mathcal{G}$. Such parameters allow us to quantify extremal dependence of $\bm{X}$ in an interpretable manner. Furthermore, from a practical perspective, estimates of $\partial\mathcal{G}$ can be used to estimate extreme statistics; for example, joint tail probabilities (Wadsworth and Campbell, 2024), return curves (Murphy-Barltrop et al., 2024b), and return level sets (Papastathopoulos et al., 2024). The unit-level set $\partial\mathcal{G}$ therefore offers a high degree of practical utility for inference of multivariate extremes, and thus its accurate estimation is of particular importance.

Owing to this perspective, recent works have introduced techniques for the estimation of $\partial\mathcal{G}$, on both standard exponential and standard Laplace margins. For the former margin, Simpson and Tawn (2024a) and Majumder et al. (2024b) proposed semi-parametric techniques using generalised additive models (GAMs) and Bézier curves, respectively, to approximate $\partial\mathcal{G}$, while Wadsworth and Campbell (2024) used a parametric-copula-based model. For Laplace margins, Papastathopoulos et al. (2024) proposed a latent Gaussian model to approximate the shape of $\partial\mathcal{G}$, while Murphy-Barltrop et al. (2024a) used GAMs to estimate $\partial\mathcal{G}$ using the model introduced by Mackay and Jonathan (2023). Within these approaches, a wide range of statistical techniques have been considered in both Bayesian and frequentist settings, and it is possible to obtain accurate estimates of $\partial\mathcal{G}$ for a wide range of dependence structures. However, current techniques for modelling and estimating $\partial\mathcal{G}$ have several shortcomings: i) all existing approaches are limited to a low-dimensional setting ($d\leq 3$), ii) the restriction of several approaches to standard exponential margins offers a limited perspective for evaluating joint tail properties (see Figure 1), and iii) one must always specify parametric or semi-parametric forms for quantities related to $\partial\mathcal{G}$, or select a large number of tuning parameters. Therefore, the existing estimation techniques for $\partial\mathcal{G}$ offer limited practical utility, motivating novel developments.

Recent literature combining extreme value theory and deep learning has seen the construction of flexible, computationally-scalable models for univariate extremes (see, e.g., Pasche and Engelke, 2022; Richards and Huser, 2022; Cisneros et al., 2024) and generative models for multivariate and spatial extremes (Boulaguiem et al., 2022; Lafon et al., 2023; Zhang et al., 2023; Majumder et al., 2024a). While Hasan et al. (2022) used neural networks to build flexible models for asymptotically-dependent multivariate extremes, deep learning is yet to be exploited in the construction of models that can capture non-asymptotically dependent data structures. Here, we propose the first deep learning-based approach for modelling $\partial\mathcal{G}$, referred to hereafter as the DeepGauge framework. Our approach uses neural networks to perform inference on multivariate extremes, and represents a significant step towards flexible and robust models that require minimal parametric assumptions. The use of deep learning methods gives the DeepGauge framework a high degree of flexibility and, as we demonstrate in Sections 4 and 5, allows us to capture a wide variety of extremal dependence structures. Furthermore, it can be applied in higher dimensional settings ($d\geq 4$) than existing techniques for estimating $\partial\mathcal{G}$, and also requires fewer modelling assumptions.

The paper is organised as follows. Section 2 outlines the theory underpinning the DeepGauge framework, alongside novel theoretical results pertaining to the unit-level set $\partial\mathcal{G}$ and its estimation. Section 3 outlines our methodology for estimating $\partial\mathcal{G}$ and related quantities, with Section 3.3 detailing our neural network-based representation for geometric extremes and Section 3.5 introducing novel diagnostic tools for validating model fits in high dimensional settings. Section 4 provides a simulation study showcasing the efficacy of our framework for inferring the extremal dependence of random vectors. Section 5 provides an application to the NORA10 hindcast data set of meteorological and oceanographic (metocean) variables in the North Sea that exhibit complex dependence structures. We conclude in Section 6 with a discussion and outlook on future work.

2 Theoretical developments in Geometric Extremes

2.1 Overview of the angular-radial decomposition

To estimate the unit-level set $\partial\mathcal{G}$ and its corresponding gauge function $g(\cdot)$, we first decompose $\bm{X}$ into angular and radial components, and then model the radii conditional on a fixed angle. While one could select one of many radial-angular systems for this decomposition, we follow the recommendation of Murphy-Barltrop et al. (2024a) and define angular and radial components via the Euclidean norm. For any $\bm{X}\in\mathbb{R}^{d}\setminus\{\bm{0}_{d}\}$, with $\bm{0}_{d}:=(0,\ldots,0)^{T}$, define $(R,\bm{W})$ by $\bm{X}\mapsto(R,\bm{W}):=(\|\bm{X}\|,\bm{X}/\|\bm{X}\|)$ for $R>0$ and $\bm{W}\in\mathcal{S}^{d-1}$, where $\mathcal{S}^{d-1}:=\{\bm{x}\in\mathbb{R}^{d}:||\bm{x}||=1\}$ denotes the unit $(d-1)$-sphere. It follows that $\bm{X}=R\bm{W}$, which implies that $\bm{X}$ is completely determined by the joint behaviour of $R$ and $\bm{W}$. It is trivial to show that the mapping $t:\mathbb{R}^{d}\setminus\{\bm{0}_{d}\}\mapsto\mathbb{R}_{+}\times\mathcal{S}^{d-1}$, where $t(\bm{x}):=(||\bm{x}||,\bm{x}/||\bm{x}||)$, is bijective; thus, no information is lost through considering $(R,\bm{W})$. Loosely speaking, $R$ is the magnitude of an event, while $\bm{W}$ defines its direction, i.e., in which orthant of $\mathbb{R}^{d}$ the event occurs. Note that for any $d\in\mathbb{N}$ there exist $2^{d}$ possible orthants, as each component can be either positive or negative.
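The decomposition and its inverse are straightforward to compute. The following is a minimal sketch, assuming observations are stored as rows of a NumPy array; the function names `to_radial_angular` and `from_radial_angular` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def to_radial_angular(x):
    """Map rows of x (shape (n, d), none at the origin) to (r, w) = (||x||, x / ||x||)."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    r = np.linalg.norm(x, axis=1)
    if np.any(r == 0.0):
        raise ValueError("the decomposition is undefined at the origin")
    return r, x / r[:, None]

def from_radial_angular(r, w):
    """Inverse map: recover x = r * w row by row."""
    return np.asarray(r)[:, None] * np.asarray(w)

rng = np.random.default_rng(1)
x = rng.laplace(size=(5, 3))       # toy data with standard Laplace margins
r, w = to_radial_angular(x)
x_back = from_radial_angular(r, w)
```

Recovering `x` from `(r, w)` up to floating-point error reflects the bijectivity noted above.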

We now recall some theoretical properties of gauge functions and limit sets. The star-shapedness of the limit set $\mathcal{G}$ implies that, for any $\bm{x}\in\mathcal{G}$ and $t\in(0,1]$, we have $t\bm{x}\in\mathcal{G}$. Moreover, if $\bm{0}_{d}\in\mathcal{G}$ and $g(\bm{0}_{d})<1$, we have that the line segment $\{\bm{0}_{d}+t\bm{x}:t\in[0,1]\}\subset\mathcal{G}$ for any $\bm{x}\in\mathcal{G}$. Furthermore, one can show that the componentwise maxima and minima of $\mathcal{G}$ equal $\bm{1}_{d}$ and $-\bm{1}_{d}$ respectively, implying that $\mathcal{G}$ (and, thus, the unit-level set $\partial\mathcal{G}$) must touch all boundaries of the unit hypercube $[-1,1]^{d}$ at least once. We further note that the gauge function $g(\cdot)$ of $\mathcal{G}$ is 1-homogeneous, i.e., $g(t\bm{x})=tg(\bm{x})$ for any $\bm{x}\in\mathbb{R}^{d},t\in\mathbb{R}_{+}$.

Using the radial-angular decomposition of $\bm{X}$ and the star-shaped property of $\mathcal{G}$, we can reformulate the unit-level set as $\partial\mathcal{G}=\left\{r\bm{w}:r>0,\bm{w}\in\mathcal{S}^{d-1},g(r\bm{w})=1\right\}.$ Homogeneity of $g(\cdot)$ implies that, for any $\bm{w}\in\mathcal{S}^{d-1}$, the radial value of the corresponding point on the unit-level set must be $1/g(\bm{w})$; hence, $\partial\mathcal{G}=\left\{\bm{w}/g(\bm{w}):\bm{w}\in\mathcal{S}^{d-1}\right\}.$ This reformulation has the powerful implication that, to determine $\partial\mathcal{G}$, we need only to evaluate $g(\cdot)$ on $\mathcal{S}^{d-1}$. Illustrations of the radial-angular representations for the sets $\mathcal{G}$ and $\partial\mathcal{G}$ are given in Figure 3.
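The representation $\partial\mathcal{G}=\{\bm{w}/g(\bm{w})\}$ can be traced numerically on a grid of angles. The sketch below does this in $d=2$ for the $L^{1}$ gauge, which corresponds to independent Laplace components; this gauge and the helper names are illustrative choices of ours rather than part of the paper's methodology.

```python
import numpy as np

def l1_gauge(w):
    """Gauge of independent standard Laplace components (illustrative choice)."""
    return np.abs(w).sum(axis=-1)

def unit_level_set(gauge, angles):
    """Return the points w / g(w) for unit vectors w given as rows of `angles`."""
    return angles / gauge(angles)[:, None]

theta = np.linspace(0.0, 2.0 * np.pi, 400, endpoint=False)
w = np.column_stack([np.cos(theta), np.sin(theta)])   # grid on the unit circle
boundary = unit_level_set(l1_gauge, w)                # points on the unit-level set
```

By 1-homogeneity, evaluating the gauge at each returned point gives 1, and the points lie in $[-1,1]^{2}$ and touch each face of the square.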

2.2 Geometric extremes on Laplace margins

As noted in Section 1, we consider data on standard Laplace margins as this permits a more detailed description of joint tail behaviour. However, the theoretical results provided by Nolde and Wadsworth (2022), linking the unit-level set $\partial\mathcal{G}$ to a variety of modelling frameworks, are given only for random vectors on standard exponential margins. The next proposition illustrates that some of the same results also hold for data on Laplace margins.

Proposition 2.1.

Consider a random vector $\bm{X}\in\mathbb{R}^{d}\setminus\bm{0}_{d}$ with standard Laplace margins and gauge function $g(\cdot)$ . Let $\bm{X}_{E}$ denote the same vector with unit exponential margins, with gauge function $g_{E}(\cdot)$ . We have equality of the gauge functions for positive angles, that is, $g(\bm{w})=g_{E}(\bm{w})$ for all $\bm{w}\in\mathcal{S}^{d-1}_{+}$ .

Proof of Proposition 2.1 is provided in Appendix B.1.

Remark 1.

From Proposition 2.1, we immediately see that the sets $\left\{\bm{w}/g_{E}(\bm{w}):\bm{w}\in\mathcal{S}_{+}^{d-1}\right\}$ and $\left\{\bm{w}/g(\bm{w}):\bm{w}\in\mathcal{S}_{+}^{d-1}\right\}$ are identical, implying equality of the unit-level sets on the positive orthant. This has the strong implication that the results proposed by Nolde and Wadsworth (2022), which link the frameworks described in equations (1) and (2), as well as the model proposed by Simpson et al. (2020), to the unit-level set $\partial\mathcal{G}$, are also valid for random vectors with Laplace margins.

2.3 Constructing valid unit-level sets

We now consider estimation of the unit-level set $\partial\mathcal{G}$ or, equivalently, the gauge function $g(\cdot)$ on $\mathcal{S}^{d-1}$ . This requires estimates of the corresponding limit set, $\mathcal{G}$ , to have certain properties; see Section 2.1. In what follows, we show that such properties are easily satisfied via an appropriate estimator for $g(\cdot)$ . We begin by noting that $g(\cdot)$ must satisfy the constraint described in the following Proposition 2.2.

Proposition 2.2.

For all $\bm{w}\in\mathcal{S}^{d-1}$ , the gauge function $g(\cdot)$ satisfies the constraint that

g(\bm{w})\geq||\bm{w}||_{\infty},

where $||\bm{x}||_{\infty}:=\max\{|x_{1}|,\ldots,|x_{d}|\}$ denotes the infinity norm.

Proof.

Given $\bm{w}\in\mathcal{S}^{d-1}$, let $\bm{w}/g(\bm{w})\in\partial\mathcal{G}$ denote the corresponding point on the unit-level set. Since $\partial\mathcal{G}\subset[-1,1]^{d}$, we must have $\max_{i=1,\dots,d}\{{w}_{i}\}/g(\bm{w})=\max_{i=1,\dots,d}\{{w}_{i}/g(\bm{w})\}\leq 1,$ which implies that $\max_{i=1,\dots,d}\{{w}_{i}\}\leq g(\bm{w})$. Similarly for the vector minima, we obtain $\max_{i=1,\dots,d}\{-{w}_{i}\}=-\min_{i=1,\dots,d}\{{w}_{i}\}\leq g(\bm{w})$. Considering each component of $\bm{w}$ gives $g(\bm{w})\geq\max(w_{i},-w_{i})$ for all $i=1,\ldots,d$; hence, $g(\bm{w})\geq||\bm{w}||_{\infty}$. ∎
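The bound in Proposition 2.2 can be checked numerically for a specific gauge function. The sketch below assumes the commonly quoted closed form for the bivariate Gaussian-copula gauge on standard Laplace margins, $g(x_{1},x_{2})=(1-\rho^{2})^{-1}\{|x_{1}|+|x_{2}|-2\rho\,\mathrm{sgn}(x_{1})\mathrm{sgn}(x_{2})\sqrt{|x_{1}x_{2}|}\}$; this expression is an assumption on our part (the copulas used in the paper are defined in Section 4), and the function name is hypothetical.

```python
import numpy as np

def gaussian_gauge(x, rho):
    """Assumed bivariate Gaussian-copula gauge on standard Laplace margins."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    a = np.abs(x)
    s = np.sign(x[:, 0]) * np.sign(x[:, 1])
    cross = 2.0 * rho * s * np.sqrt(a[:, 0] * a[:, 1])
    return (a[:, 0] + a[:, 1] - cross) / (1.0 - rho**2)

theta = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
w = np.column_stack([np.cos(theta), np.sin(theta)])
g = gaussian_gauge(w, rho=-0.5)
sup_norm = np.abs(w).max(axis=1)
slack = g - sup_norm               # non-negative under Proposition 2.2
print(slack.min())
```

On this grid the slack is non-negative everywhere and close to zero at a few angles, which are the directions where the unit-level set touches the boundary of $[-1,1]^{2}$.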

Ignoring this constraint may lead to estimates of $\mathcal{G}$ not contained within $[-1,1]^{d}$. Such estimates lack any theoretical interpretation and must undergo a user-specified rescaling to be of use; see, e.g., Papastathopoulos et al. (2024) or Murphy-Barltrop et al. (2024a). We exploit Proposition 2.2 to ensure that our model does not suffer from this problem and can be designed to always provide valid estimates of limit sets. Now consider any continuous radial function $h:\mathcal{S}^{d-1}\mapsto\mathbb{R}_{+}$. The following proposition holds.

Proposition 2.3.

Suppose $h(\cdot):\mathcal{S}^{d-1}\mapsto\mathbb{R}_{+}$ satisfies $1/h(\bm{w})\geq||\bm{w}||_{\infty}$ for all $\bm{w}\in\mathcal{S}^{d-1}$ , and define the set

\mathcal{H}:=\left\{\bm{x}\in\mathbb{R}^{d}\setminus\{\bm{0}_{d}\}\;\bigg|\;||\bm{x}||\leq h(\bm{x}/||\bm{x}||)\right\}\bigcup\bigg\{\bm{0}_{d}\bigg\}.

Then $\mathcal{H}$ is star-shaped and satisfies $\mathcal{H}\subset[-1,1]^{d}$ . Moreover, $\mathcal{H}$ is compact.

Proof of Proposition 2.3 is provided in Appendix B.2.

Proposition 2.3 implies that we can design models for $\mathcal{G}$ that satisfy (some of) the validity properties of limit sets by starting with any (potentially arbitrary) continuous radial function $h(\cdot)$ that satisfies $1/h(\bm{w})\geq||\bm{w}||_{\infty}$ ; the corresponding set $\mathcal{H}$ will be a subset of $[-1,1]^{d}$ , star-shaped, and compact. Whilst $\mathcal{H}$ must be a subset of the unit hyper-cube, it is not guaranteed to intersect with its boundary in each component, or even at all; the same is true of the boundary set $\partial\mathcal{H}$ . Since this is a requirement of valid limit and unit-level sets, we propose a rescaling procedure to ensure it is always satisfied for our model.

Observe that the boundary of $\mathcal{H}$ is given by $\partial\mathcal{H}=\left\{\bm{w}h(\bm{w}):\bm{w}\in\mathcal{S}^{d-1}\right\}$ . For each $i=1,\ldots,d$ , we define $b_{i}(w_{i}):=\mathbbm{1}(w_{i}\geq 0)b_{i}^{U}-\mathbbm{1}(w_{i}<0)b_{i}^{L}>0$ , where

b_{i}^{U}:=\max\left\{w_{i}h(\bm{w})\mid\bm{w}\in\mathcal{S}^{d-1}\right\}>0,\quad\text{and}\quad b_{i}^{L}:=\min\left\{w_{i}h(\bm{w})\mid\bm{w}\in\mathcal{S}^{d-1}\right\}<0,

(4)

for all $i=1,\ldots,d$ . Using these scaling functions, we define the rescaled set

\widetilde{\partial\mathcal{H}}:=\left\{h(\bm{w})\left(\frac{w_{1}}{b_{1}(w_{1})},\ldots,\frac{w_{d}}{b_{d}(w_{d})}\right)\bigg|\;\bm{w}\in\mathcal{S}^{d-1}\right\},

which satisfies the following Proposition 2.4.

Proposition 2.4.

The rescaled set $\widetilde{\partial\mathcal{H}}$ is in one-to-one correspondence with $\partial\mathcal{H}$ , satisfies $\widetilde{\partial\mathcal{H}}\subset[-1,1]^{d}$ , and has componentwise maxima and minima $\bm{1}_{d}$ and $-\bm{1}_{d}$ , respectively.
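To make the rescaling concrete, the following sketch applies it in $d=2$ to a hypothetical radial function. It is a grid-based approximation under stated assumptions: the maxima and minima in (4) are replaced by maxima and minima over a finite grid of angles, and the radial function $h(\theta)=0.6+0.2\cos(3\theta)$ is invented purely for illustration (it satisfies $1/h(\bm{w})\geq\|\bm{w}\|_{\infty}$ because $h\leq 0.8$).

```python
import numpy as np

def rescale_boundary(h_vals, w):
    """Rescale the points w * h(w) so that each coordinate attains both -1 and 1.

    h_vals: radial function evaluated at the rows of w (unit vectors).
    The scaling constants are grid approximations of b_i^U and b_i^L in (4).
    """
    pts = w * h_vals[:, None]                    # points on the boundary of H
    b_upper = pts.max(axis=0)                    # approximates b_i^U > 0
    b_lower = pts.min(axis=0)                    # approximates b_i^L < 0
    b = np.where(w >= 0.0, b_upper, -b_lower)    # b_i(w_i), always positive
    return pts / b

theta = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
w = np.column_stack([np.cos(theta), np.sin(theta)])
h_vals = 0.6 + 0.2 * np.cos(3.0 * theta)         # hypothetical radial function
rescaled = rescale_boundary(h_vals, w)
```

Before rescaling, the boundary points need not reach any face of the unit square; after rescaling, each coordinate has maximum 1 and minimum -1 on the grid, in line with Proposition 2.4.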

To map $\partial\mathcal{H}$ to $\widetilde{\partial\mathcal{H}}$ , we require the transformation described in the following lemma.

Lemma 2.1.

Let $\kappa:\mathcal{S}^{d-1}\mapsto\mathcal{S}^{d-1}$ denote the following mapping:

\kappa(\bm{w})=\left(\frac{w_{1}}{b_{1}(w_{1})},\ldots,\frac{w_{d}}{b_{d}(w_{d})}\right)\bigg/\bigg\lVert\left(\frac{w_{1}}{b_{1}(w_{1})},\ldots,\frac{w_{d}}{b_{d}(w_{d})}\right)\bigg\rVert.

Then $\kappa$ is a bijective mapping.

Lemma 2.1 is used to prove Proposition 2.4 in Appendix B.3. Proposition 2.4 permits a new construction of valid gauge functions, denoted by $\tilde{g}:\mathcal{S}^{d-1}\mapsto\mathbb{R}_{+}$ , with

\tilde{g}(\bm{w}):=1\bigg/\left\|h(\kappa^{-1}(\bm{w}))\left(\frac{\kappa^{-1}(\bm{w})_{1}}{b_{1}(\kappa^{-1}(\bm{w})_{1})},\ldots,\frac{\kappa^{-1}(\bm{w})_{d}}{b_{d}(\kappa^{-1}(\bm{w})_{d})}\right)\right\|.

(5)

Note that we term $\tilde{g}(\cdot)$ the rescaled gauge function. Since $h(\cdot)$ is continuous, we have that $\tilde{g}(\cdot)$ is also continuous. Moreover, as we show in the following corollaries, $\tilde{g}(\cdot)$ satisfies the theoretical properties required to produce valid limit sets.

Corollary 2.1.

The rescaled gauge function $\tilde{g}(\cdot)$ satisfies $\tilde{g}(\bm{w})\geq||\bm{w}||_{\infty}$ for all $\bm{w}\in\mathcal{S}^{d-1}$. Let $\bm{w}^{u,i}=\operatorname*{arg\,max}_{\bm{w}\in\mathcal{S}^{d-1}}\{w_{i}h(\bm{w})\}$ and $\bm{w}^{l,i}=\operatorname*{arg\,min}_{\bm{w}\in\mathcal{S}^{d-1}}\{w_{i}h(\bm{w})\}$ for $i=1,\dots,d$. Then $\tilde{g}(\kappa(\bm{w}^{u,i}))=||\kappa(\bm{w}^{u,i})||_{\infty}$ and $\tilde{g}(\kappa(\bm{w}^{l,i}))=||\kappa(\bm{w}^{l,i})||_{\infty}$ for each $i=1,\ldots,d$.

Corollary 2.2.

Let $h(\cdot)$ be any continuous radial function and define the corresponding rescaled gauge function by $\tilde{g}(\cdot)$ as in (5). The set

\widetilde{\mathcal{H}}:=\left\{\bm{x}\in\mathbb{R}^{d}\setminus\{\bm{0}_{d}\}\;\bigg|\;||\bm{x}||\leq\frac{1}{\tilde{g}(\bm{x}/||\bm{x}||)}\right\}\bigcup\bigg\{\bm{0}_{d}\bigg\},

is star-shaped, compact, and satisfies $\widetilde{\mathcal{H}}\subset[-1,1]^{d}$ . Furthermore, $\widetilde{\mathcal{H}}$ has componentwise maxima and minima $\bm{1}_{d}$ and $-\bm{1}_{d}$ , respectively.

Proof of Corollary 2.1 is provided in Appendix B.4. Proof of Corollary 2.2 follows directly from Propositions 2.3 and 2.4, and Corollary 2.1.

Remark 2.

Corollary 2.2 also implies that the boundary $\widetilde{\partial\mathcal{H}}$ associated with $\widetilde{\mathcal{H}}$ satisfies all of the required theoretical properties of valid unit-level sets.

For inference of the limit set, we exploit Corollary 2.2 to ensure our estimates of the limit (or unit-level) set have the required properties for validity; see Section 3.2. We note that an alternative rescaling was proposed by Papastathopoulos et al. (2024). However, this was applied post-hoc on an initial estimate of the gauge function via a two-step procedure, whereas our rescaling is performed during inference without the need for an additional step.

2.4 Extended Angular Dependence Function

Consider now any gauge function $g(\cdot)$ with limit and unit level sets $\mathcal{G}$ and $\partial\mathcal{G}$ respectively. Proposition 2.1 implies that valid estimates of $\partial\mathcal{G}$ can be used to immediately obtain parameter estimates for several existing modelling frameworks, providing information about the extremal dependence structure in the positive orthant of $\mathbb{R}^{d}$ . However, as noted in Section 1, our interest lies more generally in understanding the extremal dependence in all orthants of $\mathbb{R}^{d}$ . The next proposition shows that, with suitable scaling coefficients, the limit set in any orthant can be obtained by considering a rescaled random vector on the set $\mathcal{S}^{d-1}_{+}$ .

Proposition 2.5.

Given $\bm{w}\in\mathcal{S}^{d-1}$, let $\bm{c}:=(\varepsilon(w_{1}),\ldots,\varepsilon(w_{d}))^{T}$, where $\varepsilon(x)=1$ for $x\geq 0$ and $\varepsilon(x)=-1$ otherwise. For $\bm{c}\bm{X}:=(c_{1}X_{1},\ldots,c_{d}X_{d})^{T}$ and $\bm{c}\bm{w}:=(c_{1}w_{1},\ldots,c_{d}w_{d})^{T}\in\mathcal{S}^{d-1}_{+}$, we have $g(\bm{w})=g_{\bm{c}\bm{X}}(\bm{c}\bm{w})$, where $g_{\bm{c}\bm{X}}(\cdot)$ denotes the gauge function of $\bm{c}\bm{X}$.

Proof.

Letting $f_{\bm{c}\bm{X}}(\cdot)$ denote the joint density function of $\bm{c}\bm{X}$ , we have

g(\bm{w})=\lim_{t\to\infty}[-\log f_{\bm{X}}(t\bm{w})]/t=\lim_{t\to\infty}[-\log f_{\bm{c}\bm{X}}(t\bm{c}\bm{w})]/t=g_{\bm{c}\bm{X}}(\bm{c}\bm{w}),

as the absolute value of the Jacobian determinant of the transformation $\bm{c}\bm{X}\mapsto\bm{X}$ is equal to $1$. ∎

Remark 3.

Proposition 2.5 implies that, for any $\bm{c}\in\{-1,1\}^{d}$ that leads to the partitioning $\mathcal{S}^{d-1}_{\bm{c}}:=\{\bm{w}\in\mathcal{S}^{d-1}:\bm{c}=(\varepsilon(w_{1}),\ldots,\varepsilon(w_{d}))^{T}\}$ of the unit $(d-1)$-sphere, the sets ${\{\bm{w}/g(\bm{w}):\bm{w}\in\mathcal{S}^{d-1}_{\bm{c}}\}}$ and $\{\bm{w}/g_{\bm{c}\bm{X}}(\bm{c}\bm{w}):\bm{w}\in\mathcal{S}^{d-1}_{\bm{c}}\}$ are equal. Therefore, the unit-level set $\partial\mathcal{G}$ can be obtained by evaluating gauge functions for rescaled vectors on the set $\{\bm{w}\in\mathcal{S}^{d-1}:\min_{i=1,\dots,d}\{w_{i}\}\geq 0\}$.

Propositions 2.1 and 2.5 have implications when using the geometric representation to infer the ADF from (2). As this framework is given for vectors on standard exponential margins, careful treatment is required to define an analogous model for Laplace margins.

Proposition 2.6.

Given any angle $\bm{w}\in\mathcal{S}^{d-1}_{+}$ , assume that equation (2) holds for the random vector $\bm{X}_{E}$ . Then

\Pr\left(\min_{i\in\mathcal{D}}\{X_{i}/w_{i}\}>u\right)=L(e^{u};\bm{w})e^{-\lambda(\bm{w})u},\;\;u\to\infty.

Proof of Proposition 2.6 is provided in Appendix B.5. Combining Proposition 2.6 and model (2) allows us to assess joint tail behaviour in the positive orthant $\mathbb{R}^{d}_{+}$ through the ADF, $\lambda(\cdot)$ . We further extend this model to $\mathbb{R}^{d}$ by considering an extended ADF, denoted by $\Lambda(\cdot)$ . Specifically, given any $\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}$ , where $\mathcal{A}:=\bigcup_{i=1}^{d}\{\bm{w}\in\mathcal{S}^{d-1}:w_{i}=0\}$ is the intersection of $\mathcal{S}^{d-1}$ with each axis, we assume that

\Pr\left(\min_{i\in\mathcal{D}}\{X_{i}/w_{i}\}>u\right)=\mathcal{L}(e^{u};\bm{w})e^{-\Lambda(\bm{w})u},\;\;u\to\infty,

(6)

where $\mathcal{L}(\cdot;\bm{w})$ is a slowly varying function and $\Lambda(\cdot)$ is defined in Proposition 2.8. Note that the extended ADF corresponds with the copula exponent function introduced by Mackay and Jonathan (2023) for data on uniform margins. The next proposition illustrates that, under mild conditions, the convergence in (6) is always achieved.

Proposition 2.7.

Assume that the conditions of Proposition 2.6 are satisfied for any random vector $\bm{c}\bm{X}$ , where $\bm{c}\in\{-1,1\}^{d}$ . Then equation (6) holds for any $\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}$ .

Proof of Proposition 2.7 is provided in Appendix B.6. Note that while the assumptions for Proposition 2.7 may seem restrictive, Wadsworth and Tawn (2013) demonstrate through rigorous theoretical treatment that model (2) captures the joint tail structure of $\bm{X}_{E}$ for a wide variety of theoretical examples. Therefore, it is reasonable to assume this framework can capture the tail structure in every orthant.

Remark 4.

In the definition of our model (6), we purposely exclude any angles in $\mathcal{S}^{d-1}$ that intersect the axes, as the model is not well defined there. Take, e.g., $\bm{w}:=(1,0,\ldots,0)^{T}$ ; it is not clear whether one should consider the probability $\Pr(X_{1}>u,X_{i}>0,i=2,\ldots,d)$ or $\Pr(X_{1}>u,X_{i}<0,i=2,\ldots,d)$ under the modelling framework. This results in discontinuities for the extended ADF at the axes; see Figure 2.

On the original exponential margins, Nolde and Wadsworth (2022) show that the unit-level set $\partial\mathcal{G}$ is linked to the ADF; see Section 1. As demonstrated by Proposition 2.8, an analogous relationship holds for the extended ADF.

Proposition 2.8.

Suppose equation (6) holds for any angle $\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}$. Then $\Lambda(\bm{w})=||\bm{w}||_{\infty}\times\tilde{\mathfrak{r}}^{-1}_{\bm{w}}$, where $\tilde{\mathfrak{r}}_{\bm{w}}=\max\{\mathfrak{r}\in[0,1]:\mathfrak{r}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G}\neq\emptyset\}$ and $\tilde{\mathcal{R}}_{\bm{w}}:=\bigotimes_{i=1,\ldots,d}\mathcal{U}_{w_{i}}$, with $\mathcal{U}_{w_{i}}:=[w_{i}/||\bm{w}||_{\infty},\infty]$ for $w_{i}>0$ and $[-\infty,w_{i}/||\bm{w}||_{\infty}]$ for $w_{i}<0$.

Proof of Proposition 2.8 is provided in Appendix B.7, alongside the illustrative Figure A1. Proposition 2.8 illustrates that the extended ADF can be obtained directly from the unit-level set $\partial\mathcal{G}$ . Consequently, the extended ADF is linked to the gauge function, as demonstrated by the following corollary.
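Proposition 2.8 suggests a direct way to approximate the extended ADF from a discretised unit-level set. A point $\bm{x}$ lies in $\mathfrak{r}\tilde{\mathcal{R}}_{\bm{w}}$ exactly when $x_{i}/w_{i}\geq\mathfrak{r}/\|\bm{w}\|_{\infty}$ for every $i$, so $\Lambda(\bm{w})=1/\max_{\bm{x}\in\partial\mathcal{G}}\min_{i}(x_{i}/w_{i})$. The sketch below, which is ours and uses hypothetical names, applies this on a grid in $d=2$ for the $L^{1}$ gauge of independent Laplace components, where the extended ADF is known to equal $|w_{1}|+|w_{2}|$.

```python
import numpy as np

def extended_adf(boundary, w):
    """Grid approximation of Lambda(w) = 1 / max_x min_i (x_i / w_i).

    boundary: rows are points on the unit-level set; w: angle with no zero component.
    """
    w = np.asarray(w, dtype=float)
    if np.any(w == 0.0):
        raise ValueError("the extended ADF is not defined on the axes")
    ratios = boundary / w              # all entries positive only in the orthant of w
    return 1.0 / ratios.min(axis=1).max()

theta = np.linspace(0.0, 2.0 * np.pi, 20000, endpoint=False)
angles = np.column_stack([np.cos(theta), np.sin(theta)])
boundary = angles / np.abs(angles).sum(axis=1)[:, None]   # L1 gauge (independence)

w = np.array([np.cos(2.0), np.sin(2.0)])   # an angle in the second quadrant
lam = extended_adf(boundary, w)
print(lam, np.abs(w).sum())
```

Because the maximum is taken over a finite grid, the approximation slightly overestimates $\Lambda(\bm{w})$; the error shrinks as the grid is refined.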

Corollary 2.3.

Suppose model (6) holds for all $\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}$ . Then, we have that

g(\bm{w})\geq\Lambda(\bm{w})\geq||\bm{w}||_{\infty}.

Proof of Corollary 2.3 is provided in Appendix B.8, with a visualisation provided in Figure 2. The blue lines in each panel of Figure 2 denote the sets $\{\bm{w}/\Lambda(\bm{w}):\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}\}$ , corresponding to the intersection points $\{\tilde{\mathfrak{r}}_{\bm{w}}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G},\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}\}$ . Furthermore, from the first and third panels of Figure 2, one can clearly observe the discontinuities of the extended ADF at the axes, as discussed in Remark 4.

3 Inference

3.1 Overview

Here we describe our DeepGauge framework for modelling and estimating limit sets. Section 3.2 describes model assumptions for the conditional radii $R\mid(\bm{W}=\bm{w})$ , $\bm{w}\in\mathcal{S}^{d-1}$ , through which we obtain estimates of the unit-level set $\partial\mathcal{G}$ . Section 3.3 describes our DeepGauge representation of gauge functions using neural networks. Section 3.4 covers estimation of the extended ADF and its use in probability estimation. Section 3.5 concludes with diagnostic tools for assessing goodness-of-fit.

3.2 Modelling assumptions for the conditional radii

Wadsworth and Campbell, (2024) demonstrate that for large radial values $r$ , we have that $f_{R\mid\bm{W}}(r\mid\bm{w})\propto r^{d-1}\exp\{-rg(\bm{w})\}$ , where $f_{R\mid\bm{W}}$ denotes the density function of $R\mid\bm{W}$ . This implies that the upper tail of $R\mid(\bm{W}=\bm{w})$ should follow the form of a gamma kernel. Note that this holds for $\bm{X}$ on both exponential and Laplace margins. To accommodate this form, Wadsworth and Campbell, (2024) propose the modelling assumption:

R\mid(\bm{W}=\bm{w},R>r_{\tau}(\bm{w}))\sim\text{truncGamma}(\alpha,g(\bm{w})),

(7)

where ‘truncGamma’ denotes a truncated Gamma distribution with shape and rate parameters $\alpha>0$ and $g(\bm{w})>0$ , respectively, and $r_{\tau}(\bm{w})$ denotes the $\tau$ -quantile of $R\mid(\bm{W}=\bm{w})$ for some $\tau\in(0,1)$ close to 1, that is, $\Pr\{R\leq r_{\tau}(\bm{w})\mid\bm{W}=\bm{w}\}=\tau$ for all $\bm{w}\in\mathcal{S}^{d-1}$ . Using form (7), we can use maximum likelihood techniques to estimate the gauge function $g(\cdot)$ , which is viewed as a non-stationary rate parameter of the truncated gamma distribution.

Through rigorous theoretical treatment, Wadsworth and Campbell, (2024) show that equation (7) is a valid modelling assumption for a wide range of parametric copulas, including the three illustrated in Figure 2. For most examples, the true shape parameter is $\alpha=d$ . However, to increase the flexibility of their model, Wadsworth and Campbell, (2024) permit estimation of this parameter; the DeepGauge framework follows suit.

3.3 Estimation of the gauge function using neural networks

To overcome the limited flexibility of existing geometric modelling approaches, we model the rescaled gauge function ( $\tilde{g}$ ; see Corollary 2.1) using neural networks. Full inference requires the construction of two models: one for the radial threshold, $r_{\tau}(\bm{w})$ in (7), and one for an (unscaled) gauge function ${g}(\bm{w})$ satisfying ${g}(\bm{w})\geq||\bm{w}||_{\infty}$ for all $\bm{w}\in\mathcal{S}^{d-1}$ (see Proposition 2.2). With the latter, we can employ the transformation in Corollary 2.1 to produce a model for the rescaled gauge function, $\tilde{g}(\bm{w})$ , that provides a valid unit-level set.

We model the radial threshold $r_{\tau}:\mathcal{S}^{d-1}\mapsto\mathbb{R}_{+}$ using a multi-layer perceptron (MLP) with rectified linear unit (ReLU) activation functions, denoted by $m_{\bm{\psi}}:\mathcal{S}^{d-1}\mapsto\mathbb{R}_{+}$ and parameterised by the set $\bm{\psi}$ . For brevity, we provide details of the construction in Appendix A.1; see, also, Richards and Huser, (2024) for a review of these models and their inference. We similarly represent the (unscaled) gauge function ${g}(\bm{w})$ using an MLP. To ensure that ${g}(\cdot)$ satisfies ${g}(\bm{w})\geq\|\bm{w}\|_{\infty}$ for all $\bm{w}\in\mathcal{S}^{d-1}$ , we let ${g}(\bm{w})=\text{ReLU}\{m_{\bm{\psi}}(\bm{w})\}+\|\bm{w}\|_{\infty}$ , where ReLU( $\bm{x})=(\max\{x_{1},0\},\max\{x_{2},0\},\dots)^{T}$ for a vector $\bm{x}=(x_{1},x_{2},\dots)^{T}$ of finite (unspecified) length. Note that we share neither parameters nor architectures between the two neural networks determining $r_{\tau}(\bm{w})$ and ${g}(\bm{w})$ .
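To illustrate how the ReLU offset enforces the lower bound ${g}(\bm{w})\geq\|\bm{w}\|_{\infty}$ , the following is a minimal sketch in plain Python. The helper names (`init_mlp`, `mlp`, `gauge`), the layer widths, and the random weights are all assumptions made for illustration; this is not the architecture of Appendix A.1, and no training is performed.

```python
import random


def relu(v):
    """Component-wise ReLU of a list of floats."""
    return [max(x, 0.0) for x in v]


def init_mlp(sizes, rng):
    """Random weights for an MLP with layer widths `sizes` (illustrative only)."""
    return [
        ([[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)],
         [rng.gauss(0.0, 1.0) for _ in range(n_out)])
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def mlp(params, w):
    """Forward pass: ReLU on hidden layers, linear scalar output."""
    h = list(w)
    for k, (W, b) in enumerate(params):
        h = [sum(Wij * hj for Wij, hj in zip(row, h)) + bi for row, bi in zip(W, b)]
        if k < len(params) - 1:
            h = relu(h)
    return h[0]


def gauge(params, w):
    """g(w) = ReLU{m(w)} + ||w||_inf, so g(w) >= ||w||_inf by construction."""
    return max(mlp(params, w), 0.0) + max(abs(x) for x in w)


rng = random.Random(1)
params = init_mlp([3, 16, 16, 1], rng)   # hypothetical widths for d = 3
w = [0.6, -0.8, 0.0]
print(gauge(params, w) >= max(abs(x) for x in w))
```

Because the ReLU term is non-negative whatever the weights are, the bound holds at every stage of training and not only at the optimum.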

Transforming $g(\cdot)$ to $\tilde{g}(\cdot)$ requires evaluation of all scaling factors $b_{i}^{U}$ and $b_{i}^{L},i=1,\dots,d,$ in (4). We do so numerically using a sample of angles, denoted by $\mathcal{W}$ , that provides a dense coverage of the $(d-1)$ -sphere. In practice, we simulate $|\mathcal{W}|=10^{6}$ points using the rejection sampling algorithm of Neumann, (1951).
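A rejection sampler of this kind can be sketched as follows. The sketch assumes that $\mathcal{S}^{d-1}$ is the Euclidean unit sphere (consistent with the surface-area constant used in Section 4.2); the function name `sample_sphere` and the small sample size are illustrative choices.

```python
import math
import random


def sample_sphere(n, d, rng):
    """Draw n points uniformly on the unit (d-1)-sphere by rejection sampling.

    Propose uniformly on the cube [-1, 1]^d, keep proposals that fall inside
    the unit ball, and project the survivors onto the sphere.
    """
    points = []
    while len(points) < n:
        x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        if 0.0 < norm <= 1.0:
            points.append([xi / norm for xi in x])
    return points


W = sample_sphere(1000, 3, random.Random(0))
```

The acceptance rate equals the ratio of the unit-ball volume to the cube volume, which falls quickly with dimension (roughly 1.6% for $d=8$ ), so drawing $10^{6}$ points becomes increasingly costly as $d$ grows.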

We fit or train the neural network $r_{\tau}(\bm{w})$ via minimisation of some loss function, denoted by $\ell(r,r_{\tau})$ . Consider the set $\{(r_{j},\bm{w}_{j})\}_{j=1}^{n}$ , where $r_{j}$ and $\bm{w}_{j}$ are observations of $R$ and $\bm{W}$ , respectively. Optimal estimates of the neural network parameters, denoted $\widehat{\bm{\psi}}$ , can be found by solving the minimisation problem

\widehat{\bm{\psi}}\in\operatorname*{arg\,min}\limits_{\bm{\psi}}\frac{1}{n}\sum^{n}_{j=1}\ell(r_{j},r_{\tau}(\bm{w}_{j})),

(8)

where we have suppressed the dependency of $r_{\tau}(\cdot)$ on $\bm{\psi}$ . As $r_{\tau}(\bm{w})$ denotes the $\tau$ -quantile of $R\mid(\bm{W}=\bm{w})$ , the most appropriate choice of loss function is the tilted loss, given by $\ell(r,r_{\tau})=\rho_{\tau}(r-r_{\tau})$ for $\rho_{\tau}(z):=z(\tau-\mathbbm{1}\{z<0\})$ (Koenker et al.,, 2017). We can also define a suitable loss function and minimisation problem to train the deep representation of the rescaled gauge function, $\tilde{g}(\bm{w})$ . This model can be described as a conditional density estimation network (see, e.g., Rothfuss et al.,, 2019), where the loss function is the negative log-likelihood associated with the truncated gamma model defined in (7); note that this is dependent on both the exceedance threshold $r_{\tau}(\bm{w}_{j})$ and the shape parameter, $\alpha>0$ . To fit the gauge function model, we replace $\ell(\cdot,\cdot)$ in (8) with

\ell\{r_{j},\tilde{g}(\bm{w}_{j}),\alpha;r_{\tau}(\bm{w}_{j})\}=-\mathbbm{1}\{r_{j}>r_{\tau}(\bm{w}_{j})\}\big[\alpha\log\{\tilde{g}(\bm{w}_{j})\}+(\alpha-1)\log(r_{j})-r_{j}\tilde{g}(\bm{w}_{j})-\log\{\Gamma(\alpha)\}-\log\{Q(\alpha,\tilde{g}(\bm{w}_{j})r_{\tau}(\bm{w}_{j}))\}\big],

(9)

where $Q(\alpha,z)=\Gamma(\alpha,z)/\Gamma(\alpha)$ is the regularised upper incomplete gamma function, with $\Gamma(\alpha,z)=\int^{\infty}_{z}t^{\alpha-1}\exp(-t)\,dt$ , and $\tilde{g}(\cdot)$ is the rescaled gauge function described above. Note that, through an abuse of notation, we have suppressed dependency of $\bm{\psi}$ on $\alpha$ , but this parameter is estimated concurrently with the parameters that comprise the neural network.
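For a single observation, the truncated gamma loss can be sketched in plain Python as below. To stay within the standard library, the sketch assumes a positive integer shape $\alpha$ (for example, $\alpha=d$ ), for which the regularised upper incomplete gamma function has the closed form $Q(\alpha,z)=e^{-z}\sum_{k=0}^{\alpha-1}z^{k}/k!$ ; a general $\alpha>0$ would require a numerical incomplete gamma routine. The function names are illustrative.

```python
import math


def upper_gamma_reg(alpha, z):
    """Q(alpha, z) = Gamma(alpha, z) / Gamma(alpha) for positive integer alpha."""
    term, total = 1.0, 1.0
    for k in range(1, alpha):
        term *= z / k          # term is now z^k / k!
        total += term
    return math.exp(-z) * total


def trunc_gamma_nll(r, g, alpha, r_tau):
    """Negative log-likelihood of one radius; zero when r does not exceed r_tau."""
    if r <= r_tau:
        return 0.0
    log_density = (alpha * math.log(g) + (alpha - 1) * math.log(r) - r * g
                   - math.lgamma(alpha)
                   - math.log(upper_gamma_reg(alpha, g * r_tau)))
    return -log_density
```

Averaging `trunc_gamma_nll` over the sample, with `g` evaluated at each observed angle, gives the objective minimised over the network parameters and $\alpha$ .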

Full inference for our framework is performed in a two-stage fashion. We first train a model for $r_{\tau}(\bm{w})$ and derive its estimate, $\hat{r}_{\tau}(\bm{w})$ . This estimate is then used in a subsequent training step for the rescaled gauge function $\tilde{g}(\cdot)$ and $\alpha$ , by replacing $r_{\tau}(\bm{w})$ in (9) with its estimated counterpart. For details of the algorithms and regularisation techniques used for training of both models, as well as practical advice for pre-training (Goodfellow et al.,, 2016) of the models using initial estimates for $r_{\tau}(\cdot)$ , see Appendix A.2.
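The tilted loss driving the first stage can be illustrated with a toy example in which the threshold is a single constant rather than a neural network; the data, the candidate grid, and the function names below are assumptions made purely for illustration.

```python
def tilted_loss(r, r_tau, tau):
    """rho_tau(r - r_tau), with rho_tau(z) = z * (tau - 1{z < 0})."""
    z = r - r_tau
    return z * (tau - (1.0 if z < 0 else 0.0))


def mean_tilted_loss(radii, threshold, tau):
    """Empirical risk in (8) for a constant threshold."""
    return sum(tilted_loss(r, threshold, tau) for r in radii) / len(radii)


radii = [float(k) for k in range(1, 101)]       # toy radii 1, ..., 100
candidates = [k / 2 for k in range(2, 201)]     # constant thresholds 1.0, 1.5, ..., 100.0
best = min(candidates, key=lambda c: mean_tilted_loss(radii, c, 0.75))
print(best)
```

The minimiser lands on an empirical 0.75-quantile of the toy radii, which is the property that makes the tilted loss suitable for estimating $r_{\tau}(\bm{w})$ .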

3.4 Estimating the extended ADF

We now adapt the approach of Simpson and Tawn, 2024a , for estimation of the classical ADF, to permit estimation of our extended ADF, $\Lambda(\cdot)$ . First, given a sample of angles $\{\bm{w}_{j}\}_{j=1}^{n}$ , we define the corresponding point estimates for the unit-level set by $\{\tilde{\bm{x}}_{j}\}^{n}_{j=1}$ where ${\tilde{\bm{x}}_{j}:=(\tilde{x}_{j,1},\ldots,\tilde{x}_{j,d})^{T}=\bm{w}_{j}/\tilde{g}(\bm{w}_{j})\in\widetilde{\partial\mathcal{G}}}$ for $j=1,\ldots,n$ . Then, for any angle $\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}$ at which we wish to estimate $\Lambda(\bm{w})$ , we consider the sample $\{\tilde{\bm{x}}_{j}\}^{n}_{j=1}$ to be candidates for the intersection of $\widetilde{\partial\mathcal{G}}$ and the scaled-back set $\tilde{\mathcal{R}}_{\bm{w}}$ (see Proposition 2.8). The corresponding scaling coefficient $\tilde{\mathfrak{r}}_{\bm{w}}$ must satisfy $\tilde{\mathfrak{r}}_{\bm{w}}\geq\tilde{\mathfrak{r}}_{j}$ for all $j=1,\dots,n$ , where $\tilde{\mathfrak{r}}_{j}:=||\bm{w}||_{\infty}\min_{i}\{\tilde{x}_{j,i}/w_{i}\}$ ; hence, we approximate $\tilde{\mathfrak{r}}_{\bm{w}}$ as $\tilde{\mathfrak{r}}_{\bm{w}}\approx\max_{j=1,\dots,n}\{\tilde{\mathfrak{r}}_{j}\}$ . Recalling Proposition 2.8, it follows that an estimate $\hat{\Lambda}(\bm{w})$ of the extended ADF at $\bm{w}$ is $\hat{\Lambda}(\bm{w})=||\bm{w}||_{\infty}/\max_{j}\{\tilde{\mathfrak{r}}_{j}\}$ ; see Simpson and Tawn, 2024a for further details. Furthermore, since the estimated unit-level set $\widetilde{\partial\mathcal{G}}$ satisfies all of the required theoretical properties for $\partial\mathcal{G}$ , the resulting estimate $\hat{\Lambda}(\cdot)$ must also satisfy the properties of the extended ADF.
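The estimator above can be sketched in a few lines. As a toy check, the example uses the gauge $g(\bm{x})=|x_{1}|+|x_{2}|$ (independent Laplace margins in $d=2$ ), for which the construction in Proposition 2.8 gives $\Lambda(\bm{w})=|w_{1}|+|w_{2}|$ ; the function name and the evenly spaced grid of angles are assumptions for illustration.

```python
import math


def extended_adf(w, level_set_points):
    """Estimate Lambda(w) from points on the (estimated) unit-level set.

    Assumes every component of w is non-zero, i.e. w lies off the axes.
    """
    w_inf = max(abs(wi) for wi in w)
    r_max = max(
        w_inf * min(xi / wi for xi, wi in zip(x, w))
        for x in level_set_points
    )
    return w_inf / r_max


# Toy unit-level set in d = 2: the L1 unit sphere, i.e. points w_j / g(w_j)
# with g(x) = |x_1| + |x_2| and w_j on an even grid of the unit circle.
n = 20000
points = []
for k in range(n):
    theta = 2.0 * math.pi * k / n
    c, s = math.cos(theta), math.sin(theta)
    points.append((c / (abs(c) + abs(s)), s / (abs(c) + abs(s))))

lam = extended_adf((0.6, 0.8), points)   # close to 0.6 + 0.8 = 1.4
```

Since the maximum is taken over a finite sample, the estimate approaches the true value from above as the angular sample becomes denser.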

Estimates $\hat{\Lambda}(\bm{w}),\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A},$ can be used to obtain probability estimates for a wide variety of joint tail regions of $\bm{X}$ . The estimation scheme outlined below avoids the need to sample from, or model, the distribution of the angles $\bm{W}$ , as has been considered in other probability estimation schemes using the geometric representation (Papastathopoulos et al.,, 2024; Wadsworth and Campbell,, 2024). This feature is practically advantageous as estimation of the angular distribution alongside the limit set introduces additional modelling uncertainty into our framework, and many existing techniques for non-parametric density estimation are only applicable in low dimensional settings (Ruzgas et al.,, 2021).

To begin, let $\bm{x}\in\mathbb{R}^{d}$ be such that $r:=||\bm{x}||$ is large and define the corresponding angle $\bm{w}:=\bm{x}/r$ . Further define the univariate structure variable $T_{\bm{w}}:=\min_{i=1,\dots,d}\{X_{i}/w_{i}\}$ , and let $u$ denote a quantile of $T_{\bm{w}}$ satisfying $q:=\Pr(T_{\bm{w}}\leq u)<\Pr(T_{\bm{w}}\leq r)$ , with $q$ close to 1 and $u>0$ large. Equation (6) implies that

\Pr(\text{sgn}(x_{i})X_{i}>\text{sgn}(x_{i})x_{i},i=1,\ldots,d)=\Pr(T_{\bm{w}}>r)

(10)

=\Pr(T_{\bm{w}}>r\mid T_{\bm{w}}>u)\Pr(T_{\bm{w}}>u)\approx\exp\{-\hat{\Lambda}(\bm{w})(r-u)\}(1-q).

(11)

Figure 4 illustrates examples of joint tail regions in the two dimensional setting that can be estimated using the framework described in equation (6). Observe that these regions are more general than those considered by many inference procedures for multivariate extremes, which focus only on the joint survivor or cumulative distribution function (e.g., Ledford and Tawn,, 1996; Ramos and Ledford,, 2009; Cooley et al.,, 2019).
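The probability approximation in (11) reduces to a one-line computation once $\hat{\Lambda}(\bm{w})$ , $u$ , and $q$ are available. The sketch below assumes the Euclidean norm for $r=||\bm{x}||$ and uses illustrative function names; in practice, $u$ would be taken as the empirical $q$ -quantile of the structure variable over the sample.

```python
import math


def structure_variable(x_obs, w):
    """T_w = min_i X_i / w_i for one observation (all w_i non-zero)."""
    return min(xi / wi for xi, wi in zip(x_obs, w))


def tail_probability(x, lam_hat, u, q):
    """Approximation (11): exp{-Lambda(w) (r - u)} (1 - q), for r = ||x|| > u."""
    r = math.sqrt(sum(xi * xi for xi in x))
    return math.exp(-lam_hat * (r - u)) * (1.0 - q)


# The point x = (3, -4) has r = 5 and angle w = (0.6, -0.8); the event of
# interest is {X_1 > 3, X_2 < -4}, which is the same as {T_w > 5}.
p_hat = tail_probability((3.0, -4.0), lam_hat=1.4, u=2.0, q=0.999)
```

The values of `lam_hat`, `u`, and `q` in the usage line are placeholders rather than estimates from any data set.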

3.5 Assessing goodness-of-fit for DeepGauge model fits

We now discuss several diagnostic tools for assessing goodness-of-fit under the DeepGauge modelling framework, which we utilise for the case study in Section 5. The first three tools correspond to quantile-quantile (QQ) plots on a unit exponential scale. Comparing quantiles on this scale is common within extreme value analyses, as it allows one to assess how well a given model captures the most extreme observations (see, e.g., Coles,, 2001; Heffernan and Tawn,, 2001). The final two tools described below are for visualisation of unit-level sets and the extended ADF in low ( $d\leq 3$ ) and high ( $d>3$ ) dimensional settings.

Truncated gamma QQ plot:

We consider the diagnostic proposed by Wadsworth and Campbell, (2024) to assess the validity of the modelling assumption described in equation (7). Here, the fitted truncated gamma model is used to transform all of the threshold-exceeding observations of $R\mid R>r_{\tau}(\bm{w})$ to a unit exponential scale, and we compare the observed and theoretical quantiles via a QQ plot. This provides a global diagnostic for the model fit over the entire angular domain.

Extended ADF diagnostic:

We adapt the diagnostic proposed by Murphy-Barltrop et al., 2024b to assess goodness-of-fit with respect to the estimated extended ADF. Note that equation (6) can be reformulated as follows: given any angle $\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}$ , we have that ${T^{u}_{\bm{w}}\sim\text{Exp}(\Lambda(\bm{w}))}$ as $u\to\infty$ , for exceedances $T^{u}_{\bm{w}}:=(T_{\bm{w}}-u)\mid(T_{\bm{w}}>u)$ with $T_{\bm{w}}$ defined in Section 3.4. Therefore, for sufficiently large $u$ , the random variable $E:=-\log[1-(1-\exp\{-\Lambda(\bm{w})T^{u}_{\bm{w}}\})]=\Lambda(\bm{w})T^{u}_{\bm{w}}$ follows a unit exponential distribution, independent of the choice of $\bm{w}$ . Murphy-Barltrop et al., 2024b exploit this fact and compute min-projection exceedances over the angular observations, before transforming these exceedances to the standard exponential scale. One can then evaluate the alignment of the exponential quantiles on a QQ plot. This results in a global diagnostic and allows us to test the validity of the modelling assumption described in equation (6). The corresponding algorithm for computing the diagnostic is given in Appendix B.9.
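For a single angle, the transformation to the unit exponential scale can be sketched as follows. This is not the pooled algorithm of Appendix B.9: the sketch handles one angle only, and the plotting positions $(k-0.5)/m$ and the function name are assumptions made for illustration.

```python
import math


def exp_qq_pairs(t_values, u, lam_hat):
    """Return (theoretical, observed) unit-exponential quantile pairs.

    Exceedances of T_w above u are rescaled by the estimated extended ADF,
    E = Lambda(w) * (T_w - u), and sorted for comparison with Exp(1) quantiles.
    """
    e = sorted(lam_hat * (t - u) for t in t_values if t > u)
    m = len(e)
    theory = [-math.log(1.0 - (k - 0.5) / m) for k in range(1, m + 1)]
    return list(zip(theory, e))
```

Points lying close to the diagonal in a plot of these pairs indicate that the exponential tail assumption is reasonable at the chosen angle and threshold.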

Return level sets and probabilities:

Using our fitted model, we can obtain estimates of return level sets, which provide a summary of the joint extremal behaviour of random vectors and are used extensively in practice for design sensitivity analysis (see, e.g., Mackay and Haselsteiner,, 2021; Papastathopoulos et al.,, 2024; Simpson and Tawn, 2024b, ). For a probability $p$ , a return level set is defined as any region $\mathcal{B}\subset\mathbb{R}^{d}$ satisfying $\Pr(\bm{X}\in\mathcal{B})=p$ . When $\mathcal{B}$ is centred at the origin $\bm{0}_{d}$ , it can be obtained directly from the modelling framework outlined in Section 3.2; for each angle $\bm{w}\in\mathcal{S}^{d-1}$ , set

r_{p}(\bm{w})=F^{-1}_{\Gamma}\left[\left(\frac{p-\tau}{1-\tau}\right)\times\bar{F}_{\Gamma}\left(r_{\tau}\left(\bm{w}\right);\alpha,g\left(\bm{w}\right)\right)+F_{\Gamma}\left(r_{\tau}\left(\bm{w}\right);\alpha,g\left(\bm{w}\right)\right);\alpha,g\left(\bm{w}\right)\right],

where $F_{\Gamma}(\cdot)$ and $\bar{F}_{\Gamma}(\cdot):=1-F_{\Gamma}(\cdot)$ denote the gamma distribution and survival functions, respectively. This corresponds to the estimated $p$ -quantile of the conditional distribution $R\mid(\bm{W}=\bm{w})$ under model (7). Using these radial quantiles, define the Cartesian set

\mathcal{B}_{p}:=\left\{\bm{x}\in\mathbb{R}^{d}\setminus\{\bm{0}_{d}\}:||\bm{x}||\leq r_{p}\left(\bm{x}/||\bm{x}||\right)\right\}\bigcup\big\{\bm{0}_{d}\big\}.

Assuming unbiased estimation, the law of total probability implies that $\Pr(\bm{X}\in\mathcal{B}_{p})=p$ (Papastathopoulos et al.,, 2024). To assess the accuracy of return level set estimates, we propose the following diagnostic. First, observe that for any observation $\bm{x}_{j}\neq\bm{0}_{d}$ , we have $\bm{x}_{j}\in\mathcal{B}_{p}\iff||\bm{x}_{j}||\leq r_{p}(\bm{x}_{j}/||\bm{x}_{j}||)$ . Consequently, for a sample of non-zero observations $\{\bm{x}_{j}\}^{n}_{j=1}$ , an empirical estimate $\hat{p}$ of $\Pr(\bm{X}\in\mathcal{B}_{p})$ is $\hat{p}:=(1/n)\sum_{j=1}^{n}\mathbbm{1}(||\bm{x}_{j}||\leq r_{p}(\bm{x}_{j}/||\bm{x}_{j}||))$ . Uncertainty intervals for $\hat{p}$ can be obtained through bootstrapping the data sample. Our diagnostic is then to plot the pairs $(-\log(1-p),-\log(1-\hat{p}))$ , alongside tolerance bounds, for a subset of increasing probabilities close to 1. This approach for assessing goodness-of-fit provides a multivariate extension of ‘return level plots’; see, e.g., Coles, (2001).
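The radial quantile $r_{p}(\bm{w})$ and the empirical coverage $\hat{p}$ can be sketched as below. To avoid external dependencies, the sketch assumes a positive integer shape $\alpha$ , so that the gamma distribution function has a closed form, and inverts it by bisection; it also assumes the Euclidean norm. All function names are illustrative.

```python
import math


def gamma_cdf(r, alpha, rate):
    """Gamma distribution function for positive integer shape alpha."""
    z, term, total = rate * r, 1.0, 1.0
    for k in range(1, alpha):
        term *= z / k
        total += term
    return 1.0 - math.exp(-z) * total


def gamma_quantile(prob, alpha, rate):
    """Invert gamma_cdf by bisection, for 0 < prob < 1."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, alpha, rate) < prob:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma_cdf(mid, alpha, rate) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def return_level_radius(p, tau, r_tau, alpha, g):
    """Radial p-quantile under the truncated gamma model, for tau <= p < 1."""
    f_tau = gamma_cdf(r_tau, alpha, g)
    target = (p - tau) / (1.0 - tau) * (1.0 - f_tau) + f_tau
    return gamma_quantile(target, alpha, g)


def empirical_coverage(observations, radius_fn):
    """p-hat: share of non-zero observations x with ||x|| <= r_p(x / ||x||)."""
    hits = 0
    for x in observations:
        norm = math.sqrt(sum(xi * xi for xi in x))
        hits += norm <= radius_fn([xi / norm for xi in x])
    return hits / len(observations)
```

Here `radius_fn` stands for the map $\bm{w}\mapsto r_{p}(\bm{w})$ built from the fitted threshold and gauge models; setting $p=\tau$ returns the threshold $r_{\tau}(\bm{w})$ itself, which provides a quick sanity check.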

Three-dimensional unit-level and extended ADF sets:

For $d\leq 3$ , we can plot the scaled sample clouds $\{\bm{x}_{j}/\log(n/2):j=1,\ldots,n\}$ against the estimated unit-level and extended ADF sets; see, e.g., Figure 2. For large enough $n$ , we would expect the scaled sample clouds to lie approximately within the interiors of the estimated unit-level and extended ADF sets. Plotting scaled observations against the estimated unit-level sets has also been used for validation by, e.g., Majumder et al., 2024b ; Papastathopoulos et al., (2024). In addition to being a visual indicator of how well the estimated shapes capture the complex extremal dependence features of data, one can also use these plots to verify that the estimated unit-level sets satisfy all of the theoretical properties discussed in Section 1.

Bivariate unit-level set slices:

For $d\geq 4$ , we cannot visualise unit-level sets. Furthermore, interpretation is challenging for $d=3$ unless one can freely alter the perspective angle, e.g., using computational software. To account for this shortcoming, we propose considering bivariate slices of the estimated unit-level sets. Specifically, given indices $(\mathfrak{i},\mathfrak{j}),$ with $1\leq\mathfrak{i}<\mathfrak{j}\leq d$ , consider the set of points

\widetilde{\partial\mathcal{G}}_{\mathfrak{i},\mathfrak{j}}:=\left\{(w_{\mathfrak{i}},w_{\mathfrak{j}})/\tilde{g}(\bm{w}):\bm{w}\in\mathcal{S}^{d-1},w_{k}=0,k\in\mathcal{D}\setminus\{\mathfrak{i},\mathfrak{j}\}\right\}\subset[-1,1]^{2}.

It is important to note that $\widetilde{\partial\mathcal{G}}_{\mathfrak{i},\mathfrak{j}}$ is not the bivariate unit-level set for the vector $(X_{\mathfrak{i}},X_{\mathfrak{j}})$ ; rather, $\widetilde{\partial\mathcal{G}}_{\mathfrak{i},\mathfrak{j}}$ is a bivariate projection from the subset of $\widetilde{\partial\mathcal{G}}$ for which all angles indexed by $\mathcal{D}\setminus\{\mathfrak{i},\mathfrak{j}\}$ are equal to 0. This projection is illustrated in Figure 5 for a three-dimensional unit-level set with $(\mathfrak{i},\mathfrak{j})=(1,2)$ .

In practice, we observe very few angles in the region $\{\bm{w}\in\mathcal{S}^{d-1}:w_{k}=0,k\in\mathcal{D}\setminus\{\mathfrak{i},\mathfrak{j}\}\}$ ; however, we will typically observe a significant number of angular observations that lie close to this region. Moreover, note that for any observation $\bm{x}$ with corresponding angular observation $\bm{w}$ , we have $w_{k}=0\iff x_{k}/\log(n/2)=0$ for any $k\in\mathcal{D}$ . Therefore, we plot, alongside $\widetilde{\partial\mathcal{G}}_{\mathfrak{i},\mathfrak{j}}$ , all observations of the scaled bivariate sample clouds for which $||\bm{x}_{-\{\mathfrak{i},\mathfrak{j}\}}||/\log(n/2)\leq\epsilon$ , where $\bm{x}_{-\{\mathfrak{i},\mathfrak{j}\}}$ denotes the observation with its $\mathfrak{i}$ -th and $\mathfrak{j}$ -th components removed and $\epsilon>0$ denotes some small value. For sufficiently small $\epsilon$ and large enough $n$ , one would expect the scaled bivariate sample cloud to lie approximately within the interior of the estimated bivariate slice. Selection of $\epsilon$ is considered in Section 5.
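The selection of validation points for a bivariate slice can be sketched as follows, assuming the Euclidean norm for the removed components. The function name is illustrative, and the indices are zero-based in the code whereas the text uses one-based indices.

```python
import math


def slice_sample(observations, i, j, eps):
    """Scaled (x_i, x_j) pairs whose remaining components are close to zero.

    Keeps observations with ||x_{-{i,j}}|| / log(n/2) <= eps and returns
    their i-th and j-th components divided by log(n/2); requires n > 2.
    """
    scale = math.log(len(observations) / 2.0)
    kept = []
    for x in observations:
        rest = [xk for k, xk in enumerate(x) if k not in (i, j)]
        if math.sqrt(sum(xk * xk for xk in rest)) / scale <= eps:
            kept.append((x[i] / scale, x[j] / scale))
    return kept
```

Smaller values of `eps` keep the retained points closer to the slice but leave fewer of them for validation, which is the trade-off behind the choice of $\epsilon$ .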

4 Simulation study

4.1 Overview

We conduct two simulation studies to investigate the efficacy of the DeepGauge framework for estimating the gauge function of three known copulas: Gaussian, Student-t, and logistic. Details of the copulas and their theoretical gauge functions are provided in Section 4.2.

Two studies are performed and their results presented in Section 4.3. Efficacy in both studies is quantified using the validation diagnostics described in Section 4.2. The first study considers, for a fixed quantile level $\tau$ and architecture, the effect of varying $d$ and $n$ on the accuracy of DeepGauge estimates of the gauge functions and exceedance probabilities; the second study considers, for a fixed dimension $d$ and sample size $n$ , the effect of hyper-parameter choice. In both studies, we perform $100$ experiments for every model specification. That is, for a single choice of $n$ , $d$ , copula, and hyper-parameter configuration, we simulate 100 data sets and apply the methodology, separately, to each. Performance metrics are then presented as the median and $95\%$ confidence intervals over all experiments. All models for both $r_{\tau}(\cdot)$ and $\tilde{g}(\cdot)$ are trained over 500 iterations with a mini-batch size of 1024, and the extended ADF is estimated with $u$ in (11) taken to be the $q=0.9995$ empirical quantile of the structure variable $T_{\bm{w}}$ . Early-stopping is used with a patience of $\Delta=5$ ; see Appendix A.2 for details.

4.2 Models and performance metrics

We consider three copulas with known gauge functions and limit sets: Gaussian, Student-t, and logistic. Their theoretical unit-level sets and gauge functions are provided by Papastathopoulos et al., (2024). Recall that we consider a random vector $\bm{X}$ with standard Laplace margins. If $\bm{X}$ also has a $d$ -variate Gaussian copula with positive-definite precision matrix $Q\in\mathbb{R}^{d\times d}$ , its gauge function is $g(\bm{x})=\left\{\text{sgn}(\bm{x})|\bm{x}|^{1/2}\right\}^{T}Q\left\{\text{sgn}(\bm{x})|\bm{x}|^{1/2}\right\}$ where sgn is applied component-wise. If $\bm{X}$ follows a $d$ -variate Student-t copula with $Q$ as before and degrees of freedom $\nu>0$ , its gauge function is $g(\bm{x})=-\nu^{-1}\sum^{d}_{i=1}|x_{i}|+(1+d\nu^{-1})\max_{i=1,\dots,d}|x_{i}|$ . Finally, let $\bm{X}$ follow a logistic copula, i.e., a Gumbel copula with dependence parameter $1/\theta$ for $\theta\in(0,1]$ . In this case, the form of the theoretical gauge function is cumbersome; see Appendix A.4 of Papastathopoulos et al., (2024). For the Gaussian and Student-t copulas, we randomly generate, for each $d$ , a valid precision matrix. However, we impose that the corresponding correlation matrices are ordered with respect to $d$ . That is, if we consider two copulas with dimension $d_{1}$ and $d_{2}$ , satisfying $d_{1}<d_{2}$ , then the $d_{1}\times d_{1}$ correlation matrix is a submatrix of the $d_{2}\times d_{2}$ correlation matrix. Note that the generated correlation matrices are kept fixed throughout all studies, and we further set $\nu=1$ and $\theta=0.3$ .
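The Gaussian and Student-t gauge functions are simple to evaluate directly. The sketch below uses illustrative function names and plain nested lists for $Q$ ; in the Gaussian case both vectors in the quadratic form use component-wise signed square roots, which makes $g$ homogeneous of order one.

```python
import math


def gaussian_gauge(x, Q):
    """g(x) = s^T Q s with s_i = sgn(x_i) |x_i|^{1/2}; Q is a precision matrix."""
    s = [math.copysign(math.sqrt(abs(xi)), xi) for xi in x]
    d = len(x)
    return sum(s[a] * Q[a][b] * s[b] for a in range(d) for b in range(d))


def student_t_gauge(x, nu):
    """g(x) = -sum_i |x_i| / nu + (1 + d / nu) * max_i |x_i|."""
    abs_x = [abs(xi) for xi in x]
    return -sum(abs_x) / nu + (1.0 + len(x) / nu) * max(abs_x)


identity = [[1.0, 0.0], [0.0, 1.0]]
print(gaussian_gauge((0.6, -0.8), identity))   # independence: |x_1| + |x_2|
print(student_t_gauge((1.0, 1.0), 1.0))
```

With $Q$ equal to the identity matrix, the Gaussian gauge reduces to $\sum_{i}|x_{i}|$ , the gauge of independent Laplace margins.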

We simulate from these copulas and apply the methodology described in Section 3 to estimate the gauge function on $\mathcal{S}^{d-1}$ . To quantify the accuracy of the estimated unit-level set, we measure the integrated squared error (ISE) of the form

\mbox{ISE}=\int_{\mathcal{S}^{d-1}}\{1/g_{0}(\bm{w})-1/\tilde{g}(\bm{w})\}^{2}\mathrm{d}\bm{w},

(12)

where $g_{0}$ and $\tilde{g}$ are, respectively, the theoretical and estimated (rescaled) gauge functions. We evaluate the integral in (12) using Monte-Carlo methods; specifically, we use the estimator $\widehat{\mbox{ISE}}=(A_{d}/|\mathcal{W}|)\sum_{\bm{w}\in\mathcal{W}}\{1/g_{0}(\bm{w})-1/\tilde{g}(\bm{w})\}^{2}$ , where $A_{d}=2\pi^{d/2}/\Gamma(d/2)$ is the surface area of $\mathcal{S}^{d-1}$ , $\mathcal{W}$ denotes the set of points described in Section 3.3, and $\Gamma(\cdot)$ is the gamma function.
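The Monte-Carlo estimator of the ISE can be sketched as follows, assuming the angles are drawn uniformly on the sphere; the function names are illustrative, and the gauge functions are passed in as ordinary Python callables.

```python
import math


def sphere_area(d):
    """Surface area A_d = 2 pi^{d/2} / Gamma(d/2) of the unit (d-1)-sphere."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


def ise_estimate(angles, g_true, g_est):
    """Monte-Carlo estimate of (12) from angles drawn uniformly on the sphere."""
    d = len(angles[0])
    total = sum((1.0 / g_true(w) - 1.0 / g_est(w)) ** 2 for w in angles)
    return sphere_area(d) * total / len(angles)
```

The factor $A_{d}$ converts the sample average into an integral with respect to surface measure; for instance, $A_{2}=2\pi$ and $A_{3}=4\pi$ .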

We validate estimates of the extended angular dependence function (ADF) by evaluating, for each considered copula, four joint probabilities. The first two are exceedance probabilities of the form $\Pr\{X_{i}>u,i=1,\dots,d\},$ with $u$ taken to be both the $0.99$ and $0.999$ standard Laplace quantiles. For the joint lower tail, we also evaluate $\Pr\{X_{i}<u,i=1,\dots,d\},$ with $u$ taken as the $0.01$ and $0.001$ standard Laplace quantiles. One of the strengths of the DeepGauge framework is that it is not limited to estimation of probabilities on hyper-cubes. To show this, we also estimate

\Pr\{u_{1,i}<X_{i}<u_{2,i},i=1,\dots,d\},

where $u_{1,i}=u_{1},u_{2,i}=\infty$ if $i$ is odd and, otherwise, $u_{1,i}=-\infty,u_{2,i}=u_{2}$ , where $u_{1}$ corresponds to the $0.999$ standard Laplace quantile. For $u_{2}$ , we consider two cases: $u_{2}$ as either the $0.2$ or $0.4$ standard Laplace quantile. We consider estimation of the two latter probabilities for only the Gaussian and Student-t copulas, as evaluation is permissible using standard computational software. This is not the case for the logistic copula, and we found that Monte-Carlo methods perform poorly. To quantify efficacy, we provide the mean absolute log error (MALE), taking the average over all probabilities. The order-of-magnitude for the probabilities ranges from approximately $10^{-3}$ to $10^{-37}$ .

4.3 Results

We first consider a setting with the quantile level fixed to $\tau=0.75$ and the architecture for $g(\cdot)$ (equivalently, $\tilde{g}$ ) taken to be a neural network with $N=3$ hidden layers of width 64 (see Appendix A.1 for details), but allow the sample size $n$ and dimension $d$ to vary: we consider $d=3$ , $d=5$ , and $d=8$ , as well as $n=10,000$ , $n=50,000$ , $n=100,000$ , and $n=250,000$ . For $r_{\tau}(\cdot)$ , we use a neural network with $N=3$ hidden layers of width $32$ and keep this fixed throughout. Results for this setting are provided in Figure A2 of Appendix C. As expected, we generally observe decreasing ISE and MALE with increasing sample size $n$ across all dimensions and copulas.

We then consider a setting with the sample size $n$ and dimension $d$ fixed to investigate how the choice of hyper-parameters impacts the model fits. We fix $n=100,000$ and $d=5$ or $d=8$ , but, across two scenarios, we allow the quantile level and architecture of $g(\cdot)$ to vary. For the first scenario, we consider a sequence of quantile levels $\tau\in\{0.1,0.3,\dots,0.9\}$ and take $g(\cdot)$ to be a neural network with $N=3$ hidden layers and with $32$ nodes per layer. For the second scenario, we fix $\tau=0.75$ and vary the architecture for $g(\cdot)$ . We consider eight neural networks with $N=1$ to $N=4$ hidden layers, and with consistent width across layers; we take this to be either 16 or 64. Results for the first and second scenarios are presented in Figure 6 and Table A1, respectively.

Figure 6 illustrates the simulation results for varying quantile level, $\tau$ . We find that the optimal value of $\tau$ (in terms of minimising the performance metrics) differs across the copulas and dimension. For the Gaussian and Student-t copulas, we find that higher values of $\tau$ are preferable when estimating the unit-level set, as the ISE is reduced; the converse holds for the logistic copula. When considering the MALE, we find that the Gaussian and Student-t copulas tend to favour a lower value of $\tau$ , particularly for the high dimensional case where $d=8$ . Conversely, for the logistic copula, the MALE is minimised with the largest value of $\tau$ . We note that the logistic copula has the quickest rate of convergence to the truncated gamma model described in (7), which may explain why this copula benefits from a smaller $\tau$ (i.e., more data is available for inference).

Table A1 in Appendix C presents the simulation results for a fixed quantile level ${\tau=0.75}$ , but with varying architecture for $g(\cdot)$ . We find that, for the Gaussian and Student-t copulas, the ISE is minimised when using the more complicated architecture, e.g., 3–4 layers of width 64; the converse holds for the logistic copula. Note that the logistic copula is determined by a single parameter, and so its corresponding gauge function has a much simpler shape than those of the other copulas (see Figure 2 for the case when $d=2$ ). Hence, it is unsurprising that the logistic copula favours a simpler neural network when considering the ISE. For the MALE, the results are less clear: all three copulas generally favour an architecture with fewer parameters (and, hence, a larger effective sample size), except the Gaussian copula with $d=5$ . In applications, the shape of the gauge function that arises from the data generating process is likely to be quite complex. Figure 6 and Table A1 suggest that, if the goal is to accurately estimate the unit-level set $\partial\mathcal{G}$ , then one should consider using a more complicated neural network architecture for $g(\cdot)$ . However, a more conservative approach should be taken if we wish to accurately estimate tail probabilities; reducing the quantile level $\tau$ and using a simpler neural network architecture will increase the effective sample size used in inference for model (7), and thus improve estimation of tail probabilities.

Generally, we find some variation in model fits across different choices of architecture and quantile level $\tau$ , but the results are broadly similar. From this study, we can conclude that there is no one-size-fits-all approach to fully optimising hyper-parameters of the DeepGauge framework. In practice, where the true gauge function and underlying data generating process are unknown, we advocate that practitioners fit a collection of DeepGauge models with various quantile levels and architectures, and choose the best model fit using the goodness-of-fit metrics detailed in Section 3.5.

5 Analysis of the NORA10 Metocean Data

5.1 Overview

We demonstrate the efficacy of the DeepGauge framework by modelling extremal dependence of metocean variables associated with severe ocean events. Such events typically occur when multiple metocean variables are simultaneously extreme, and pose a risk to offshore and coastal structures, such as wind farms and oil platforms (Shooter et al.,, 2021, 2022). Robust analysis of joint extremes is therefore crucial in this setting for informed decision making.

Our study uses the NORA10 hindcast gridded data set (NOrwegian ReAnalysis 10km, Reistad et al.,, 2011), which provides three-hourly wave fields over the Norwegian Sea, the North Sea, and the Barents Sea, at a spatial resolution of 10km. We focus on an area between the British Isles and Iceland for the period between September 1957 and December 2009 inclusive, which amounts to $n=152,917$ observations. These data have also previously been analysed by Shooter et al., (2022). We run two analyses to illustrate the efficacy of the DeepGauge framework for dimensions $d=3$ and $d=5$ . For both analyses, hyper-parameters (e.g., model architecture and quantile level $\tau$ ) were optimised with respect to the goodness-of-fit diagnostics described in Section 3.5. As our interest is in modelling extremal dependence and not marginal extremes, the data are a priori transformed to standard Laplace margins using a standard rank transform, applied independently at each spatial location and for each variable. All study locations are plotted in Figure 7.
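A rank transform to standard Laplace margins can be sketched as below. The exact convention is not stated in the text, so the scaling $u=\text{rank}/(n+1)$ and the handling of ties by order of appearance are assumptions, as are the function names.

```python
import math


def laplace_quantile(u):
    """Quantile function of the standard Laplace distribution, 0 < u < 1."""
    return math.log(2.0 * u) if u < 0.5 else -math.log(2.0 * (1.0 - u))


def rank_to_laplace(values):
    """Map one variable to standard Laplace margins via its ranks.

    Uses u = rank / (n + 1); ties are broken by order of appearance.
    """
    n = len(values)
    order = sorted(range(n), key=lambda k: values[k])
    out = [0.0] * n
    for rank, k in enumerate(order, start=1):
        out[k] = laplace_quantile(rank / (n + 1))
    return out
```

The transform would be applied separately to each variable at each location, so that only the dependence structure, and not the marginal behaviour, is carried into the DeepGauge fit.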

First, we consider the joint behaviour of wind speed (ws; m/s), significant wave height (hs; m), and mean sea level pressure (mslp; hPa) separately at three locations (01, 46, and 85, from south to north), outlined in blue in Figure 7. These three variables are associated with severe ocean events (Ewans and Jonathan,, 2014), and we aim to model their joint tail behaviour separately at each of the three locations. Of the $d=3$ variables, ws and hs exhibit positive dependence, while the other two pairs of variables exhibit negative dependence; see Figure 9. Given the complex dependence structure of these data, it is likely that standard parametric extremal dependence models will perform poorly here; hence, these data provide the perfect candidate for illustrating the flexibility of our DeepGauge model.

Our second analysis models the joint tail behaviour of significant wave height (hs) across five locations (16, 33, 58, 72, and 92; highlighted in black, Figure 7) along a transect that runs approximately south-west to north-east, representing storm fronts that move along that direction. Simultaneous extremes at multiple locations pose higher risks than extremes at individual locations, and so quantification of joint extreme risk across these $d=5$ locations represents an important area for investigation.

For the $d=3$ and $d=5$ cases, we use $\tau=0.9$ and $\tau=0.4$ , respectively. The neural networks for $r_{\tau}(\cdot)$ and $\tilde{g}(\cdot)$ in the $d=3$ setting have identical architectures: three hidden layers, each with 64 neurons. For $d=5$ , both neural networks have two hidden layers; each hidden layer has width 16 for $r_{\tau}(\cdot)$ and width 64 for $\tilde{g}(\cdot)$ .

5.2 Results

5.2.1 Joint distribution of hs, ws, and mslp: $d=3$

Here we present results for the three metocean variables observed at location 01; results for the remaining two locations are presented in Appendix C. Figure 8 plots the estimates of the three-dimensional unit-level set and the extended ADF for the three variables, with the sample clouds transformed to standard Laplace margins. We can see that the estimated unit-level sets capture well the shape and joint tail behaviour of the sample cloud. We also plot bivariate unit-level set slices for each of the three pairs of variables in Figure 9. For these, we set $\epsilon=0.01$ (see Section 3.5), resulting in a reasonable amount of validation data for each unit-level set slice. We observe that the estimated slices closely match the shapes of the corresponding bivariate sample clouds, again suggesting good model fits to these data. Corresponding plots for the remaining two locations are provided in Figures A6–A9.

Goodness-of-fit is further verified using the truncated gamma QQ plot, extended ADF diagnostic, and return level set probabilities, as described in Section 3.5. The plots, provided in Figure A3 of Appendix C, illustrate strong agreement between model and empirical quantiles across all three diagnostics. The goodness-of-fit analyses for the remaining two locations also indicate good model fits over all diagnostics (Figures A4–A5 of Appendix C). For all three locations, we also provide animated three dimensional plots for the unit-level sets and extended ADFs in the Supplementary Material.

5.2.2 Joint distribution of hs along transect: $d=5$

For the $d=5$ joint distribution of significant wave height hs, we cannot visualise the unit-level set estimate. However, there are 10 possible bivariate unit-level slices corresponding to different pairs of locations; three of these are illustrated in Figure 10. In this case, we set $\epsilon=0.05$ to ensure a reasonable amount of validation data in each slice, accounting for the fact that angular observations are more sparse in higher dimensions (relative to $d=3$ ). As before, we observe that the estimated slices closely match the shapes of the corresponding bivariate sample clouds, indicating that the DeepGauge model can capture extremal dependence of hs across all five locations. Goodness-of-fit diagnostics, as illustrated in Figure A10 of Appendix C, show generally good agreement between model and empirical quantiles, albeit with some small divergences at extremely high quantiles (for the truncated gamma QQ plot).

The DeepGauge framework provides fantastic model fits for our metocean data, in both $d=3$ and $d=5$ dimensions. The estimated unit-level sets facilitate inference of the entire extremal dependence structure of the data, not only dependence in a single orthant of $\mathbb{R}^{d}$ , without the need for restrictive parametric models or low-order summary measures (see Section 1). Whilst we have omitted this part of the study for brevity, we note that estimates of our novel extended ADF permit estimation of joint tail probabilities on arbitrary hyper-cubes; our accompanying diagnostics for the ADF estimates suggest high accuracy of such estimates. Finally, we remark that our favourable results for the case where $d=5$ are particularly encouraging. Flexible models for extremal dependence are typically limited to $d\leq 3$ , but we have shown that this limitation does not hold for the DeepGauge framework.

6 Discussion

We introduce novel theoretical contributions for the geometric representation of multivariate extremes, which include extensions of the Wadsworth and Tawn, (2013) model and a demonstration that many of the results introduced by Nolde and Wadsworth, (2022) can be extended to data on Laplace margins. We further introduce the DeepGauge modelling framework and, through rigorous theoretical treatment and in contrast to all existing models, we prove that our framework gives estimates that satisfy all of the required theoretical properties of unit-level sets. In practice, this results in consistent inference of extremal dependence and provides a means to estimate joint tail properties; we demonstrate this by evaluating the performance of the DeepGauge framework on simulated and observed data in Sections 4 and 5, respectively. Unlike the majority of existing models for multivariate extremes, the DeepGauge framework is not limited to asymptotically dependent data nor low dimension $d\leq 3$ . Thus, our approach represents a significant step towards flexible, non-parametric, and scalable models for multivariate extremes.

We acknowledge that there is no theoretical guarantee that our proposed estimator for $\partial\mathcal{G}$ will converge to the true unit-level set. One could, for instance, try to prove consistency of the rescaled gauge function estimator, $\tilde{g}(\cdot)$ , for all angles in the set $\mathcal{S}^{d-1}$ . However, theoretical results of this nature usually necessitate strict assumptions, which themselves can be difficult to verify. In practice, however, one can only ever look at diagnostics obtained from the data; we have therefore opted for a more practical treatment of our proposed estimator, and leave proofs of theoretical convergence for future work.

One noticeable omission from the DeepGauge framework is a model for the angular density $f_{\bm{W}}(\bm{w}),\bm{w}\in\mathcal{S}^{d-1}$ ; see Section 3.4 for further discussion. Several existing approaches propose non-parametric techniques for this estimation, albeit in low dimensional ( $d\leq 3$ ) settings (Papastathopoulos et al.,, 2024; Murphy-Barltrop et al., 2024a, ; Simpson and Tawn, 2024b, ). Models for the angular density offer the advantage of simulation from the model described in equation (7), from which one can obtain probability estimates in joint tail regions of any form, i.e., more general regions than those illustrated in Figure 4; see Wadsworth and Campbell, (2024) for instance. The estimation of, and simulation from, the angular density function $f_{\bm{W}}(\bm{w})$ therefore represents an important area for future work in the context of multivariate extremes.

Finally, given an estimate of $\partial\mathcal{G}$ for some $d\geq 3$ , we remark that our modelling framework provides no means of obtaining lower dimensional unit-level sets, e.g., the unit-level set of $(X_{\mathfrak{i}},X_{\mathfrak{j}})$ for indices $(\mathfrak{i},\mathfrak{j}),$ with $1\leq\mathfrak{i}<\mathfrak{j}\leq d$ . This is because we only estimate the gauge function $g(\cdot)$ over $\mathcal{S}^{d-1}$ , and to obtain a lower-dimensional gauge function, such as $g_{(X_{\mathfrak{i}},X_{\mathfrak{j}})}(\cdot)$ , would require us to minimise $g(\cdot)$ over all components in $\mathcal{D}\setminus\{\mathfrak{i},\mathfrak{j}\}$ (Nolde and Wadsworth,, 2022); this is only possible if we can evaluate $g(\cdot)$ on the whole of $\mathbb{R}^{d}$ . Future work could investigate techniques for extracting lower dimensional unit-level sets from higher dimensional estimates, which would avoid the need for refitting.

Acknowledgements

Reetam Majumder was supported by grants from the United States Geological Survey’s National Climate Adaptation Science Center (G24AC00197), and the National Science Foundation (DMS2152887). The authors would like to thank Phil Jonathan for providing the data, Ryan Campbell for access to code, and Lambert de Monte for helpful discussions.

Appendix

Appendix A Neural network construction and inference

A.1 Neural network representation

We here describe the construction of $r_{\tau}(\cdot)$ and $g(\cdot)$ using multi-layer perceptrons. Let $r_{\tau}(\bm{w}):=\exp\{m_{\bm{\psi}}(\bm{w})\},$ where $m_{\bm{\psi}}(\bm{w})$ is a neural network, parameterised by $\bm{\psi}$ , that comprises $N\in\mathbb{N}$ hidden layers. Then $m_{\bm{\psi}}$ is a composition of the form ${m_{\bm{\psi}}(\cdot)=m_{\bm{\psi}}^{(N+1)}\circ\dots\circ\mathbf{m}^{(1)}_{\bm{\psi}}(\cdot)}$ , where, for $j=1,\dots,N$ , the output from the $j$ -th hidden layer, $\mathbf{m}^{(j)}_{\bm{\psi}}:\mathbb{R}^{n_{j-1}}\mapsto\mathbb{R}^{n_{j}}$ , can be written as

\bm{w}^{(j)}:=\mathbf{m}^{(j)}(\bm{w}^{(j-1)})=\text{ReLU}\left(\mathbf{M}^{(j)}\bm{w}^{(j-1)}+\mathbf{b}^{(j)}\right),

(A.1)

where $\bm{w}^{(0)}:=\bm{w}$ is the input angle vector with dimension $n_{0}=d$ and the final layer is $m^{(N+1)}(\bm{w}^{(N)})=\mathbf{M}^{(N+1)}\bm{w}^{(N)}+\mathbf{b}^{(N+1)}$ . The ReLU function evaluates the component-wise maxima of its input and a vector of zeroes (of suitable length), i.e., ReLU( $\bm{x})=(\max\{x_{1},0\},\max\{x_{2},0\},\dots)^{T}$ , where $\bm{x}=(x_{1},x_{2},\dots)^{T}$ is a vector of finite (unspecified) length. Each layer $j$ , for $j=1,\dots,N+1,$ is parameterised by an estimable weight matrix and bias vector, ${\mathbf{M}^{(j)}\in\mathbb{R}^{n_{j}\times n_{j-1}}}$ and $\mathbf{b}^{(j)}\in\mathbb{R}^{n_{j}}$ , respectively. Thus, the neural network is fully-parameterised by the collection of parameters $\bm{\psi}=\{\{\mathbf{M}^{(j)},\mathbf{b}^{(j)}\}_{j=1}^{N+1}\}$ . The dimension $n_{j}$ of the output of layer $\mathbf{m}^{(j)}$ , as well as the “depth” $N$ of $m_{\bm{\psi}}(\cdot)$ , are referred to as the neural network’s architecture and they are tunable hyper-parameters.
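The composition in (A.1) can be sketched in a few lines of Python with NumPy. This is a minimal illustration of the forward pass only, not the keras implementation used in our analyses; the helper names `init_params`, `mlp`, and `r_tau`, and the small random initialisation, are assumptions made for the example.

```python
import numpy as np

def relu(x):
    # Component-wise maximum of the input and zero.
    return np.maximum(x, 0.0)

def init_params(widths, rng):
    """widths = (n_0, n_1, ..., n_N, n_{N+1}); returns [(M^(j), b^(j))] for j = 1, ..., N+1."""
    # Hypothetical initialisation: small Gaussian weights and zero biases.
    return [(rng.normal(scale=0.1, size=(n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def mlp(w, params):
    """Evaluate m_psi(w): N ReLU hidden layers followed by an affine final layer."""
    h = np.asarray(w, dtype=float)
    for M, b in params[:-1]:
        h = relu(M @ h + b)          # hidden layer j, as in (A.1)
    M, b = params[-1]
    return M @ h + b                 # final layer N + 1 (no activation)

def r_tau(w, params):
    """Radial threshold r_tau(w) = exp{m_psi(w)}, which is strictly positive."""
    return float(np.exp(mlp(w, params))[0])
```

For example, `init_params((3, 64, 64, 64, 1), np.random.default_rng(0))` gives a network with $d=3$ inputs, three hidden layers of width 64, and a scalar output. The exponential link guarantees a positive threshold for any parameter values.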

A.2 Training and pre-training

The minimisation in (8) is performed using the Adaptive Moment Estimation (Adam) algorithm (Kingma and Ba,, 2014), which is a variant of mini-batch stochastic gradient descent; we use the default settings of the hyper-parameters for Adam. We utilise the R interface to keras (Allaire and Chollet,, 2021). Prior to training, we uniformly-at-random partition the data into $80\%$ training and $20\%$ validation, with this partition consistent across the training stage for both $r_{\tau}(\bm{w})$ and $\tilde{g}(\bm{w})$ . To mitigate overfitting of the neural network model, we train with checkpoints and early-stopping (Prechelt,, 2002). Neural network models are optimised for a finite number of pre-specified iterations, say $M$ , by minimising the loss function evaluated on the training data. The validation data are not used to optimise the neural network parameters. Instead, at every iteration of the training algorithm, we evaluate the loss function on the validation data. Then, the “best fitting” model is determined to be that which minimises the loss function, evaluated on the validation data, across all epochs. Early-stopping ensures that the training scheme does not necessarily run for all $M$ iterations; training stops early if the validation loss has not decreased in the last $\Delta\in\mathbb{N}$ iterations. We also add $L_{1}$ and $L_{2}$ penalties to the estimable neural network parameters, $\bm{\psi}$ , to provide further model regularisation; throughout we adopt these penalties with equal shrinkage weight of $1\times 10^{-4}$ . For further details on fitting and regularising deep neural network models, see Goodfellow et al., (2016).
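The checkpointing and early-stopping logic described above can be illustrated with a short, framework-free Python sketch. It operates on a pre-computed sequence of validation losses rather than an actual training loop, and the function name `early_stopping` is hypothetical. We interpret "has not decreased in the last $\Delta$ iterations" as "has not improved on its running minimum for $\Delta$ consecutive iterations"; this interpretation is an assumption.

```python
def early_stopping(val_losses, patience):
    """Return (best_iter, last_iter) for a sequence of validation losses.

    best_iter is the checkpointed iteration (smallest validation loss seen);
    last_iter is the iteration at which training halts. `patience` plays the
    role of Delta and len(val_losses) the role of M.
    """
    best_loss, best_iter = float("inf"), -1
    for it, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, it      # checkpoint the current model
        elif it - best_iter >= patience:
            return best_iter, it                 # no improvement for `patience` iterations
    return best_iter, len(val_losses) - 1        # ran for all M iterations
```

With `val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9, 0.6]` and `patience = 3`, training halts at iteration 5 and the model from iteration 2 is retained; the later improvement at iteration 7 is never reached, which illustrates the usual trade-off in choosing $\Delta$.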

Typically, without prior knowledge of the function that one seeks to approximate, the estimable parameter set, $\bm{\psi},$ of a neural network model is randomly initialised prior to training; thus when training our neural network model for the exceedance threshold $r_{\tau}(\cdot)$ we always choose to take this approach, as we have no prior knowledge of its structure. However, for training of the neural network that comprises the gauge function $g(\cdot)$ , we can exploit some of the theoretical structure of gauge functions to perform pre-training, which is a popular technique used in applications of neural networks to reduce computation times. Pre-training uses parameter estimates of a neural network trained for one task as initial parameter estimates for a neural network model designed for a different task (with equivalent architecture; see Goodfellow et al., (2016), Ch. 8). Following Corollary 2.2, we can use an estimate of the radial exceedance threshold, say $\hat{r}_{\tau}(\bm{w}),$ to obtain an initial estimate for the gauge function, say $\hat{g}_{\tau}(\bm{w})$ ; this is achieved by replacing the radial function $h(\cdot)$ in Corollary 2.2 with $\hat{r}_{\tau}(\cdot)$ and deriving the corresponding rescaled gauge function. In the two dimensional setting, Wadsworth and Campbell, (2024) use a similar idea to get an initial estimate for the unit-level set $\partial\mathcal{G}$ . The authors also show that, for $\tau$ close to 1, the rescaled quantile set is a valid approximation of $\partial\mathcal{G}$ . We use this idea to pre-train the neural network of the gauge function model $\tilde{g}(\cdot)$ ; we perform an initial optimisation of its weights subject to minimisation of the squared-error loss between the initial gauge function $\hat{g}_{\tau}(\bm{w})$ and the model output, $\tilde{g}(\bm{w})$ .
For $\alpha$ , we always set its initial estimate to $d$ , as this is the theoretical value attained by gauge functions for many popular copula models (Wadsworth and Campbell,, 2024). In unreported experiments, we found that this pre-training helps to mitigate numerical instability during training and improves accuracy of gauge function estimation.

With the initial pre-trained weights, we estimate the rescaled gauge function, from which we obtain estimates of the unit-level set $\partial\mathcal{G}$ . Selection of tuning parameters is discussed in Section 4 of the main text.

Appendix B Additional proofs

B.1 Proof of Proposition 2.1

Proof.

First, observe that for any $X_{E}\sim\text{Exp}(1)$ , we have that the variable given by

X:=\begin{cases}\log(1-e^{-X_{E}})+\log(2),&X_{E}\leq\log(2),\\ X_{E}-\log(2),&X_{E}>\log(2),\end{cases}

is standard Laplace distributed, where the $\log(2)$ here denotes the median of $X_{E}$ . Given $\bm{w}\in\mathcal{S}^{d-1}_{+}$ we have, by definition, $g_{E}(\bm{w})=\lim_{t\to\infty}(-\log f_{\bm{X}_{E}}(t\bm{w})/t)$ , where $f_{\bm{X}_{E}}(\cdot)$ denotes the continuous joint density function for $\bm{X}_{E}$ . For any large $t$ satisfying $t>\max_{i=1,\dots,d}\{\log(2)/w_{i}\}>0$ , we have that

\frac{-\log f_{\bm{X}_{E}}(t\bm{w})}{t}=\frac{-\log f_{\bm{X}}(t\bm{w}-{\log 2})}{t},

where $f_{\bm{X}}(\cdot)$ is the continuous joint density function for $\bm{X}$ and the Jacobian of the transformation $\bm{X}\mapsto\bm{X}_{E}$ equals $1$ for the considered coordinates. For a fixed $\bm{w}$ , set $t^{*}:=||t\bm{w}-{\log 2}||=t||\bm{w}-{\log 2}/t||$ and $\bm{w}^{*}:=(t\bm{w}-{\log 2})/t^{*}=(\bm{w}-{\log 2}/t)/||\bm{w}-{\log 2}/t||$ . We then observe that $t^{*}\sim t$ and $\bm{w}^{*}\sim\bm{w}$ as $t\to\infty$ . Therefore, it follows that

\frac{-\log f_{\bm{X}}(t\bm{w}-{\log 2})}{t}=\frac{-\log f_{\bm{X}}(t^{*}\bm{w}^{*})}{t}\sim\frac{-\log f_{\bm{X}}(t\bm{w})}{t}.

Taking the limit as $t\to\infty$ , the result follows. ∎
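The marginal transformation from standard exponential to standard Laplace used at the start of this proof can be checked numerically. The Python sketch below is purely illustrative (the function names `exp_to_laplace` and `laplace_cdf` are hypothetical): it verifies that the standard Laplace distribution function evaluated at the transformed value coincides with the standard exponential distribution function at the original value, and that the two scales differ by exactly $\log(2)$ above the median.

```python
import numpy as np

def exp_to_laplace(x_e):
    """Map standard exponential values (x_e > 0) to the standard Laplace scale."""
    x_e = np.asarray(x_e, dtype=float)
    log2 = np.log(2.0)
    lower = np.log(-np.expm1(-x_e)) + log2   # log(1 - exp(-x_e)) + log(2)
    upper = x_e - log2                       # pure location shift above the median
    return np.where(x_e <= log2, lower, upper)

def laplace_cdf(x):
    """Distribution function of the standard Laplace distribution."""
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, 0.5 * np.exp(x), 1.0 - 0.5 * np.exp(-x))

x_e = np.linspace(0.01, 10.0, 200)
# Probability integral transforms agree: F_Laplace(X) = F_Exp(X_E) = 1 - exp(-X_E).
assert np.allclose(laplace_cdf(exp_to_laplace(x_e)), -np.expm1(-x_e))
```

Since the map is a location shift for $X_{E}>\log(2)$, the two densities agree up to this shift in the region relevant as $t\to\infty$, which is the property exploited in the proof.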

B.2 Proof of Proposition 2.3

Proof.

To prove the star-shaped property, it suffices to show that for any $\bm{x}\in\mathcal{H}$ , we have $t\bm{x}\in\mathcal{H}$ for all $t\in[0,1]$ . This is trivial for $\bm{x}=\bm{0}_{d}$ . Taking $\bm{x}\in\mathcal{H}\setminus\{\bm{0}_{d}\}$ , we see that

||t\bm{x}||=t||\bm{x}||\leq th(\bm{x}/||\bm{x}||)\leq h(\bm{x}/||\bm{x}||)=h(t\bm{x}/||t\bm{x}||),

and thus $t\bm{x}\in\mathcal{H}$ . Moreover, given $\bm{x}\in\mathcal{H}\setminus\{\bm{0}_{d}\}$ , we have

||\bm{x}||\leq h(\bm{x}/||\bm{x}||)\leq\frac{1}{\big\|\bm{x}/||\bm{x}||\big\|_{\infty}}=\frac{||\bm{x}||}{||\bm{x}||_{\infty}},

so $||\bm{x}||_{\infty}\leq 1$ , implying $-1\leq x_{i}\leq 1$ for all $i=1,\ldots,d$ ; thus, $\mathcal{H}\subset[-1,1]^{d}$ . This also implies that $\mathcal{H}$ is bounded; thus, to prove $\mathcal{H}$ is compact, we just need to show that $\mathcal{H}$ is closed, owing to the Heine–Borel theorem (see, e.g., Hayes,, 1956). Let $(\bm{x}_{n})_{n\in\mathbb{N}}$ denote a sequence in $\mathcal{H}$ such that $\bm{x}_{n}\to\bm{x}$ as $n\to\infty$ , i.e., $||\bm{x}_{n}-\bm{x}||\to 0$ . We must show that $\bm{x}\in\mathcal{H}$ . If $\bm{x}=\bm{0}_{d}$ , the proof is trivial. Thus, we assume $\bm{x}\neq\bm{0}_{d}$ . Given $\epsilon>0$ , there exists $N_{1}\in\mathbb{N}$ such that $||\bm{x}_{n}-\bm{x}||<\epsilon$ for all $n\geq N_{1}$ . Furthermore, since ${h}(\cdot)$ is continuous, there must exist $N_{2}\in\mathbb{N}$ such that $|{h}(\bm{x}_{n}/||\bm{x}_{n}||)-{h}(\bm{x}/||\bm{x}||)|<\epsilon$ for all $n\geq N_{2}$ . Setting $N:=\max\{N_{1},N_{2}\}$ , we have that

||\bm{x}||=||\bm{x}-\bm{x}_{n}+\bm{x}_{n}||\leq||\bm{x}-\bm{x}_{n}||+||\bm{x}_{n}||<\epsilon+{h}(\bm{x}_{n}/||\bm{x}_{n}||)<2\epsilon+{h}(\bm{x}/||\bm{x}||).

Taking the limit as $\epsilon\to 0$ , we have $\bm{x}\in\mathcal{H}$ ; thus, $\mathcal{H}$ is compact. ∎

B.3 Proof of Proposition 2.4

To prove Proposition 2.4, we require Lemma 2.1, which we first prove below.

Proof of Lemma 2.1.

To prove bijectivity, it suffices to show that $\kappa$ is injective and surjective.

Injectivity: suppose $\kappa(\bm{w})=\kappa(\bm{w}^{*})$ for $\bm{w},\bm{w}^{*}\in\mathcal{S}^{d-1}$ , and define the constant

c:=\frac{\Big\lVert\left(\frac{w_{1}}{b_{1}(w_{1})},\ldots,\frac{w_{d}}{b_{d}(w_{d})}\right)\Big\rVert}{\Big\lVert\left(\frac{w^{*}_{1}}{b_{1}(w^{*}_{1})},\ldots,\frac{w^{*}_{d}}{b_{d}(w^{*}_{d})}\right)\Big\rVert}\in\mathbb{R}_{+}.

If $\kappa(\bm{w})=\kappa(\bm{w}^{*})$ , then for each $i=1,\ldots,d$ , we have $w_{i}=cw_{i}^{*}$ . Furthermore, since $\bm{w},\bm{w}^{*}\in\mathcal{S}^{d-1}$ , we see that $c=c||\bm{w}^{*}||=||c\bm{w}^{*}||=||\bm{w}||=1$ ; thus, $\bm{w}=\bm{w}^{*}$ .

Surjectivity: let $\bm{w}\in\mathcal{S}^{d-1}$ and without loss of generality, assume that $w_{d}\neq 0$ . Setting

\kappa^{-1}(\bm{w}):=\left(\frac{a(\bm{w})b_{1}(w_{1})w_{1}}{b_{d}(w_{d})|w_{d}|},\ldots,\frac{a(\bm{w})b_{d-1}(w_{d-1})w_{d-1}}{b_{d}(w_{d})|w_{d}|},a(\bm{w})\text{sgn}(w_{d})\right),

where $a(\bm{w}):=1/\sqrt{1+\sum_{j=1}^{d-1}(b_{j}(w_{j})w_{j}/b_{d}(w_{d})w_{d})^{2}}$ , it is straightforward to verify that $\kappa(\kappa^{-1}(\bm{w}))=\bm{w}$ , completing the proof. ∎

Proof of Proposition 2.4.

First, observe that we can rewrite $\widetilde{\partial\mathcal{H}}$ as

\widetilde{\partial\mathcal{H}}=\left\{\bm{w}h(\kappa^{-1}(\bm{w}))\left\lVert\left(\frac{\kappa^{-1}(\bm{w})_{1}}{b_{1}(\kappa^{-1}(\bm{w})_{1})},\ldots,\frac{\kappa^{-1}(\bm{w})_{d}}{b_{d}(\kappa^{-1}(\bm{w})_{d})}\right)\right\rVert\;\bigg|\;\bm{w}\in\mathcal{S}^{d-1}\right\},

where $\kappa^{-1}(\bm{w})_{i}$ denotes the $i$ -th component of $\kappa^{-1}(\bm{w})$ for each $i=1,\ldots,d$ . Since $\kappa(\cdot)$ is bijective, there exists a unique, bijective mapping between $\partial\mathcal{H}$ and $\widetilde{\partial\mathcal{H}}$ ; thus, the sets are in one-to-one correspondence. Furthermore, by definition, we have $-1\leq w_{i}h(\bm{w})/b_{i}(w_{i})\leq 1$ for each $i=1,\ldots,d$ and $\bm{w}\in\mathcal{S}^{d-1}$ ; thus $\widetilde{\partial\mathcal{H}}\subset[-1,1]^{d}$ . Finally, considering $i=1,\dots,d$ and setting $\bm{w}^{u,i}=\operatorname*{arg\,max}_{\bm{w}\in\mathcal{S}^{d-1}}\{w_{i}h(\bm{w})\}$ and $\bm{w}^{l,i}=\operatorname*{arg\,min}_{\bm{w}\in\mathcal{S}^{d-1}}\{w_{i}h(\bm{w})\}$ , we have $w^{u,i}_{i}h(\bm{w}^{u,i})/b_{i}(w^{u,i}_{i})=1$ and $w^{l,i}_{i}h(\bm{w}^{l,i})/b_{i}(w^{l,i}_{i})=-1$ , implying the componentwise maxima and minima equal $\bm{1}_{d}$ and $-\bm{1}_{d}$ , respectively. ∎

B.4 Proof of Corollary 2.1

Proof.

By definition, we have that $\bm{w}/\tilde{g}(\bm{w})\in\widetilde{\partial\mathcal{H}}\subset[-1,1]^{d}$ for any $\bm{w}\in\mathcal{S}^{d-1}$ . Applying similar reasoning to the proof of Proposition 2.2, we immediately see that $\tilde{g}(\bm{w})\geq||\bm{w}||_{\infty}$ . Furthermore, taking any $i=1,\dots,d$ , we have

\begin{aligned}
\|\kappa(\bm{w}^{u,i})\|_{\infty}&=\left\lVert\frac{\left(\frac{w^{u,i}_{1}}{b_{1}(w^{u,i}_{1})},\ldots,\frac{w^{u,i}_{d}}{b_{d}(w^{u,i}_{d})}\right)}{\Big\lVert\left(\frac{w^{u,i}_{1}}{b_{1}(w^{u,i}_{1})},\ldots,\frac{w^{u,i}_{d}}{b_{d}(w^{u,i}_{d})}\right)\Big\rVert}\right\rVert_{\infty}=\frac{\left\lVert h(\bm{w}^{u,i})\left(\frac{w^{u,i}_{1}}{b_{1}(w^{u,i}_{1})},\ldots,\frac{w^{u,i}_{d}}{b_{d}(w^{u,i}_{d})}\right)\right\rVert_{\infty}}{\left\lVert h(\bm{w}^{u,i})\left(\frac{w^{u,i}_{1}}{b_{1}(w^{u,i}_{1})},\ldots,\frac{w^{u,i}_{d}}{b_{d}(w^{u,i}_{d})}\right)\right\rVert}\\
&=\frac{1}{\Big\lVert h(\bm{w}^{u,i})\left(\frac{w^{u,i}_{1}}{b_{1}(w^{u,i}_{1})},\ldots,\frac{w^{u,i}_{d}}{b_{d}(w^{u,i}_{d})}\right)\Big\rVert}=\tilde{g}(\kappa(\bm{w}^{u,i})).
\end{aligned}

The same reasoning holds for $\bm{w}^{l,i}$ with $i=1,\ldots,d$ . ∎

B.5 Proof of Proposition 2.6

Proof.

Given any large $u$ satisfying $u>\max_{i=1,\dots,d}\{\log(2)/w_{i}\}>0$ , and applying similar reasoning as in the proof of Proposition 2.1 (Appendix B.1), we have that

\begin{aligned}
\Pr\left(\min_{i\in\mathcal{D}}\{X_{E,i}/w_{i}\}>u\right)&=\Pr\left(X_{E,i}>w_{i}u,i=1,\ldots,d\right)\\
&=\Pr\left(X_{i}>w_{i}u-\log 2,i=1,\ldots,d\right)\\
&\sim\Pr\left(X_{i}>w_{i}u,i=1,\ldots,d\right)\\
&=\Pr\left(\min_{i\in\mathcal{D}}\{X_{i}/w_{i}\}>u\right).
\end{aligned}

Taking $u\to\infty$ , we have

\Pr\left(\min_{i\in\mathcal{D}}\{X_{i}/w_{i}\}>u\right)=\Pr\left(\min_{i\in\mathcal{D}}\{X_{E,i}/w_{i}\}>u\right)=L(e^{u};\bm{w})e^{-\lambda(\bm{w})u},

proving the statement. ∎

B.6 Proof of Proposition 2.7

Proof.

Given $\bm{w}\in\mathcal{S}^{d-1}\setminus\mathcal{A}$ , set $\bm{c}:=\text{sgn}(\bm{w})=(\text{sgn}(w_{1}),\ldots,\text{sgn}(w_{d}))^{T}$ , where $\text{sgn}(\cdot)$ denotes the regular signum function. Then, we have that

\Pr\left(\min_{i\in\mathcal{D}}\{X_{i}/w_{i}\}>u\right)=\Pr\left(\min_{i\in\mathcal{D}}\{c_{i}X_{i}/c_{i}w_{i}\}>u\right)\to L_{\bm{c}\bm{X}}(e^{u};\bm{c}\bm{w})e^{-\lambda_{\bm{c}\bm{X}}(\bm{c}\bm{w})u},

as $u\to\infty$ , where $L_{\bm{c}\bm{X}}(\cdot;\bm{w})$ and $\lambda_{\bm{c}\bm{X}}(\cdot)$ denote the slowly varying function and ADF for $\bm{c}\bm{X}$ as in equation (2), respectively. The result follows with $\mathcal{L}(\cdot;\bm{w})=L_{\bm{c}\bm{X}}(\cdot;\bm{c}\bm{w})$ and $\Lambda(\bm{w})=\lambda_{\bm{c}\bm{X}}(\bm{c}\bm{w})$ . ∎

B.7 Proof of Proposition 2.8

Proof.

Let $\bm{c}=\text{sgn}(\bm{w})\in\{-1,1\}^{d}$ , and define $\bm{c}\bm{X}$ and $\bm{c}\bm{w}\in\mathcal{S}^{d-1}_{+}$ as in Proposition 2.5, noting that $||\bm{c}\bm{w}||_{\infty}=||\bm{w}||_{\infty}$ . Further, let $\partial\mathcal{G}_{\bm{c}\bm{X}}$ and $g_{\bm{c}\bm{X}}(\cdot)$ denote the unit-level set and gauge function, respectively, for $\bm{c}\bm{X}$ . For any $\mathfrak{r}\in\{\mathfrak{r}\in[0,1]:\mathfrak{r}\mathcal{R}_{\bm{c}\bm{w}}\cap\partial\mathcal{G}_{\bm{c}\bm{X}}\neq\emptyset\}$ , there must exist some $j\in\{1,\dots,d\}$ and $\bm{w}^{*}\in\mathcal{S}^{d-1}_{+}$ such that

\mathfrak{r}\frac{c_{j}w_{j}}{||\bm{c}\bm{w}||_{\infty}}=\frac{w^{*}_{j}}{g_{\bm{c}\bm{X}}(\bm{w}^{*})}.

By Proposition 2.5, $g_{\bm{c}\bm{X}}(\bm{w}^{*})=g(\bm{w}^{*}/\bm{c})$ , with $\bm{w}^{*}/\bm{c}:=(w^{*}_{1}/c_{1},\ldots,w^{*}_{d}/c_{d})\in\mathcal{S}^{d-1}\setminus\mathcal{A}$ and $\text{sgn}(\bm{w})=\text{sgn}(\bm{w}^{*}/\bm{c})$ . Thus, we have

\mathfrak{r}\frac{w_{j}}{||\bm{w}||_{\infty}}=\frac{w^{*}_{j}/c_{j}}{g(\bm{w}^{*}/\bm{c})},

implying $\mathfrak{r}\in\{\mathfrak{r}\in[0,1]:\mathfrak{r}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G}\neq\emptyset\}$ . Analogous reasoning shows that

\mathfrak{r}\in\{\mathfrak{r}\in[0,1]:\mathfrak{r}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G}\neq\emptyset\}\Rightarrow\mathfrak{r}\in\{\mathfrak{r}\in[0,1]:\mathfrak{r}\mathcal{R}_{\bm{c}\bm{w}}\cap\partial\mathcal{G}_{\bm{c}\bm{X}}\neq\emptyset\},

giving $\{\mathfrak{r}\in[0,1]:\mathfrak{r}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G}\neq\emptyset\}=\{\mathfrak{r}\in[0,1]:\mathfrak{r}\mathcal{R}_{\bm{c}\bm{w}}\cap\partial\mathcal{G}_{\bm{c}\bm{X}}\neq\emptyset\}$ and $\tilde{\mathfrak{r}}_{\bm{w}}=\max\{\mathfrak{r}\in[0,1]:\mathfrak{r}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G}\neq\emptyset\}=\mathfrak{r}_{\bm{c}\bm{w}}$ . Applying Proposition 3.3 of Nolde and Wadsworth, (2022), alongside Proposition 2.1, we see that

\Lambda(\bm{w})=\lambda_{\bm{c}\bm{X}}(\bm{c}\bm{w})=||\bm{c}\bm{w}||_{\infty}\times\mathfrak{r}_{\bm{c}\bm{w}}^{-1}=||\bm{w}||_{\infty}\times\tilde{\mathfrak{r}}_{\bm{w}}^{-1},

completing the proof.

∎

An illustration of Proposition 2.8 is given in Figure A1.

B.8 Proof of Corollary 2.3

Proof.

Let $\bm{w}/g(\bm{w})\in\partial\mathcal{G}$ denote coordinates on the unit-level set. Recalling the definitions of $\tilde{\mathfrak{r}}_{\bm{w}}$ and $\tilde{\mathcal{R}}_{\bm{w}}$ , we must have that $\tilde{\mathfrak{r}}_{\bm{w}}|w_{i}|/||\bm{w}||_{\infty}\geq|w_{i}|/g(\bm{w})$ for all $i=1,\ldots,d$ ; to see this, consider the two possibilities $\bm{w}/g(\bm{w})\in\{\tilde{\mathfrak{r}}_{\bm{w}}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G}\}$ and $\bm{w}/g(\bm{w})\notin\{\tilde{\mathfrak{r}}_{\bm{w}}\tilde{\mathcal{R}}_{\bm{w}}\cap\partial\mathcal{G}\}$ in turn. Consequently,

\frac{\tilde{\mathfrak{r}}_{\bm{w}}}{||\bm{w}||_{\infty}}\geq\min_{i=1,\dots,d}\left\{\frac{w_{i}}{g(\bm{w})w_{i}}\right\}=\frac{1}{g(\bm{w})}.

Recalling from Proposition 2.8 that $||\bm{w}||_{\infty}\Lambda(\bm{w})^{-1}=\tilde{\mathfrak{r}}_{\bm{w}}$ , the first inequality follows. The second inequality follows directly from Proposition 2.8.

∎

B.9 Algorithm for computing the extended ADF diagnostic

Consider a sample $\{\bm{x}_{j}\}_{j=1}^{n}$ of independent copies of $\bm{X}$ , with $\{(r_{j},\bm{w}_{j})\}_{j=1}^{n}$ denoting the corresponding radial-angular observations, and let $q$ denote a quantile level close to 1. The extended ADF diagnostic is computed using Algorithm 1.

Algorithm 1 Computing the extended ADF diagnostic.

$b\leftarrow 1$
for $j\leftarrow 1$ to $n$ do
    for $k\leftarrow 1$ to $n$ do
        $t_{\bm{w}_{j}}^{k}\leftarrow\min_{i=1,\ldots,d}\{x_{k,i}/w_{j,i}\}$
    end for
    $\mathbf{t}_{\bm{w}_{j}}\leftarrow(t_{\bm{w}_{j}}^{1},\ldots,t_{\bm{w}_{j}}^{n})$
    $\hat{u}_{\bm{w}_{j}}\leftarrow\texttt{quantile}(\mathbf{t}_{\bm{w}_{j}},q)$
    if $r_{j}>\hat{u}_{\bm{w}_{j}}$ then
        $e_{b}\leftarrow\hat{\Lambda}(\bm{w}_{j})(r_{j}-\hat{u}_{\bm{w}_{j}})$
        $b\leftarrow b+1$
    end if
end for

Note that $r_{j}=t_{\bm{w}_{j}}^{j}$ ; hence, the radial component can be considered an observation of the min-projection for the corresponding angle. Since $b$ is incremented after each assignment, the algorithm yields $b-1$ values; we compare the resulting quantiles, $(e_{1},\ldots,e_{b-1})^{T}$ , to the corresponding theoretical quantiles using a QQ plot.
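As a rough illustration, Algorithm 1 could be implemented along the following lines in Python. This is a sketch under stated assumptions rather than the code used for our analyses: the function name `extended_adf_diagnostic` and the callable `Lambda_hat` (standing in for a fitted extended ADF estimate) are hypothetical, the radial-angular decomposition is taken with respect to the Euclidean norm, and all angular components are assumed to be non-zero.

```python
import numpy as np

def extended_adf_diagnostic(x, q, Lambda_hat):
    """Sketch of Algorithm 1.

    x          : (n, d) array of observations on standard Laplace margins
    q          : quantile level close to 1
    Lambda_hat : callable mapping an angle (length-d array) to a positive float
    Returns the sorted rescaled exceedances e_1, ..., e_{b-1}.
    """
    x = np.asarray(x, dtype=float)
    r = np.linalg.norm(x, axis=1)        # radii (Euclidean norm assumed)
    w = x / r[:, None]                   # angles on the unit sphere
    e = []
    for j in range(x.shape[0]):
        # Min-projection of every observation onto the angle w_j (inner loop over k).
        t = np.min(x / w[j], axis=1)
        u_hat = np.quantile(t, q)        # empirical q-quantile of the min-projections
        if r[j] > u_hat:                 # r_j equals t[j] up to rounding
            e.append(Lambda_hat(w[j]) * (r[j] - u_hat))
    return np.sort(np.array(e))
```

The cost is of order $n^{2}d$, so for large samples one might evaluate the diagnostic on a random subset of angles; this subsampling is a practical suggestion and is not part of Algorithm 1.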

Appendix C Supplementary figures and tables

Table A1: Estimates of the median ($2.5\%$, $97.5\%$ percentiles) of the ISE and MALE for the simulation study of the architecture for the gauge function $g(\cdot)$, described in Section 4.2. Here the sample size is $n=100,000$, the radial quantile level is $\tau=0.75$, and the considered dimensions are $d=5$ and $d=8$.

		$d=5$		$d=8$
Copula	$g(\cdot)$	ISE	MALE	ISE	MALE
Gaussian	$(16)$	0.44 (0.27, 1.87)	4.58 (2.85, 9.03)	2.52 (0.80, 5.67)	14.4 (6.48, 20.8)
	$(16,16)$	0.16 (0.13, 0.23)	1.43 (0.50, 2.81)	0.28 (0.14, 0.93)	3.12 (1.02, 8.28)
	$(16,16,16)$	0.14 (0.11, 0.19)	1.11 (0.39, 2.27)	0.19 (0.11, 0.62)	3.87 (1.71, 6.83)
	$(16,16,16,16)$	0.14 (0.11, 0.22)	1.10 (0.35, 2.52)	0.23 (0.11, 0.61)	3.61 (1.19, 7.20)
	$(64)$	0.19 (0.16, 0.25)	1.88 (0.98, 3.02)	0.33 (0.22, 0.49)	2.44 (1.27, 4.49)
	$(64,64)$	0.14 (0.11, 0.20)	0.91 (0.28, 1.87)	0.16 (0.11, 0.24)	4.52 (2.31, 7.13)
	$(64,64,64)$	0.13 (0.10, 0.17)	0.88 (0.36, 1.80)	0.11 (0.08, 0.19)	5.94 (3.58, 9.14)
	$(64,64,64,64)$	0.12 (0.10, 0.18)	0.97 (0.37, 2.04)	0.11 (0.08, 0.17)	6.44 (3.13, 8.83)
Logistic	$(16)$	0.49 (0.40, 0.58)	0.35 (0.28, 0.45)	0.36 (0.26, 0.53)	0.53 (0.36, 0.66)
	$(16,16)$	0.67 (0.53, 0.91)	0.33 (0.24, 0.44)	0.52 (0.38, 0.69)	0.54 (0.39, 0.69)
	$(16,16,16)$	0.78 (0.62, 1.04)	0.32 (0.16, 0.44)	0.62 (0.50, 0.81)	0.54 (0.40, 0.66)
	$(16,16,16,16)$	0.85 (0.65, 1.03)	0.32 (0.21, 0.45)	0.66 (0.52, 0.90)	0.54 (0.40, 0.69)
	$(64)$	0.48 (0.39, 0.64)	0.34 (0.28, 0.45)	0.37 (0.31, 0.54)	0.56 (0.43, 0.68)
	$(64,64)$	0.69 (0.56, 1.01)	0.35 (0.29, 0.47)	0.57 (0.45, 0.78)	0.54 (0.42, 0.67)
	$(64,64,64)$	0.84 (0.66, 1.07)	0.35 (0.28, 0.46)	0.66 (0.53, 0.87)	0.55 (0.36, 0.68)
	$(64,64,64,64)$	0.91 (0.73, 1.07)	0.35 (0.23, 0.45)	0.78 (0.62, 0.98)	0.54 (0.39, 0.67)
Student-t	$(16)$	12.0 (6.92, 15.0)	1.73 (1.59, 1.97)	25.0 (17.7, 28.8)	3.74 (2.60, 5.40)
	$(16,16)$	4.39 (0.97, 7.51)	1.75 (1.58, 2.38)	16.2 (11.3, 21.1)	2.97 (2.36, 3.97)
	$(16,16,16)$	0.62 (0.25, 3.21)	2.09 (1.70, 2.72)	9.44 (4.05, 15.3)	3.06 (2.40, 4.70)
	$(16,16,16,16)$	0.43 (0.20, 2.07)	2.24 (1.77, 2.78)	6.42 (2.90, 13.1)	3.22 (2.43, 5.98)
	$(64)$	2.43 (1.33, 4.41)	1.76 (1.61, 2.12)	6.76 (4.36, 13.98)	3.13 (2.30, 4.83)
	$(64,64)$	0.30 (0.15, 0.64)	2.07 (1.64, 2.86)	1.20 (0.42, 3.83)	4.71 (2.80, 6.28)
	$(64,64,64)$	0.26 (0.15, 0.47)	2.26 (1.76, 2.80)	0.43 (0.25, 0.68)	6.28 (4.40, 9.00)
	$(64,64,64,64)$	0.26 (0.18, 0.45)	2.30 (1.87, 2.98)	0.40 (0.25, 0.70)	6.42 (4.13, 9.47)

References

Allaire and Chollet, (2021) Allaire, J. and Chollet, F. (2021). keras: R Interface to ’Keras’. R package version 2.7.0.
Balkema and Nolde, (2010) Balkema, G. and Nolde, N. (2010). Asymptotic independence for unimodal densities. Advances in Applied Probability, 42:411–432.
Boulaguiem et al., (2022) Boulaguiem, Y., Zscheischler, J., Vignotto, E., van der Wiel, K., and Engelke, S. (2022). Modeling and simulating spatial extremes by combining extreme value theory with generative adversarial networks. Environmental Data Science, 1:e5.
Cisneros et al., (2024) Cisneros, D., Richards, J., Dahal, A., Lombardo, L., and Huser, R. (2024). Deep graphical regression for jointly moderate and extreme Australian wildfires. Spatial Statistics, 59:100811.
Coles, (2001) Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values. Springer London.
Coles et al., (1999) Coles, S., Heffernan, J., and Tawn, J. (1999). Dependence measures for multivariate extremes. Extremes, 2:339–365.
Cooley et al., (2019) Cooley, D., Thibaud, E., Castillo, F., and Wehner, M. F. (2019). A nonparametric method for producing isolines of bivariate exceedance probabilities. Extremes, 22:373–390.
Davis et al., (1988) Davis, R. A., Mulrow, E., and Resnick, S. I. (1988). Almost sure limit sets of random samples in $\mathbb{R}^{d}$ . Advances in Applied Probability, 20:573–599.
Dawkins and Stephenson, (2018) Dawkins, L. C. and Stephenson, D. B. (2018). Quantification of extremal dependence in spatial natural hazard footprints: independence of windstorm gust speeds and its impact on aggregate losses. Natural Hazards and Earth System Sciences, 18:2933–2949.
Einmahl and Segers, (2009) Einmahl, J. H. and Segers, J. (2009). Maximum empirical likelihood estimation of the spectral measure of an extreme-value distribution. Annals of Statistics, 37:2953–2989.
Ewans and Jonathan, (2014) Ewans, K. and Jonathan, P. (2014). Recent advances in the analysis of extreme metocean events. In Offshore Technology Conference Asia, pages 3009–3019. OTC.
Fisher, (1969) Fisher, L. (1969). Limiting Sets and Convex Hulls of Samples from Product Measures. The Annals of Mathematical Statistics, 40:1824–1832.
Goodfellow et al., (2016) Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT press.
Hasan et al., (2022) Hasan, A., Elkhalil, K., Ng, Y., Pereira, J. M., Farsiu, S., Blanchet, J., and Tarokh, V. (2022). Modeling extremes with $d$ -max-decreasing neural networks. In Uncertainty in Artificial Intelligence, pages 759–768. PMLR.
Hayes, (1956) Hayes, C. A. (1956). The Heine-Borel theorem. The American Mathematical Monthly, 63:180.
Heffernan and Tawn, (2001) Heffernan, J. E. and Tawn, J. A. (2001). Extreme value analysis of a large designed experiment: a case study in bulk carrier safety. Extremes, 4:359–378.
Heffernan and Tawn, (2004) Heffernan, J. E. and Tawn, J. A. (2004). A conditional approach for multivariate extreme values. Journal of the Royal Statistical Society. Series B: Methodology, 66:497–546.
Huser et al., (2024) Huser, R., Opitz, T., and Wadsworth, J. (2024). Modeling of spatial extremes in environmental data science: Time to move away from max-stable processes. arXiv, 2401.17430.
Keef et al., (2013) Keef, C., Papastathopoulos, I., and Tawn, J. A. (2013). Estimation of the conditional distribution of a multivariate variable given that one of its components is large: Additional constraints for the Heffernan and Tawn model. Journal of Multivariate Analysis, 115:396–404.
Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kinoshita and Resnick, (1991) Kinoshita, K. and Resnick, S. I. (1991). Convergence of Scaled Random Samples in $\mathbb{R}^{d}$ . The Annals of Probability, 19:1640–1663.
Koenker et al., (2017) Koenker, R., Chernozhukov, V., He, X., and Peng, L. (2017). Handbook of Quantile Regression. Chapman and Hall/CRC.
Lafon et al., (2023) Lafon, N., Naveau, P., and Fablet, R. (2023). A VAE approach to sample multivariate extremes. arXiv preprint arXiv:2306.10987.
Ledford and Tawn, (1996) Ledford, A. W. and Tawn, J. A. (1996). Statistics for near independence in multivariate extreme values. Biometrika, 83:169–187.
Ledford and Tawn, (1997) Ledford, A. W. and Tawn, J. A. (1997). Modelling dependence within joint tail regions. Journal of the Royal Statistical Society. Series B: Methodology, 59:475–499.
Mackay and Haselsteiner, (2021) Mackay, E. and Haselsteiner, A. F. (2021). Marginal and total exceedance probabilities of environmental contours. Marine Structures, 75:1–24.
Mackay and Jonathan, (2023) Mackay, E. and Jonathan, P. (2023). Modelling multivariate extremes through angular-radial decomposition of the density function. arXiv, 2310.12711.
Majumder et al., (2024a) Majumder, R., Reich, B. J., and Shaby, B. A. (2024a). Modeling extremal streamflow using deep learning approximations and a flexible spatial process. The Annals of Applied Statistics, 18:1519–1542.
Majumder et al., (2024b) Majumder, R., Shaby, B. A., Reich, B. J., and Cooley, D. (2024b). Semiparametric estimation of the shape of the limiting bivariate point cloud. arXiv, 2306.13257.
Murphy-Barltrop et al., (2024a) Murphy-Barltrop, C. J. R., Mackay, E., and Jonathan, P. (2024a). Inference for bivariate extremes via a semi-parametric angular-radial model. arXiv, 2401.07259.
Murphy-Barltrop and Wadsworth, (2024) Murphy-Barltrop, C. J. R. and Wadsworth, J. L. (2024). Modelling non-stationarity in asymptotically independent extremes. arXiv, 2203.05860.
Murphy-Barltrop et al., (2023) Murphy-Barltrop, C. J. R., Wadsworth, J. L., and Eastoe, E. F. (2023). New estimation methods for extremal bivariate return curves. Environmetrics, e2797:1–22.
Murphy-Barltrop et al., (2024b) Murphy-Barltrop, C. J. R., Wadsworth, J. L., and Eastoe, E. F. (2024b). Improving estimation for asymptotically independent bivariate extremes via global estimators for the angular dependence function. arXiv, 2303.13237.
Neumann, (1951) Neumann, J. V. (1951). Various techniques used in connection with random digits. NBS Applied Mathematics Series, 12:36–38.
Nolde, (2014) Nolde, N. (2014). Geometric interpretation of the residual dependence coefficient. Journal of Multivariate Analysis, 123:85–95.
Nolde and Wadsworth, (2022) Nolde, N. and Wadsworth, J. L. (2022). Linking representations for multivariate extremes via a limit set. Advances in Applied Probability, 54:688–717.
Opitz, (2016) Opitz, T. (2016). Modeling asymptotically independent spatial extremes based on Laplace random fields. Spatial Statistics, 16:1–18.
Papastathopoulos et al., (2024) Papastathopoulos, I., de Monte, L., Campbell, R., and Rue, H. (2024). Statistical inference for radially-stable generalized Pareto distributions and return level-sets in geometric extremes. arXiv, 2310.06130.
Pasche and Engelke, (2022) Pasche, O. C. and Engelke, S. (2022). Neural networks for extreme quantile regression with an application to forecasting of flood risk. arXiv preprint arXiv:2208.07590.
Prechelt, (2002) Prechelt, L. (2002). Early stopping - but when? In Neural Networks: Tricks of the Trade, pages 55–69. Springer.
Ramos and Ledford, (2009) Ramos, A. and Ledford, A. (2009). A new class of models for bivariate joint tails. Journal of the Royal Statistical Society. Series B: Methodology, 71:219–241.
Reistad et al., (2011) Reistad, M., Breivik, Ø., Haakenstad, H., Aarnes, O. J., Furevik, B. R., and Bidlot, J.-R. (2011). A high-resolution hindcast of wind and waves for the North Sea, the Norwegian Sea, and the Barents Sea. Journal of Geophysical Research, 116:C05019.
Resnick, (2002) Resnick, S. (2002). Hidden Regular Variation, Second Order Regular Variation and Asymptotic Independence. Extremes, 5:303–336.
Richards and Huser, (2022) Richards, J. and Huser, R. (2022). Regression modelling of spatiotemporal extreme US wildfires via partially-interpretable neural networks. arXiv preprint arXiv:2208.07581.
Richards and Huser, (2024) Richards, J. and Huser, R. (2024). Extreme Quantile Regression with Deep Learning. In de Carvalho, M., Huser, R., Naveau, P., and Reich, B. J., editors, Handbook on Statistics of Extremes. Chapman & Hall / CRC.
Rootzén and Tajvidi, (2006) Rootzén, H. and Tajvidi, N. (2006). Multivariate generalized Pareto distributions. Bernoulli, 12:917–930.
Rothfuss et al., (2019) Rothfuss, J., Ferreira, F., Walther, S., and Ulrich, M. (2019). Conditional density estimation with neural networks: Best practices and benchmarks. arXiv preprint arXiv:1903.00954.
Ruzgas et al., (2021) Ruzgas, T., Lukauskas, M., and Čepkauskas, G. (2021). Nonparametric Multivariate Density Estimation: Case Study of Cauchy Mixture Model. Mathematics, 9:2717.
Shooter et al., (2021) Shooter, R., Ross, E., Ribal, A., Young, I. R., and Jonathan, P. (2021). Spatial dependence of extreme seas in the North East Atlantic from satellite altimeter measurements. Environmetrics, 32:1–15.
Shooter et al., (2022) Shooter, R., Ross, E., Ribal, A., Young, I. R., and Jonathan, P. (2022). Multivariate spatial conditional extremes for extreme ocean environments. Ocean Engineering, 247:110647.
Simpson and Tawn, (2024a) Simpson, E. S. and Tawn, J. A. (2024a). Estimating the limiting shape of bivariate scaled sample clouds: with additional benefits of self-consistent inference for existing extremal dependence properties. arXiv, 2207.02626.
Simpson and Tawn, (2024b) Simpson, E. S. and Tawn, J. A. (2024b). Inference for new environmental contours using extreme value analysis. Journal of Agricultural, Biological and Environmental Statistics.
Simpson et al., (2020) Simpson, E. S., Wadsworth, J. L., and Tawn, J. A. (2020). Determining the dependence structure of multivariate extremes. Biometrika, 107:513–532.
Tawn, (1990) Tawn, J. A. (1990). Modelling multivariate extreme value distributions. Biometrika, 77:245–253.
Wadsworth and Campbell, (2024) Wadsworth, J. L. and Campbell, R. (2024). Statistical inference for multivariate extremes via a geometric approach. Journal of the Royal Statistical Society Series B: Methodology, to appear.
Wadsworth and Tawn, (2013) Wadsworth, J. L. and Tawn, J. A. (2013). A new representation for multivariate tail probabilities. Bernoulli, 19:2689–2714.
Zhang et al., (2023) Zhang, L., Ma, X., Wikle, C. K., and Huser, R. (2023). Flexible and efficient spatial extremes emulation via variational autoencoders. arXiv preprint arXiv:2307.08079.
