On estimation and order selection for multivariate extremes via clustering

Shiyuan Deng, He Tang, Shuyang Bai
Department of Statistics, 310 Herty Dr., University of Georgia, Athens, GA 30602
Abstract

We investigate the estimation of multivariate extreme models with a discrete spectral measure using spherical clustering techniques. The primary contribution is a method for selecting the order, that is, the number of clusters. The method consistently identifies the true order, i.e., the number of spectral atoms, and is intuitive to implement in practice. Specifically, we introduce an extra penalty term to the well-known simplified average silhouette width, which penalizes small cluster sizes and small dissimilarities between cluster centers. Consequently, we provide a consistent method for determining the order of a max-linear factor model, where a typical information-based approach is not viable. Our second contribution is a large-deviation-type analysis for estimating the discrete spectral measure through clustering methods, which serves as an assessment of the convergence quality of clustering-based estimation for multivariate extremes. Additionally, as a third contribution, we discuss how estimating the discrete measure can lead to parameter estimation for heavy-tailed factor models. We also present simulations and real-data studies that demonstrate order selection and factor model estimation.

keywords:
clustering, factor models, max-linear models, multivariate extremes, order selection, silhouettes
MSC:
[2020] Primary 62G32, Secondary 60G70

1 Introduction

Multivariate extreme value theory concerns the statistical pattern of concurrent extreme values of multiple variables [1, 13]. A common approach to investigating this pattern is to standardize the marginal distributions of the variables and then examine the angular distribution of the extreme samples, that is, the data points with the largest norms. Under a natural assumption in the theory of multivariate extremes (i.e., the multivariate maximum domain of attraction), this angular distribution approximates a limit distribution on the unit sphere, known as the spectral (or angular) measure. See Section 2 below for more details.

Given that extremes inherently correspond to a reduced sample size, the challenge of handling high dimensionality is of heightened importance in this context. As noted in the review article [6], many efforts have focused on employing parsimonious modeling assumptions and techniques to reduce complexity. A particular parsimonious structure is a discrete spectral measure with a finite number of atoms; that is, the angular distribution of the extreme data points is approximately concentrated on a finite number of directions. Despite its simplicity, [8] showed that any extremal dependence structure can be arbitrarily well approximated by such a discrete spectral measure. In addition, a number of parametric models, including heavy-tailed max-linear and sum-linear models (see, e.g., [5]), as well as the recently introduced transformed-linear model of [2], are known to have a discrete spectral measure.

Recently, in what can also be viewed as attempts to provide a parsimonious summary of the angular distribution of multivariate extremes, several authors have applied clustering algorithms on the sphere on which the spectral measure resides. Einmahl et al. [5] and Janßen and Wan [20] applied the spherical $k$-means algorithm based on cosine dissimilarity [3] and addressed its relation to the estimation of max-linear factor models. Fomichov and Ivanovs [7] proposed the spherical $k$-principal-component ($k$-pc) algorithm, which is based on a modified cosine dissimilarity, and discussed its superiority in detecting the concentration of the spectral measure on lower-dimensional faces. Medina et al. [22] considered applying the spectral clustering algorithm [23] to the $k$-nearest neighbor graph constructed from the angular part of the extreme samples, and related it to sum-linear factor models.

As readily observed in the aforementioned works, there is a natural connection between a discrete spectral measure and spherical clustering: each atom of the spectral measure can be viewed as a cluster center (prototype), and the angular parts of the extreme samples form clusters around these atoms. In fact, this intuition has been rigorously explored by [20] and [22], where consistent recovery of the spectral measure based on their clustering algorithms was established (the consistency result of [20] also applies to the $k$-pc clustering of [7]). Since models such as the heavy-tailed max-linear and sum-linear factor models are essentially characterized by the spectral measure, consistent estimation of the spectral measure can, in principle, be converted into consistent estimation of the parameters of the factor models.

So far, in all the theoretical analyses of the works linking a discrete spectral measure to a clustering algorithm, the number of atoms, or equivalently the number of clusters, is assumed to be known. We refer to this number as the order, since it also relates to the order of the factor models mentioned above. In [20, 7, 22], ad hoc methods such as the elbow plot and the scree plot were used to guide the selection of the order in the real data analyses. These ad hoc methods rely solely on visual inspection to locate a vaguely defined “elbow” point, and lack theoretical justification.

In this work, we further explore clustering-based estimation of multivariate extreme models with a discrete spectral measure. The contributions of this work are threefold. The main contribution is the development of an order selection method that, on the theoretical side, consistently recovers the true order, and on the practical side, enjoys intuitive and simple implementation. Our method is based on a variant of the well-known silhouette method [25, 17]. In particular, we introduce an additional penalty term to the so-called simplified average silhouette width, which discourages small cluster sizes and small dissimilarity between cluster centers. As a consequence, we provide a method to consistently estimate the order of a max-linear factor model, for which a usual information-based method is not applicable due to the unavailability of a likelihood (e.g., [5, 28]). Our second contribution concerns a large-deviation-type result on discrete spectral measure estimation via clustering methods such as the spherical $k$-means and $k$-pc. This constitutes an attempt to address the quality of convergence for clustering-based estimation in the context of multivariate extremes. As a third contribution, we discuss how the discrete spectral measure estimation can be translated into parameter estimates of the heavy-tailed max-linear and sum-linear factor models. Simulation and real-data studies illustrating order selection and factor model estimation are also provided.

The paper is organized as follows. Section 2 provides background and preliminary results on multivariate extremes, spherical clustering, and their connection. The penalized silhouette method for order selection is introduced in Section 3. Section 4 offers some large-deviation-type analysis of convergence of clustering-based spectral estimation. Section 5 relates clustering-based spectral estimation to the estimation of certain heavy-tailed factor models. Section 6 presents simulation and real-data demonstrations of order selection and factor model estimation. By default, all vectors are column vectors. The notation $\delta_{\mathbf{w}}$ stands for a delta measure with unit mass at the point $\mathbf{w}$ in an appropriate measurable space.

2 Background

In this section, we provide some background information on multivariate extreme value theory, spherical clustering, and their connection.

2.1 Multivariate extreme value theory

In this subsection, we review some important elements of multivariate extreme value theory. We refer to [1, 13, 24] for more details.

Suppose that $\mathbf{X}=(X_{1},\ldots,X_{d})$ is a $d$-dimensional random vector taking values in $[0,\infty)^{d}$ with continuous marginal distributions, where $d\geq 2$. Many discussions in this paper can be extended to the case of $\mathbb{R}^{d}$-valued $\mathbf{X}$, although for simplicity, we restrict ourselves to the nonnegative orthant $[0,\infty)^{d}$, which is also most commonly encountered in practice. As is conventional in the analysis of multivariate extremes, the marginals and the extremal dependence are often modeled separately. We assume that $\mathbf{X}$ has been marginally standardized to share a standard $\alpha$-Pareto-like tail asymptotically:

$$\lim_{x\rightarrow\infty}x^{\alpha}\Pr(X_{1}>x)=\ldots=\lim_{x\rightarrow\infty}x^{\alpha}\Pr(X_{d}>x)=1, \tag{1}$$

where $\alpha>0$ is known, often chosen as $\alpha=1$ or $\alpha=2$ in the literature. The so-called multivariate regular variation (MRV) assumption on $\mathbf{X}$ requires

$$u\Pr\left(u^{-1/\alpha}\mathbf{X}\in\cdot\right)\overset{v}{\rightarrow}\Lambda(\cdot),\quad\text{as }u\rightarrow\infty, \tag{2}$$

where $\overset{v}{\rightarrow}$ denotes vague convergence of measures on $\mathbb{E}_{d}:=[0,\infty)^{d}\setminus\{\mathbf{0}\}$, and $\Lambda$ is an infinite measure on $\mathbb{E}_{d}$ known as the exponent measure. For the notion of vague convergence, we follow the formulation of [19] (termed $\mathcal{M}_{0}$-convergence there) that does not involve a compactification of $[0,\infty)^{d}$; see also [21] (termed $\text{vague}^{\#}$-convergence there). In particular, convergence (2) is characterized by convergence at any Borel subset $E$ of $\mathbb{E}_{d}$ whose boundary $\partial E$ is not charged by $\Lambda$ (i.e., $E$ is a $\Lambda$-continuity set), and which is bounded away from the origin $\mathbf{0}$.

The MRV assumption on $\mathbf{X}$ is equivalent to $\mathbf{X}$ being in the multivariate max-domain of attraction, i.e., convergence in distribution of the normalized component-wise maximum of i.i.d. samples of $\mathbf{X}$ towards a multivariate $\alpha$-Fréchet distribution with joint distribution function

$$F_{\alpha}(\mathbf{x}):=\exp\left[-\Lambda([0,\infty)^{d}\setminus[\mathbf{0},\mathbf{x}])\right], \tag{3}$$

where $[\mathbf{0},\mathbf{x}]=[0,x_{1}]\times\ldots\times[0,x_{d}]$, $x_{i}>0$, $i\in\{1,\ldots,d\}$. Moreover, the measure $\Lambda$ satisfies the homogeneity property $\Lambda(c\,\cdot)=c^{-\alpha}\Lambda(\cdot)$, and therefore admits a polar decomposition into a product of radial and angular parts. We shall follow the formulation in [1, Section 8.2.5], which allows the use of different norms for the radial and angular components. Suppose that $\|\cdot\|_{(r)}$ and $\|\cdot\|_{(s)}$ denote two arbitrary norms on $\mathbb{R}^{d}$.
Slightly abusing the notation, we still use $\Lambda$ to denote the push-forward measure of $\Lambda$ under the one-to-one mapping $\mathbb{E}_{d}\mapsto(0,\infty)\times\mathbb{S}_{+}^{d-1}$, $\mathbf{x}\mapsto(r,\mathbf{w}=(w_{1},\ldots,w_{d})):=\left(\|\mathbf{x}\|_{(r)},\mathbf{x}/\|\mathbf{x}\|_{(s)}\right)$, where

$$\mathbb{S}_{+}^{d-1}=\{\mathbf{x}\in[0,\infty)^{d}:\ \|\mathbf{x}\|_{(s)}=1\}. \tag{4}$$

We then have

$$\Lambda(dr,d\mathbf{w})=\Lambda_{\alpha}(dr,d\mathbf{w})=c_{(r)}\alpha r^{-\alpha-1}dr\times H(d\mathbf{w}), \tag{5}$$

where

$$c_{(r)}=\Lambda(\{\mathbf{x}\in[0,\infty)^{d}:\ \|\mathbf{x}\|_{(r)}\geq 1\}), \tag{6}$$

and $H$ is a probability measure on $\mathbb{S}_{+}^{d-1}$ known as the (normalized) spectral measure. The measure $H$ describes the angular distribution of the concurrence of the extreme values and characterizes the extremal dependence of $\mathbf{X}$. As a consequence of the marginal standardization, we have

$$\int_{\mathbb{S}_{+}^{d-1}}\left(\frac{w_{1}}{\|\mathbf{w}\|_{(r)}}\right)^{\alpha}H(d\mathbf{w})=\ldots=\int_{\mathbb{S}_{+}^{d-1}}\left(\frac{w_{d}}{\|\mathbf{w}\|_{(r)}}\right)^{\alpha}H(d\mathbf{w})=\frac{1}{c_{(r)}}. \tag{7}$$

In practice, commonly used norms are the $p$-norm $\|\mathbf{x}\|_{p}=\left(\sum_{j=1}^{d}|x_{j}|^{p}\right)^{1/p}$, $p\in(0,\infty)$, and the sup-norm $\|\mathbf{x}\|_{\infty}=\max\left(|x_{1}|,\ldots,|x_{d}|\right)$. In addition, the following weak convergence on $\mathbb{S}_{+}^{d-1}$ holds:

$$\Pr\left(\mathbf{X}/\|\mathbf{X}\|_{(s)}\in\cdot\mid\|\mathbf{X}\|_{(r)}\geq u\right)\overset{d}{\rightarrow}H(\cdot)=c_{(r)}^{-1}\Lambda\left(\mathbf{x}\in\mathbb{E}_{d}:\ \mathbf{x}/\|\mathbf{x}\|_{(s)}\in\cdot\ ,\ \|\mathbf{x}\|_{(r)}\geq 1\right) \tag{8}$$

as $u\rightarrow\infty$.
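As a rough illustration of how (8) is used empirically, the following sketch keeps the sample points with the largest radial norm and projects them onto the sphere. It is a minimal sketch under stated assumptions, not the procedure of any cited work: the function name `extreme_angles`, the parameter `k` (number of retained extremes), and the toy data with independent standard Pareto coordinates ($\alpha=1$) are all hypothetical choices made for illustration.

```python
import numpy as np

def extreme_angles(X, k, r_ord=2, s_ord=2):
    """Angular parts of the k rows of X with the largest radial norm.

    r_ord and s_ord select the p-norms playing the roles of ||.||_(r) and
    ||.||_(s); both names are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    radii = np.linalg.norm(X, ord=r_ord, axis=1)       # radial component ||x||_(r)
    idx = np.argsort(radii)[-k:]                       # indices of the k largest radii
    extremes = X[idx]
    scale = np.linalg.norm(extremes, ord=s_ord, axis=1, keepdims=True)
    return extremes / scale                            # points x / ||x||_(s) on the sphere

# Toy data (assumption): independent coordinates with Pr(X_j > x) = 1/x for x >= 1,
# which satisfies the tail standardization (1) with alpha = 1.
rng = np.random.default_rng(0)
X = rng.pareto(1.0, size=(5000, 3)) + 1.0
W = extreme_angles(X, k=100)
```

Keeping the `k` largest radii plays the role of conditioning on $\|\mathbf{X}\|_{(r)}\geq u$ for a high threshold $u$; the rows of `W` are the angular points to which a spherical clustering algorithm would then be applied.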

2.2 Spherical clustering

The spherical clustering algorithms that have been considered so far are performed exclusively on the unit sphere $\mathbb{S}_{+}^{d-1}$ with respect to the $2$-norm (Euclidean norm), that is, with $\|\cdot\|_{(s)}$ in (4) taken as $\|\cdot\|_{2}$. For generality, we do not make this assumption unless discussing specific examples. We equip $\mathbb{S}_{+}^{d-1}$ with the subspace topology inherited from $\mathbb{R}^{d}$. Next, we introduce a dissimilarity measure $D$ that satisfies the assumption below.

Assumption 1.

Suppose $D:\mathbb{S}_{+}^{d-1}\times\mathbb{S}_{+}^{d-1}\rightarrow[0,1]$ is continuous, and satisfies the following properties: for $\mathbf{w}_{i}\in\mathbb{S}_{+}^{d-1}$, $i\in\{1,2\}$, (i) $D(\mathbf{w}_{1},\mathbf{w}_{2})=0$ if and only if $\mathbf{w}_{1}=\mathbf{w}_{2}$; (ii) $D(\mathbf{w}_{1},\mathbf{w}_{2})=D(\mathbf{w}_{2},\mathbf{w}_{1})$.

Remark 2.1.

Without loss of generality, we shall assume that $D$ is properly normalized so that $D$ maps onto $[0,1]$. A nonnegative function $D$ satisfying (i) and (ii) is often referred to as a semimetric, which lacks the triangle inequality axiom of a metric. With the assumptions imposed, we have $\mathbf{w}_{n}\rightarrow\mathbf{w}$ on $\mathbb{S}_{+}^{d-1}$ if and only if $D(\mathbf{w}_{n},\mathbf{w})\rightarrow 0$ as $n\rightarrow\infty$, and the $D$-neighborhoods

$$B(\mathbf{w},r):=\{\mathbf{u}\in\mathbb{S}_{+}^{d-1}:\ D(\mathbf{w},\mathbf{u})<r\},$$

$\mathbf{w}\in\mathbb{S}_{+}^{d-1}$, $r>0$, form a topological basis of $\mathbb{S}_{+}^{d-1}$; see, e.g., [27, 9]. Note that due to the compactness of $\mathbb{S}_{+}^{d-1}$ and the continuity of $D$, the function

$$D^{\dagger}(\mathbf{w}_{1},\mathbf{w}_{2}):=\sup_{\mathbf{w}\in\mathbb{S}_{+}^{d-1}}|D(\mathbf{w},\mathbf{w}_{1})-D(\mathbf{w},\mathbf{w}_{2})| \tag{9}$$

is also a semimetric that is continuous on $\mathbb{S}_{+}^{d-1}\times\mathbb{S}_{+}^{d-1}$ and maps surjectively to $[0,1]$, which we refer to as the dual of $D$. Following from its definition, we have $D^{\dagger}\geq D$, and a triangular-like inequality holds:

$$D(\mathbf{w}_{1},\mathbf{w}_{3})\leq D(\mathbf{w}_{1},\mathbf{w}_{2})+D^{\dagger}\left(\mathbf{w}_{2},\mathbf{w}_{3}\right). \tag{10}$$
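For completeness, the two claims can be checked directly from definition (9). The short derivation below is a sketch that uses only properties (i) and (ii) of Assumption 1 and evaluates the supremum in (9) at the particular point $\mathbf{w}=\mathbf{w}_{1}$.

```latex
% D^\dagger \ge D: evaluate the supremum in (9) at w = w_1 and use D(w_1, w_1) = 0 from (i).
D^{\dagger}(\mathbf{w}_{1},\mathbf{w}_{2})
  \;\geq\; |D(\mathbf{w}_{1},\mathbf{w}_{1})-D(\mathbf{w}_{1},\mathbf{w}_{2})|
  \;=\; D(\mathbf{w}_{1},\mathbf{w}_{2}).

% Inequality (10): evaluate the supremum defining D^\dagger(w_2, w_3) at w = w_1.
D(\mathbf{w}_{1},\mathbf{w}_{3})-D(\mathbf{w}_{1},\mathbf{w}_{2})
  \;\leq\; |D(\mathbf{w}_{1},\mathbf{w}_{2})-D(\mathbf{w}_{1},\mathbf{w}_{3})|
  \;\leq\; \sup_{\mathbf{w}\in\mathbb{S}_{+}^{d-1}}
           |D(\mathbf{w},\mathbf{w}_{2})-D(\mathbf{w},\mathbf{w}_{3})|
  \;=\; D^{\dagger}(\mathbf{w}_{2},\mathbf{w}_{3}).
```

Rearranging the second display gives (10).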

Some common dissimilarity measures are only semimetrics but not metrics. Below, we consider $\|\cdot\|_{(s)}=\|\cdot\|_{2}$ so that $\mathbb{S}_{+}^{d-1}$ is the $2$-norm sphere. The cosine dissimilarity adopted in the spherical $k$-means of [3, 20] is given by

$$D_{\cos}(\mathbf{w}_{1},\mathbf{w}_{2})=1-\mathbf{w}_{1}^{\top}\mathbf{w}_{2}, \tag{11}$$

where $\mathbf{w}_{1},\mathbf{w}_{2}\in\mathbb{S}_{+}^{d-1}\subset\mathbb{R}^{d}$. The dissimilarity measure corresponding to the $k$-pc algorithm of [7] is given by

$$D_{\mathrm{pc}}(\mathbf{w}_{1},\mathbf{w}_{2})=1-\left(\mathbf{w}_{1}^{\top}\mathbf{w}_{2}\right)^{2}. \tag{12}$$

These two dissimilarity measures enjoy computational advantages, although neither of them is a metric. Note that since $\left|\left(\mathbf{w}_{1}^{\top}\mathbf{w}_{2}\right)^{2}-\left(\mathbf{w}_{1}^{\top}\mathbf{w}_{3}\right)^{2}\right|\leq 2|\mathbf{w}_{1}^{\top}\mathbf{w}_{2}-\mathbf{w}_{1}^{\top}\mathbf{w}_{3}|\leq 2\|\mathbf{w}_{2}-\mathbf{w}_{3}\|_{2}$, $\mathbf{w}_{i}\in\mathbb{S}_{+}^{d-1}\subset\mathbb{R}^{d}$, one obtains a bound for the dual semimetric as $D^{\dagger}\left(\mathbf{w}_{2},\mathbf{w}_{3}\right)\leq c\|\mathbf{w}_{2}-\mathbf{w}_{3}\|_{2}$ for $D=D_{\mathrm{cos}}$ or $D_{\mathrm{pc}}$, with constant $c=1$ or $2$ respectively.
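The two dissimilarities and the bound on the dual semimetric can be checked numerically. The sketch below is illustrative only: the helper names `random_sphere_points` and `dual_lower_estimate` are hypothetical, and the supremum in (9) is replaced by a maximum over a finite random grid, which gives only a lower estimate of $D^{\dagger}$. The first check uses $\mathbf{e}_{1}$, $\mathbf{e}_{2}$ and their normalized midpoint to show that $D_{\cos}$ violates the triangle inequality.

```python
import numpy as np

def d_cos(w1, w2):
    """Cosine dissimilarity (11) for unit 2-norm vectors."""
    return 1.0 - float(np.dot(w1, w2))

def d_pc(w1, w2):
    """k-pc dissimilarity (12) for unit 2-norm vectors."""
    return 1.0 - float(np.dot(w1, w2)) ** 2

def random_sphere_points(n, d, rng):
    """Hypothetical helper: n random points on the nonnegative unit 2-norm sphere."""
    P = np.abs(rng.normal(size=(n, d)))
    return P / np.linalg.norm(P, axis=1, keepdims=True)

def dual_lower_estimate(D, w2, w3, grid):
    """Maximum over a finite grid: a lower estimate of the supremum in (9)."""
    return max(abs(D(w, w2) - D(w, w3)) for w in grid)

# D_cos is not a metric: the triangle inequality fails for e1, e2 and their midpoint.
e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
mid = (e1 + e2) / np.sqrt(2.0)
triangle_fails = d_cos(e1, e2) > d_cos(e1, mid) + d_cos(mid, e2)

# Bound on the dual: the grid estimate never exceeds c * ||w2 - w3||_2.
rng = np.random.default_rng(1)
grid = random_sphere_points(2000, 3, rng)
w2, w3 = random_sphere_points(2, 3, rng)
gap = np.linalg.norm(w2 - w3)
bound_holds = {
    name: dual_lower_estimate(D, w2, w3, grid) <= c * gap + 1e-12
    for name, D, c in [("cos", d_cos, 1.0), ("pc", d_pc, 2.0)]
}
```

Because the grid maximum can only underestimate $D^{\dagger}$, this check is consistent with the stated bound but does not by itself prove it.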

To simplify the mathematical description of clustering of sample data, it is convenient to use the notion of multiset. Recall that a multiset $W$ on $\mathbb{S}_{+}^{d-1}$ is a set that allows repetition of its elements, whose support, denoted as $\operatorname{supp}{W}$, is a subset of $\mathbb{S}_{+}^{d-1}$ in the usual sense that eliminates repetitions in $W$. For instance, with two distinct points $\mathbf{w}_{1}$ and $\mathbf{w}_{2}$ on $\mathbb{S}_{+}^{d-1}$, one can have $W=\{\mathbf{w}_{1},\mathbf{w}_{1},\mathbf{w}_{2}\}$ with $\operatorname{supp}{W}=\{\mathbf{w}_{1},\mathbf{w}_{2}\}$.
A multiset $W$ can be characterized by the multiplicity function $m_{W}:\mathbb{S}_{+}^{d-1}\mapsto\{0,1,2,\ldots\}$, where $m_{W}(\mathbf{w})$ equals the number of repetitions of element $\mathbf{w}\in\mathbb{S}_{+}^{d-1}$ ($m_{W}(\mathbf{w})=0$ if $\mathbf{w}\notin\operatorname{supp}{W}$). A subset of $\mathbb{S}_{+}^{d-1}$ in the usual sense can be understood as a multiset with the multiplicity taking value either $0$ or $1$, with the empty set corresponding to a multiplicity function that is identically $0$. When the notation $\mathbf{w}\in W$ is used for a multiset $W$, it means that $\mathbf{w}$ is an element in $\operatorname{supp}{W}$.
For multisets $W_{1},W_{2}$ with multiplicity functions $m_{1}$ and $m_{2}$ respectively, their union $W_{1}\cup W_{2}$ is given by the multiset characterized by the multiplicity function $m_{1}\vee m_{2}$, and their intersection $W_{1}\cap W_{2}$ is given by the multiset characterized by $m_{1}\wedge m_{2}$. The relation $W_{1}\subset W_{2}$ is understood as $m_{1}\leq m_{2}$. Furthermore, if $\operatorname{supp}{W}$ is a finite set, a summation $\sum_{\mathbf{w}\in W}f(\mathbf{w})$ for a suitable function $f$ is understood as $\sum_{\mathbf{w}\in\operatorname{supp}{W}}f(\mathbf{w})m_{W}(\mathbf{w})$. For example, the cardinality of $W$ is defined as

$$|W|=\sum_{\mathbf{w}\in\operatorname{supp}{W}}m_{W}(\mathbf{w}).$$

Also we write $D(\mathbf{w},W)=\inf_{\mathbf{s}\in\operatorname{supp}{W}}D(\mathbf{w},\mathbf{s})$.
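The multiset conventions above can be mirrored in a few lines of code. The following is a minimal sketch, not part of the paper's method: it stores a multiset as a Python `Counter` mapping each support point to its multiplicity. The helper names are hypothetical, and the function `d_cos` assumes that the cosine dissimilarity takes the form $1-\mathbf{w}^{\top}\mathbf{s}$ for unit vectors, which should be checked against (11).

```python
from collections import Counter

# A multiset on the sphere is stored as a Counter: support point -> multiplicity.
# Points are tuples so that they can serve as dictionary keys.

def union(w1, w2):
    """Multiset union: pointwise maximum of the multiplicity functions."""
    return w1 | w2

def intersection(w1, w2):
    """Multiset intersection: pointwise minimum of the multiplicity functions."""
    return w1 & w2

def is_submultiset(w1, w2):
    """W1 is contained in W2 when m1 <= m2 pointwise."""
    return all(w2[x] >= m for x, m in w1.items())

def cardinality(w):
    """|W|: sum of multiplicities over the support."""
    return sum(w.values())

def multiset_sum(f, w):
    """Sum of f over W, each support point weighted by its multiplicity."""
    return sum(f(x) * m for x, m in w.items())

def d_cos(w, s):
    # ASSUMPTION: cosine dissimilarity 1 - <w, s> for unit vectors w and s.
    return 1.0 - sum(a * b for a, b in zip(w, s))

def dissimilarity_to_multiset(w, W, D=d_cos):
    """D(w, W): smallest dissimilarity from w to a point in supp W."""
    return min(D(w, s) for s, m in W.items() if m > 0)

w1, w2 = (1.0, 0.0), (0.0, 1.0)
W = Counter([w1, w1, w2])      # W = {w1, w1, w2}
print(set(W), cardinality(W))  # the support and |W|
```

Storing multiplicities rather than repeated entries keeps sums over $W$ proportional to the support size, which matters when many extremal angles coincide.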

Now suppose $W$ is a multiset on $\mathbb{S}_{+}^{d-1}$ with cardinality $|W|<\infty$. Suppose $k\in\mathbb{Z}_{+}$ and $k\leq|W|$. Let $A_{k}^{*}=\left\{\mathbf{a}_{1}^{*},\ldots,\mathbf{a}_{k}^{*}\right\}$ be a multiset on $\mathbb{S}_{+}^{d-1}$ with cardinality $k$, which satisfies

$$\sum_{\mathbf{w}\in W}D\left(\mathbf{w},A_{k}^{*}\right)=\inf\left\{\sum_{\mathbf{w}\in W}D(\mathbf{w},A):\ \operatorname{supp}{A}\subset\mathbb{S}_{+}^{d-1},\ |A|=k\right\}. \tag{13}$$

The existence of $A_{k}^{*}$ is guaranteed by the continuity of $D$ and the compactness of $\mathbb{S}_{+}^{d-1}$, although it may not be unique. Notice that when $|\operatorname{supp}{W}|\geq k$, the infimum in (13) must be achieved with a distinct set of $\mathbf{a}_{i}^{*}$'s. Below when multisets $C_{1},\ldots,C_{k}$ with multiplicity functions $m_{1},\ldots,m_{k}$ are said to form a partition of a multiset $W$ with multiplicity function $m$, it means that $m=m_{1}+\ldots+m_{k}$, and $m_{i}\neq 0$ for all $i\in\{1,\ldots,k\}$.

Definition 2.2.

A $k$-clustering of a multiset $W$, $1\leq k\leq|W|$, with respect to the dissimilarity measure $D$ refers to a pair $(A_{k}^{*},\mathfrak{C}_{k})$. Here $A_{k}^{*}$ is as described above, and $\mathfrak{C}_{k}=\{C_{1},\ldots,C_{k}\}$ is a partition of $W$ into a collection of multisets $C_{i}$'s such that $D\left(\mathbf{w},A_{k}^{*}\right)=D\left(\mathbf{w},\mathbf{a}_{i}^{*}\right)$ for all $\mathbf{w}\in C_{i}$, $i\in\left\{1,\ldots,k\right\}$. We refer to $A_{k}^{*}$ as the set of centers and each $C_{i}$ as a cluster.

Remark 2.3.

A $k$-clustering of $W$ always exists, although it may not be unique even when $A_{k}^{*}$ is unique: there may be points in $\operatorname{supp}{W}$ with the same $D$-dissimilarity to multiple centers. On the other hand, it is always possible to ensure non-emptiness of each cluster $C_{i}$ when $k\leq|W|$.

With the choices $D=D_{\mathrm{cos}}$ and $D_{\mathrm{pc}}$ in (11) and (12), respectively, a $k$-clustering corresponds to the spherical $k$-means and $k$-pc clustering of [3] and [7], respectively. Solving a $k$-clustering problem can be computationally hard, and typically, the solution can only be approximated by a heuristic algorithm such as a Lloyd-type iterative algorithm as in [3] and [7]. In the theoretical analysis of this paper, we assume that a $k$-clustering can be found accurately. In addition, when $W$ is later given by a random subsample $W_{n}$ of the total sample $(\mathbf{X}_{i})_{i=1,\ldots,n}$, we assume that the elements in $A_{k}^{*}$ and the labels $\mathbf{1}\left\{\mathbf{X}_{i}\in C_{j}\right\}$, $i\in\{1,\ldots,n\}$, $j\in\{1,\ldots,k\}$, are measurable.
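For illustration, a Lloyd-type iteration for the cosine case can be sketched as follows. This is a generic spherical $k$-means heuristic under stated assumptions, not the implementation of [3] or [7]: the dissimilarity is assumed to be $1-\mathbf{w}^{\top}\mathbf{s}$ on the Euclidean unit sphere, the initial centers are supplied by the caller, and the function names are hypothetical. Like any Lloyd-type scheme, it only reaches a local optimum of the objective in (13).

```python
import math

def normalize(v):
    """Project a nonzero vector onto the Euclidean unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def d_cos(w, s):
    # ASSUMPTION: cosine dissimilarity 1 - <w, s> for unit vectors.
    return 1.0 - sum(a * b for a, b in zip(w, s))

def spherical_kmeans(points, init_centers, n_iter=50):
    """Lloyd-type heuristic: alternate nearest-center assignment and center update."""
    centers = [normalize(c) for c in init_centers]
    labels = None
    for _ in range(n_iter):
        # Assignment step: label each point by its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: d_cos(w, centers[j]))
            for w in points
        ]
        if new_labels == labels:  # assignments are stable: stop
            break
        labels = new_labels
        # Update step: for the cosine dissimilarity, the best center of a
        # cluster is the normalized sum of its members.
        for j in range(len(centers)):
            members = [w for w, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = normalize([sum(col) for col in zip(*members)])
    return centers, labels

def cluster_proportions(labels, k):
    """Fraction of points assigned to each of the k clusters."""
    return [labels.count(j) / len(labels) for j in range(k)]

points = [normalize(p) for p in
          [(1.0, 0.1), (1.0, 0.0), (1.0, 0.05), (0.1, 1.0), (0.0, 1.0)]]
centers, labels = spherical_kmeans(points, [(1.0, 0.2), (0.2, 1.0)])
print(labels, cluster_proportions(labels, 2))
```

In practice one would rerun the iteration from several initializations and keep the run with the smallest objective value, since the theory above assumes that an exact minimizer is available.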

2.3 Spherical clustering for multivariate extremes

We follow [20] and [7] to relate the spherical clustering to the analysis of multivariate extremes. Suppose that $\left(\mathbf{X}_{1},\ldots,\mathbf{X}_{n}\right)$, $n\in\mathbb{Z}_{+}$, are i.i.d. samples of $\mathbf{X}$, which is marginally standardized and regularly varying on $\mathbb{E}_{d}$ with spectral measure $H$ on $\mathbb{S}_{+}^{d-1}$ as assumed in Section 2.1. We shall also follow the notation introduced in the same section. Let $\ell_{n}$ be an intermediate sequence satisfying $\ell_{n}\rightarrow\infty$ and $\ell_{n}/n\rightarrow 0$ as $n\rightarrow\infty$. Introduce a multiset on $\mathbb{S}_{+}^{d-1}$ representing the extremal subsample:

$$W_{n}=\left\{\mathbf{X}_{i}/\|\mathbf{X}_{i}\|_{(s)}:\ \|\mathbf{X}_{i}\|_{(r)}\geq\left(n/\ell_{n}\right)^{1/\alpha},\ i\in\left\{1,\ldots,n\right\}\right\}. \tag{14}$$

In words, the extremal subsample consists of the sample points with the largest $\|\cdot\|_{(r)}$ norms, projected onto the $\|\cdot\|_{(s)}$-norm sphere $\mathbb{S}_{+}^{d-1}$. The choice of $\ell_{n}$ and the regular variation assumption together imply

$$\mathrm{E}|W_{n}|=n\Pr\left(\|\mathbf{X}_{n}\|_{(r)}\geq\left(n/\ell_{n}\right)^{1/\alpha}\right)\sim\ell_{n}c_{(r)}\rightarrow\infty \tag{15}$$

as $n\rightarrow\infty$, where $c_{(r)}$ is in (6). Notice that a set of the form $\{\mathbf{x}\in\mathbb{E}_{d}:\ \|\mathbf{x}\|_{(r)}\geq x\}$, $x>0$, is always a $\Lambda$-continuity set due to the homogeneity of $\Lambda$. Then by a triangular-array version of the Strong Law of Large Numbers (see, e.g., [18]), we have

$$|W_{n}|/\ell_{n}=\frac{1}{\ell_{n}}\sum_{i=1}^{n}\mathbf{1}\left\{\|\mathbf{X}_{i}\|_{(r)}\geq\left(n/\ell_{n}\right)^{1/\alpha}\right\}\rightarrow\Lambda\left(\{\mathbf{x}\in\mathbb{E}_{d}:\ \|\mathbf{x}\|_{(r)}\geq 1\}\right)=c_{(r)} \tag{16}$$

almost surely as $n\rightarrow\infty$.
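The construction in (14) can be sketched in code as follows. This is a minimal illustration under assumptions not fixed by this section: $\|\cdot\|_{(r)}$ and $\|\cdot\|_{(s)}$ are taken to be the usual $L^{r}$ and $L^{s}$ norms, the sample is assumed to be already marginally standardized with known index $\alpha$, and the function names and default arguments are hypothetical.

```python
import math

def p_norm(x, p):
    """L^p norm of a vector; p = math.inf gives the maximum norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def extremal_subsample(sample, ell_n, alpha=1.0, r=2, s=2):
    """Return W_n as a list (repetitions allowed, as in a multiset).

    Keeps the points whose (r)-norm is at least (n / ell_n)^(1/alpha) and
    projects each of them onto the unit sphere of the (s)-norm.
    ASSUMPTION: sample is marginally standardized with tail index alpha.
    """
    n = len(sample)
    threshold = (n / ell_n) ** (1.0 / alpha)
    return [
        tuple(v / p_norm(x, s) for v in x)
        for x in sample
        if p_norm(x, r) >= threshold
    ]

# Toy data only; a realistic ell_n grows with n while ell_n / n tends to 0.
sample = [(3.0, 4.0), (0.5, 0.5), (10.0, 0.0), (1.0, 1.0)]
W_n = extremal_subsample(sample, ell_n=1, alpha=1.0)
print(W_n)
```

Note that the number of retained points is random; by (16) it is of the order $\ell_{n}c_{(r)}$ rather than exactly $\ell_{n}$.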

Next, define the empirical spectral measure on $\mathbb{S}_{+}^{d-1}$ as

$$H_{n}=\frac{1}{|W_{n}|}\sum_{\mathbf{w}\in W_{n}}\delta_{\mathbf{w}}, \tag{17}$$

where $H_{n}$ is understood as a zero measure if $|W_{n}|=0$. Then we have the following basic consistency result.

Proposition 2.4.

For any $H$-continuity Borel subset $S$ of $\mathbb{S}_{+}^{d-1}$, we have $H_{n}(S)\rightarrow H(S)$ almost surely as $n\rightarrow\infty$.

Proof.

It follows from a triangular-array Strong Law of Large Numbers with the relations (2), (8), (15) and (16). ∎

Now we consider applying the $k$-clustering in Definition 2.2 to the random subsample $W_{n}$. In particular, suppose that $A_{k,n}=\left(\mathbf{a}_{1,n}^{k},\ldots,\mathbf{a}_{k,n}^{k}\right)$ is a multiset on $\mathbb{S}_{+}^{d-1}$ with cardinality $k$, and that $C_{i,n}^{k}$, $i\in\left\{1,\ldots,k\right\}$, are multisets on $\mathbb{S}_{+}^{d-1}$, such that $\left(A_{k,n},\mathfrak{C}_{k,n}=\left\{C_{1,n}^{k},\ldots,C_{k,n}^{k}\right\}\right)$ forms a $k$-clustering of $W_{n}$.

Corollary 2.5.

Suppose $\mathbf{X}$ has a spectral measure of the following form:

$$H=\sum_{i=1}^{k}p_{i}\delta_{\mathbf{a}_{i}}, \tag{18}$$

where $\mathbf{a}_{i}$'s are distinct points on $\mathbb{S}_{+}^{d-1}$, and $p_{i}>0$, $p_{1}+\ldots+p_{k}=1$. Let

$$p_{i,n}^{k}=\frac{|C_{i,n}^{k}|}{|W_{n}|}, \tag{19}$$

if $|W_{n}|>0$, and set $p_{i,n}^{k}$ as $0$ if $|W_{n}|=0$. Then there exist bijections $\pi_{n}:\{1,\ldots,k\}\mapsto\{1,\ldots,k\}$, $n\in\mathbb{Z}_{+}$, such that

$$\mathbf{a}_{\pi_{n}(i),n}^{k}\rightarrow\mathbf{a}_{i},\ \text{ and }\ p_{\pi_{n}(i),n}^{k}\rightarrow p_{i},\quad i\in\{1,\ldots,k\},$$

almost surely.

Proof.

The convergence of $\mathbf{a}_{\pi_{n}(i),n}^{k}$ follows from [20, Theorem 3.1] (stated as convergence in Hausdorff distance between $A_{k,n}$ and $\{\mathbf{a}_{1},\ldots,\mathbf{a}_{k}\}$) and Proposition 2.4 above (see also the discussion in [20, Section 4]). It remains to show the convergence of $p_{\pi_{n}(i),n}^{k}$, $i\in\{1,\ldots,k\}$. Set

$$r_{A}=\sup\left\{r>0:B(\mathbf{a}_{i},r),\ i\in\left\{1,\ldots,k\right\},\text{ are disjoint}\right\}>0. \tag{20}$$

Fix $\epsilon\in(0,r_{A}/3)$. By what has been proved and the continuity of $D^{\dagger}$, at almost every outcome $\omega$ of the sample space $\Omega$, for $n$ sufficiently large, we have $D^{\dagger}(\mathbf{a}_{\pi_{n}(i),n}^{k},\mathbf{a}_{i})<\epsilon$, $i\in\{1,\ldots,k\}$. Fix for now such an $\omega$ and let $n$ be sufficiently large (possibly depending on $\omega$). Then by the triangle inequality (10), we have $B(\mathbf{a}_{i},\epsilon)\subset B\left(\mathbf{a}_{\pi_{n}(i),n}^{k},2\epsilon\right)\subset B(\mathbf{a}_{i},3\epsilon)$, $i\in\{1,\ldots,k\}$.
Note that the sets $B\left(\mathbf{a}_{\pi_{n}(i),n}^{k},2\epsilon\right)\cap W_{n}$, $i\in\{1,\ldots,k\}$, are disjoint. So in view of Definition 2.2, we have $B(\mathbf{a}_{i},\epsilon)\cap W_{n}\subset B\left(\mathbf{a}_{\pi_{n}(i),n}^{k},2\epsilon\right)\cap W_{n}\subset C_{\pi_{n}(i),n}^{k}\subset\left(B(\mathbf{a}_{i},\epsilon)\cup\bigcap_{j\neq i}B(\mathbf{a}_{j},\epsilon)^{c}\right)\cap W_{n}$, and hence

$$H_{n}\left(B(\mathbf{a}_{i},\epsilon)\right)\leq p_{\pi_{n}(i),n}^{k}\leq H_{n}\left(B(\mathbf{a}_{i},\epsilon)\cup\bigcap_{j\neq i}B(\mathbf{a}_{j},\epsilon)^{c}\right),\quad i\in\{1,\ldots,k\}.$$

The conclusion then follows from Proposition 2.4 since both sides above converge almost surely to $p_{i}$ as $n\rightarrow\infty$. ∎
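For the reader's convenience, the limits of the two bounds in the preceding proof can be spelled out. This is a sketch that assumes $B(\mathbf{a},r)$ denotes the open $D^{\dagger}$-ball of radius $r$ centered at $\mathbf{a}$, as defined earlier in the paper, so that $\mathbf{a}_{j}\in B(\mathbf{a}_{j},\epsilon)$ for every $j$. Since $\epsilon<r_{A}$, the balls $B(\mathbf{a}_{j},\epsilon)$, $j\in\{1,\ldots,k\}$, are disjoint, so each of the two sets contains the atom $\mathbf{a}_{i}$ and no other atom of $H$ in (18). Writing $U_{i}=B(\mathbf{a}_{i},\epsilon)\cup\bigcap_{j\neq i}B(\mathbf{a}_{j},\epsilon)^{c}$, we get

$$
\begin{aligned}
H\left(B(\mathbf{a}_{i},\epsilon)\right)&=\sum_{j=1}^{k}p_{j}\,\mathbf{1}\left\{\mathbf{a}_{j}\in B(\mathbf{a}_{i},\epsilon)\right\}=p_{i},\\
H\left(U_{i}\right)&=\sum_{j=1}^{k}p_{j}\,\mathbf{1}\left\{\mathbf{a}_{j}\in U_{i}\right\}=p_{i},
\end{aligned}
$$

where the second line uses that, for $j\neq i$, $\mathbf{a}_{j}$ belongs to $B(\mathbf{a}_{j},\epsilon)$ and not to $B(\mathbf{a}_{i},\epsilon)$, hence $\mathbf{a}_{j}\notin U_{i}$. Proposition 2.4 applies to both sets provided they are $H$-continuity sets, that is, provided no atom lies on their boundaries.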

Remark 2.6.

Comparing [20, Proposition 3.3] with Proposition 2.4 and Corollary 2.5 here, we have chosen to work directly under the marginal standardization assumption in (1) and not to treat the empirical marginal transformations as in [20, Eq. (3.5)] for simplicity. Nevertheless, the consistency result of the order selection below (Theorem 3.1) can be extended to the setup of [20] based on the results there.

3 Order selection via penalized silhouette

3.1 The method

Following the notation and setup in Section 2.2, suppose $W$ is a multiset on $\mathbb{S}_{+}^{d-1}$ and $1\leq k\leq|W|<\infty$. Let $\left(A_{k}^{*}=\left\{\mathbf{a}_{1}^{*},\ldots,\mathbf{a}_{k}^{*}\right\},\mathfrak{C}_{k}=\left\{C_{1},\ldots,C_{k}\right\}\right)$ be a $k$-clustering of $W$ with respect to a dissimilarity measure $D$ as in Definition 2.2. For $\mathbf{w}\in W$, define

$$a(\mathbf{w})=D\left(\mathbf{w},A_{k}^{*}\right),\quad\text{and}\quad b(\mathbf{w})=\bigvee_{i=1}^{k}D\left(\mathbf{w},A_{k}^{*}\setminus\left\{\mathbf{a}_{i}^{*}\right\}\right),$$

which are respectively the dissimilarities of $\mathbf{w}$ to the closest center (i.e., the center of the cluster it belongs to) and to the second closest center. When $k=1$, we understand $b(\mathbf{w})=1$. The (simplified) average silhouette width (ASW) [17] of this $k$-clustering is then defined as

$$\bar{S}=\bar{S}\left(W;A_{k}^{*}\right)=\frac{1}{|W|}\sum_{\mathbf{w}\in W}\frac{b(\mathbf{w})-a(\mathbf{w})}{b(\mathbf{w})}=1-\frac{1}{|W|}\sum_{\mathbf{w}\in W}\frac{a(\mathbf{w})}{b(\mathbf{w})}. \tag{21}$$

A well-clustered dataset is expected to have small $a(\mathbf{w})$ values relative to $b(\mathbf{w})$ for the majority of points $\mathbf{w}$. Hence, one often uses $\bar{S}$ to guide the selection of the number of clusters, that is, one chooses the $k$ that maximizes $\bar{S}$. However, in our experiments applying the ASW to multivariate extremes with a discrete spectral measure as described in Section 2.3, its performance is unsatisfactory: it tends to be insensitive when the number of clusters exceeds the true $k$, i.e., the number of atoms of the spectral measure; see, for example, the curve corresponding to $t=0$ in Figure 1. In particular, we observe two behaviors of the ASW that lead to this issue: 1) it tends to treat a tiny fraction of isolated points as a cluster; 2) it sometimes splits a single cluster center into multiple centers that are close to each other.

Motivated by these observations, we propose to introduce a penalty term that discourages small cluster sizes and small dissimilarities $D$ between nearest centers. There is arguably some arbitrariness in the choice of this penalty. Through some mathematical heuristics and extensive experiments, we find that the following penalty works relatively well. Let $t\geq 0$ be a tuning parameter. Set

$$P_{t}=P_{t}\left(W;A_{k}^{*},\mathfrak{C}_{k}\right)=1-\left(\min_{i=1,\ldots,k}\left(\frac{|C_{i}|}{|W|/k}\right)\right)^{t}\left(\min_{1\leq i<j\leq k}D\left(\mathbf{a}_{i}^{*},\mathbf{a}_{j}^{*}\right)\right)^{t}, \tag{22}$$

where $\min_{1\leq i<j\leq k}D\left(\mathbf{a}_{i}^{*},\mathbf{a}_{j}^{*}\right)$ is understood as $1$ when $k=1$. Note that $\min_{1\leq i\leq k}|C_{i}|\leq|W|/k$, which explains the normalization in the first denominator above. Recall also that $D$ maps surjectively onto $[0,1]$. Then we form the penalized ASW defined by

$$S_{t}=S_{t}\left(W;A_{k}^{*},\mathfrak{C}_{k}\right)=\bar{S}-P_{t}=\left(\min_{i=1,\ldots,k}\left(\frac{|C_{i}|}{|W|/k}\right)\right)^{t}\left(\min_{1\leq i<j\leq k}D\left(\mathbf{a}_{i}^{*},\mathbf{a}_{j}^{*}\right)\right)^{t}-\frac{1}{|W|}\sum_{\mathbf{w}\in W}\frac{a(\mathbf{w})}{b(\mathbf{w})}. \tag{23}$$

Notice that when $t=0$, we have $P_{0}=0$ and hence $S_{0}=\bar{S}$. As $t>0$ increases, the penalty $P_{t}$ increases and hence $S_{t}$ decreases.
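As a minimal sketch of definitions (21) to (23), the following Python function evaluates the penalized ASW from given points, centers, and cluster labels. It assumes the dissimilarity $D(\mathbf{x},\mathbf{y})=1-\mathbf{x}^{\top}\mathbf{y}$ on unit vectors with nonnegative entries, which takes values in $[0,1]$ but is only one possible choice and is not fixed by this section. The names `dissim` and `penalized_asw`, and the convention used when $b(\mathbf{w})=0$, are illustrative assumptions.

```python
def dissim(x, y):
    # Assumed dissimilarity D(x, y) = 1 - <x, y> for unit vectors with
    # nonnegative entries; clamped at 0 to absorb rounding error.
    return max(0.0, 1.0 - sum(xi * yi for xi, yi in zip(x, y)))


def penalized_asw(points, centers, labels, t):
    """Penalized ASW S_t = S_bar - P_t, following (21)-(23)."""
    n, k = len(points), len(centers)

    # Average of a(w) / b(w): a is the dissimilarity to the closest center,
    # b the dissimilarity to the second closest one (b = 1 when k = 1).
    ratio_sum = 0.0
    for w in points:
        d = sorted(dissim(w, c) for c in centers)
        a = d[0]
        b = d[1] if k > 1 else 1.0
        # Assumed convention: if b == 0 the point gets silhouette 0.
        ratio_sum += a / b if b > 0 else 1.0
    asw = 1.0 - ratio_sum / n

    # First penalty factor: smallest cluster size relative to |W| / k.
    size_factor = min(labels.count(i) for i in range(k)) / (n / k)

    # Second penalty factor: smallest dissimilarity between two centers
    # (understood as 1 when k = 1).
    separation = min(
        (dissim(centers[i], centers[j])
         for i in range(k) for j in range(i + 1, k)),
        default=1.0,
    )

    penalty = 1.0 - (size_factor ** t) * (separation ** t)
    return asw - penalty
```

For example, two perfectly separated clusters of sizes 3 and 1 have $\bar{S}=1$, while for $t=1$ the size factor $1/(4/2)=0.5$ yields $P_{1}=0.5$ and hence $S_{1}=0.5$.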

The following consistency result concerns order selection via the penalized ASW for a multivariate extreme model with a discrete spectral measure. We follow the notation in Section 2.3.

Theorem 3.1.

Suppose that the assumption in Corollary 2.5 holds, where the true order is $k\in\mathbb{Z}_{+}$ in the discrete spectral measure $H=\sum_{i=1}^{k}p_{i}\delta_{\mathbf{a}_{i}}$ in (18). Let $r_{A}$ be defined as in (20) and define

$$p_{\min}=\min_{1\leq i\leq k}p_{i}. \tag{24}$$

Suppose $(A_{m,n},\mathfrak{C}_{m,n})$ is an $m$-clustering of $W_{n}$, $m\in\mathbb{Z}_{+}$. Then for any $t\in(0,t_{0})$, where $t_{0}:=\ln\left(1-r_{A}p_{\min}\right)/\ln\left(r_{A}kp_{\min}\right)$, we have

$$\liminf_{n}\left\{S_{t}\left(W_{n};A_{k,n},\mathfrak{C}_{k,n}\right)-S_{t}\left(W_{n};A_{m,n},\mathfrak{C}_{m,n}\right)\right\}\geq\Delta_{t}$$

almost surely for any $m\neq k$, where $\Delta_{t}:=\left(r_{A}kp_{\min}\right)^{t}-1+r_{A}p_{\min}>0$ when $t\in(0,t_{0})$.

The theorem implies that as long as the tuning parameter is in an appropriate range, with probability tending to $1$ as $n\rightarrow\infty$, the true order $m=k$ uniquely maximizes the penalized ASW. The proof of Theorem 3.1 can be found in Section 3.2. In Proposition 4.5 below, we will provide a rate at which the probability of false order selection decays to zero.
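To make the admissible range of the tuning parameter concrete, the sketch below evaluates $t_{0}$ and $\Delta_{t}$ from Theorem 3.1. Since $0<r_{A}kp_{\min}<1$, the quantity $\Delta_{t}$ is decreasing in $t$, equals $r_{A}p_{\min}$ at $t=0$, and vanishes exactly at $t=t_{0}$. The values $k=6$ and $p_{\min}=0.09$ are those of the spectral measure in Figure 1, whereas $r_{A}=0.1$ is a purely hypothetical value, because $r_{A}$ depends on (20) and on the dissimilarity $D$; the function names are illustrative.

```python
import math


def t_upper(r_a, k, p_min):
    """Upper end t_0 of the admissible range in Theorem 3.1."""
    x = r_a * k * p_min
    if not 0.0 < x < 1.0:
        raise ValueError("need 0 < r_A * k * p_min < 1")
    return math.log(1.0 - r_a * p_min) / math.log(x)


def gap(t, r_a, k, p_min):
    """Asymptotic gap Delta_t between the true order and any other order."""
    return (r_a * k * p_min) ** t - 1.0 + r_a * p_min


if __name__ == "__main__":
    r_a, k, p_min = 0.1, 6, 0.09  # r_a is a hypothetical value
    t0 = t_upper(r_a, k, p_min)
    for t in (0.0, 0.5 * t0, t0):
        print(f"t = {t:.6f}, Delta_t = {gap(t, r_a, k, p_min):.6f}")
```

When $r_{A}p_{\min}$ is small, $t_{0}$ is small as well, which is in line with the recommendation to explore small values of $t$.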

In practice, we suggest plotting the penalized ASW $S_{t}$ as a function of $m=1,2,\ldots$, for a range of small $t$ values starting from $t=0$. We increase $t$ until the curves start to show an obvious upward bend. We then identify the turning point $m$ as the choice of the order $k$. As a quick illustration, we follow the simulation setup $(d=6,k=6)$ described in Section 6.1 below to simulate a max-linear factor model (Section 5.1); see Figure 1. It would be desirable to develop a fully automatic data-driven method for choosing $t$, which we leave for future work.

Fig. 1: A simulation instance taken from the $d=6$, $k=6$ setup of Section 6.1. The penalized average silhouette width (ASW) $S_{t}$ (vertical axis) for spherical $k$-means clustering is plotted as a function of the test order $m$ (horizontal axis). Different values of the penalty parameter $t$ are shown in different colors. The true discrete spectral measure in (2.5) is given by $(\mathbf{a}_{1},p_{1})=((0.29,0.21,0.50,0.45,0.43,0.49)^{\top},0.22)$, $(\mathbf{a}_{2},p_{2})=((0.74,0.00,0.59,0.00,0.32,0.00)^{\top},0.10)$, $(\mathbf{a}_{3},p_{3})=((0.00,0.27,0.00,0.47,0.00,0.84)^{\top},0.13)$, $(\mathbf{a}_{4},p_{4})=((0.33,0.70,0.63,0.00,0.00,0.00)^{\top},0.14)$, $(\mathbf{a}_{5},p_{5})=((0.00,0.00,0.00,0.81,0.47,0.34)^{\top},0.09)$, $(\mathbf{a}_{6},p_{6})=((0.48,0.49,0.25,0.33,0.53,0.29)^{\top},0.32)$.
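The scanning procedure suggested above (compute $S_{t}$ for $m=1,2,\ldots$ and several small $t$) can be sketched as follows. This is a toy illustration under stated assumptions, not the setup of Section 6.1: the dissimilarity is taken to be $1-\mathbf{x}^{\top}\mathbf{y}$, the spherical $k$-means routine uses a deterministic farthest-point initialization chosen only for reproducibility, and `demo_points` builds hypothetical data with three tight groups near the coordinate axes of $\mathbb{S}_{+}^{2}$.

```python
import math


def normalize(v):
    s = math.sqrt(sum(x * x for x in v))
    return [x / s for x in v]


def dissim(x, y):
    # Assumed dissimilarity 1 - <x, y>, clamped at 0 against rounding error.
    return max(0.0, 1.0 - sum(a * b for a, b in zip(x, y)))


def assign(points, centers):
    # Label each point by the index of its closest center.
    return [min(range(len(centers)), key=lambda j: dissim(w, centers[j]))
            for w in points]


def spherical_kmeans(points, m, iters=50):
    # Farthest-point initialization (illustrative, deterministic).
    centers = [list(points[0])]
    while len(centers) < m:
        far = max(points, key=lambda w: min(dissim(w, c) for c in centers))
        centers.append(list(far))
    for _ in range(iters):
        labels = assign(points, centers)
        for j in range(m):
            members = [w for w, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = normalize([sum(col) for col in zip(*members)])
    return centers, assign(points, centers)


def penalized_asw(points, centers, labels, t):
    n, k = len(points), len(centers)
    total = 0.0
    for w in points:
        d = sorted(dissim(w, c) for c in centers)
        b = d[1] if k > 1 else 1.0
        total += d[0] / b if b > 0 else 1.0  # assumed convention for b == 0
    size = min(labels.count(j) for j in range(k)) / (n / k)
    sep = min((dissim(centers[i], centers[j])
               for i in range(k) for j in range(i + 1, k)), default=1.0)
    return (size ** t) * (sep ** t) - total / n


def scan_orders(points, max_m, t):
    # Penalized ASW for each test order m = 1, ..., max_m.
    scores = {}
    for m in range(1, max_m + 1):
        centers, labels = spherical_kmeans(points, m)
        scores[m] = penalized_asw(points, centers, labels, t)
    return scores


def demo_points(per_cluster=20, noise=0.05):
    # Hypothetical data: three tight groups around the coordinate axes.
    pts = []
    for axis in range(3):
        for i in range(per_cluster):
            jitter = [(i % 5) / 5, (i * 3 % 7) / 7, (i * 5 % 11) / 11]
            pts.append(normalize([(1.0 if c == axis else 0.0) + noise * jitter[c]
                                  for c in range(3)]))
    return pts


if __name__ == "__main__":
    for t in (0.0, 0.05, 0.1):
        print(t, scan_orders(demo_points(), 5, t))
```

On this toy configuration the score at `t = 0.1` peaks at `m = 3`, because any clustering with more centers places two of them inside the same tight group and is penalized through the center-separation factor. Real data would call for a proper spherical $k$-means implementation with multiple restarts.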

3.2 Consistency of order selection via penalized silhouette

In this section, we prove Theorem 3.1.

3.2.1 Some deterministic estimates

We first prepare some deterministic estimates regarding the $k$-clustering in Definition 2.2 and the ASW. We shall need the setup in Assumption 2 below in the subsequent developments.

Assumption 2.

Suppose $D$ is a dissimilarity measure that satisfies Assumption 1 and $W$ is a multiset on $\mathbb{S}_{+}^{d-1}$ with $|W|<\infty$. Let $A=\left\{\mathbf{a}_{1},\ldots,\mathbf{a}_{k}\right\}$ be a set of $k$ distinct points on $\mathbb{S}_{+}^{d-1}$ and $p_{i}>0$ with $p_{1}+\ldots+p_{k}=1$, $k\in\mathbb{Z}_{+}$, $k\leq|W|$. Let $\epsilon\in(0,r_{A})$ with $r_{A}$ in (20), and $\delta\in\left(0,p_{\min}\right)$, with $p_{\min}$ in (24), and suppose

$$\frac{|W\cap B(\mathbf{a}_{i},\epsilon)|}{|W|}\geq p_{i}-\delta,\quad i\in\left\{1,\ldots,k\right\}. \tag{25}$$

For $s>0$, set

$$r_{A}^{\dagger}(s)=\sup\left\{D^{\dagger}(\mathbf{a}_{i},\mathbf{w}):i\in\left\{1,\ldots,k\right\},\ \mathbf{w}\in B(\mathbf{a}_{i},s)\right\}. \tag{26}$$

Note that $r_{A}^{\dagger}(s)>0$ for any $s>0$, and $r_{A}^{\dagger}(s)\rightarrow 0$ as $s\rightarrow 0$; see Remark 2.1. The next lemma provides an upper bound for the ASW $\bar{S}$ in (21) when the number of clusters is less than $k$ in Assumption 2.

Lemma 3.2.

Suppose Assumptions 1 and 2 hold and $1\leq m<k$. Let $(A^{*}_{m},\mathfrak{C}_{m})$ be an $m$-clustering of $W$. Assume in addition $r_{A}^{\dagger}(\epsilon)<r_{A}$. Then the (unpenalized) ASW $\bar{S}$ satisfies

$$\bar{S}=\bar{S}(W;A^{*}_{m},\mathfrak{C}_{m})\leq 1-\left(p_{\min}-\delta\right)\left(r_{A}-r_{A}^{\dagger}(\epsilon)\right).$$
Proof.

Since $m<k$ and the balls $B(\mathbf{a}_{i},r_{A})$, $i\in\left\{1,\ldots,k\right\}$, are disjoint, there exists $\ell\in\{1,\ldots,k\}$ such that $B(\mathbf{a}_{\ell},r_{A})\cap A_{m}^{*}=\emptyset$. Hence for any $\mathbf{w}\in W\cap B(\mathbf{a}_{\ell},\epsilon)$, we have by the triangle inequality (10) that

$$a(\mathbf{w})=D\left(\mathbf{w},A_{m}^{*}\right)\geq D\left(\mathbf{a}_{\ell},A_{m}^{*}\right)-D^{\dagger}\left(\mathbf{w},\mathbf{a}_{\ell}\right)\geq r_{A}-r_{A}^{\dagger}(\epsilon).$$

Then since $b(\mathbf{w})\leq 1$, we have

$$\frac{1}{|W|}\sum_{\mathbf{w}\in W}\frac{a(\mathbf{w})}{b(\mathbf{w})}\geq\frac{1}{|W|}\sum_{\mathbf{w}\in W}a(\mathbf{w})\geq\frac{|W\cap B(\mathbf{a}_{\ell},\epsilon)|}{|W|}\left(r_{A}-r_{A}^{\dagger}(\epsilon)\right)\geq\left(p_{\min}-\delta\right)\left(r_{A}-r_{A}^{\dagger}(\epsilon)\right),$$

which implies the desired result. ∎

The next lemma states that when the number of clusters is at least $k$, there will exist at least $k$ centers which are close to $A$ in Assumption 2.

Lemma 3.3.

Suppose Assumptions 1 and 2 hold. Let $\left(A^{*}_{m}=\{\mathbf{a}_{1}^{*},\ldots,\mathbf{a}_{m}^{*}\},\mathfrak{C}_{m}\right)$ be an $m$-clustering of $W$, $m\geq k$. Then for any $i\in\{1,\ldots,k\}$, there exists $j\in\{1,\ldots,m\}$, such that $D\left(\mathbf{a}_{j}^{*},\mathbf{a}_{i}\right)<\epsilon^{\prime}$, where

$$\epsilon^{\prime}=\epsilon^{\prime}(\epsilon,\delta)=\frac{(1-k\delta)\epsilon+k\delta}{p_{\min}-\delta}+r_{A}^{\dagger}(\epsilon). \tag{27}$$

In particular, when $m=k$, and $\epsilon$ and $\delta$ are small enough so that $\epsilon^{\prime}<r_{A}$, there exists a bijection $\pi:\{1,\ldots,k\}\mapsto\{1,\ldots,k\}$, such that $D\left(\mathbf{a}_{\pi(i)}^{*},\mathbf{a}_{i}\right)<\epsilon^{\prime}$ for all $i\in\{1,\ldots,k\}$.

Proof.

We argue by contradiction. Suppose there exists $i\in\{1,\ldots,k\}$ such that $D\left(\mathbf{a}_{j}^{*},\mathbf{a}_{i}\right)\geq\epsilon^{\prime}$ for all $j\in\{1,\ldots,m\}$. Then for any $\mathbf{w}\in W\cap B(\mathbf{a}_{i},\epsilon)$, we have by the triangle inequality (10) that

$$D\left(\mathbf{w},A_{m}^{*}\right)\geq D\left(\mathbf{a}_{i},A_{m}^{*}\right)-D^{\dagger}\left(\mathbf{w},\mathbf{a}_{i}\right)\geq\epsilon^{\prime}-r_{A}^{\dagger}(\epsilon).$$

Hence combining this and Assumption 2,

$$\frac{1}{|W|}\sum_{\mathbf{w}\in W}D\left(\mathbf{w},A_{m}^{*}\right)\geq\frac{1}{|W|}\sum_{\mathbf{w}\in W\cap B(\mathbf{a}_{i},\epsilon)}D\left(\mathbf{w},A_{m}^{*}\right)\geq\left(p_{i}-\delta\right)\left(\epsilon^{\prime}-r_{A}^{\dagger}(\epsilon)\right)\geq\left(p_{\min}-\delta\right)\left(\epsilon^{\prime}-r_{A}^{\dagger}(\epsilon)\right). \tag{28}$$

Next, suppose that a multiset $S$ on $\mathbb{S}_{+}^{d-1}$ contains $A$ and $|S|=m$, which is only possible when $m\geq k$ as assumed. Then we have $D(\mathbf{w},S)\leq D(\mathbf{w},A)$. Setting $U_{\epsilon}:=W\cap\left(\cup_{i=1}^{k}B(\mathbf{a}_{i},\epsilon)\right)$, we have that

\[
\frac{1}{|W|}\sum_{\mathbf{w}\in W}D(\mathbf{w},S)\leq\frac{1}{|W|}\left(\sum_{\mathbf{w}\in U_{\epsilon}}D(\mathbf{w},A)+\sum_{\mathbf{w}\in W\setminus U_{\epsilon}}1\right)<\left(1-k\delta\right)\epsilon+k\delta, \tag{29}
\]

where the last inequality is obtained by maximizing $|W\setminus U_{\epsilon}|$ under the constraint (25). Now in view of (13), the first expression in (28) is less than or equal to the first expression in (29), and hence these two inequalities imply

\[
\epsilon^{\prime}<\left\{(1-k\delta)\epsilon+k\delta\right\}/\left(p_{\min}-\delta\right)+r_{A}^{\dagger}(\epsilon),
\]

which contradicts the choice of $\epsilon^{\prime}$. ∎
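The last inequality in (29) can be unpacked as follows. This is only a sketch, and it relies on assumptions not restated in this part of the text: that the constraint (25) reads $|W\cap B(\mathbf{a}_{i},\epsilon)|\geq|W|(p_{i}-\delta)$ for each $i$ (the same form as the event (33)), that the balls $B(\mathbf{a}_{i},\epsilon)$ are disjoint, that $\sum_{i=1}^{k}p_{i}=1$, that $D\leq 1$, that $k\delta<1$, and that $D(\mathbf{w},A)<\epsilon\leq 1$ for $\mathbf{w}\in U_{\epsilon}$. The symbol $u:=|U_{\epsilon}|/|W|$ is introduced here for illustration only.

\[
u\geq\sum_{i=1}^{k}(p_{i}-\delta)=1-k\delta,
\]

\[
\frac{1}{|W|}\left(\sum_{\mathbf{w}\in U_{\epsilon}}D(\mathbf{w},A)+\sum_{\mathbf{w}\in W\setminus U_{\epsilon}}1\right)
<u\,\epsilon+(1-u)
=\epsilon+(1-u)(1-\epsilon)
\leq\epsilon+k\delta(1-\epsilon)
=(1-k\delta)\epsilon+k\delta.
\]

Under these assumptions the bound is nondecreasing in $1-u$, so the worst case is the largest admissible value of $|W\setminus U_{\epsilon}|$, namely $k\delta|W|$.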

As a consequence of the previous lemma, when the number of clusters exceeds $k$, either some cluster has a small size or at least two centers are close to each other, as articulated in the next lemma.

Lemma 3.4.

Suppose Assumptions 1 and 2 hold. Assume additionally that $\delta$ and $\epsilon$ in Assumption 2 are small enough so that $\epsilon^{\prime}$ in (27) satisfies $\epsilon^{\prime}<r_{A}$. Let $\left(A^{*}_{m}=\{\mathbf{a}_{1}^{*},\ldots,\mathbf{a}_{m}^{*}\},\mathfrak{C}_{m}=\{C_{1},\ldots,C_{m}\}\right)$ be an $m$-clustering of $W$, $m>k$. Then at least one of the following holds:

\[
\min_{i=1,\ldots,m}\frac{|C_{i}|}{|W|}\leq k\delta\quad\text{ or }\quad\min_{1\leq i<j\leq m}D\left(\mathbf{a}_{i}^{*},\mathbf{a}_{j}^{*}\right)\leq\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)+r_{A}^{\dagger}(\epsilon^{\prime}).
\]
Proof.

Since $B\left(\mathbf{a}_{i},\epsilon^{\prime}\right)$, $i\in\left\{1,\ldots,k\right\}$, are disjoint (because $\epsilon^{\prime}<r_{A}$), by Lemma 3.3, we can, without loss of generality, assume that $\mathbf{a}_{i}^{*}\in B\left(\mathbf{a}_{i},\epsilon^{\prime}\right)$, $i\in\left\{1,\ldots,k\right\}$. We now divide into two cases as follows.

Case 1: there exists one $j\in\{k+1,\ldots,m\}$ (fixed below in the discussion of this case) which satisfies $D\left(\mathbf{a}_{j}^{*},A\right)>\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)$. Then for any $i\in\{1,\ldots,k\}$ and any $\mathbf{w}\in W\cap B\left(\mathbf{a}_{i},\epsilon\right)$, we have by the triangular inequality (10) that $D\left(\mathbf{w},\mathbf{a}_{i}^{*}\right)\leq D\left(\mathbf{a}_{i},\mathbf{a}_{i}^{*}\right)+D^{\dagger}\left(\mathbf{a}_{i},\mathbf{w}\right)\leq\epsilon^{\prime}+r_{A}^{\dagger}(\epsilon)$, and hence

\[
D\left(\mathbf{w},\mathbf{a}_{j}^{*}\right)\geq D\left(\mathbf{a}_{j}^{*},\mathbf{a}_{i}\right)-D^{\dagger}\left(\mathbf{w},\mathbf{a}_{i}\right)>\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)-r_{A}^{\dagger}(\epsilon)\geq D\left(\mathbf{w},\mathbf{a}_{i}^{*}\right).
\]

This in view of Definition 2.2 implies that $W\cap B\left(\mathbf{a}_{i},\epsilon\right)\subset W\cap C_{j}^{c}$ for all $i\in\left\{1,\ldots,k\right\}$. Therefore, we have by Assumption 2 that

\[
\min_{i=1,\ldots,m}\frac{|C_{i}|}{|W|}\leq\frac{|C_{j}|}{|W|}\leq\frac{\left|W\cap\bigcap_{i=1,\ldots,k}B\left(\mathbf{a}_{i},\epsilon\right)^{c}\right|}{|W|}\leq k\delta.
\]

Case 2: for any $j\in\{k+1,\ldots,m\}$, we have $D\left(\mathbf{a}_{j}^{*},\mathbf{a}_{i}\right)\leq\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)$ for some $i\in\{1,\ldots,k\}$. Then for any such pair of $j$ and $i$, we have

\[
D\left(\mathbf{a}_{i}^{*},\mathbf{a}_{j}^{*}\right)\leq D\left(\mathbf{a}_{i},\mathbf{a}_{j}^{*}\right)+D^{\dagger}(\mathbf{a}_{i},\mathbf{a}_{i}^{*})\leq\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)+r_{A}^{\dagger}(\epsilon^{\prime}).
\]
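The two quantities compared in Lemma 3.4 are easy to compute for a given set of centers. The sketch below does so on a toy sample, under assumptions that are not taken from the text: the dissimilarity is taken to be $D(\mathbf{x},\mathbf{y})=1-\langle\mathbf{x},\mathbf{y}\rangle$, clusters are formed by nearest-center assignment, and the centers are hand-picked rather than obtained by minimizing a clustering objective. The helper names `dissimilarity`, `lemma_quantities` and `on_circle` are hypothetical. The code therefore illustrates the two alternatives of the lemma and does not verify the lemma itself.

```python
import math


def dissimilarity(x, y):
    """Assumed D(x, y) = 1 - <x, y> for unit vectors (hypothetical choice of D)."""
    return max(0.0, 1.0 - sum(a * b for a, b in zip(x, y)))


def lemma_quantities(points, centers):
    """Return (smallest cluster fraction, smallest center-to-center dissimilarity)."""
    counts = [0] * len(centers)
    for w in points:
        # Nearest-center assignment (an assumed reading of the clustering rule).
        nearest = min(range(len(centers)), key=lambda i: dissimilarity(w, centers[i]))
        counts[nearest] += 1
    min_fraction = min(counts) / len(points)
    min_gap = min(
        dissimilarity(centers[i], centers[j])
        for i in range(len(centers))
        for j in range(i + 1, len(centers))
    )
    return min_fraction, min_gap


def on_circle(angle):
    """Unit vector in the non-negative quadrant at the given angle (radians)."""
    return (math.cos(angle), math.sin(angle))


# Toy sample concentrated around k = 2 directions (angles near 0, and pi/2).
points = [on_circle(0.0)] * 25 + [on_circle(0.1)] * 25 + [on_circle(math.pi / 2)] * 50

# m = 3 centers with the extra center far from both directions: it attracts no points.
far = [on_circle(0.05), on_circle(math.pi / 4), on_circle(math.pi / 2)]
# m = 3 centers with the extra center splitting one direction: two centers are close.
split = [on_circle(0.0), on_circle(0.1), on_circle(math.pi / 2)]

frac_far, gap_far = lemma_quantities(points, far)
frac_split, gap_split = lemma_quantities(points, split)
print(frac_far, gap_far)
print(frac_split, gap_split)
```

In the first configuration the smallest cluster is empty while the centers stay well separated. In the second, all clusters are sizeable but two centers nearly coincide. These correspond to the first and second alternatives in the lemma, respectively.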

In the scenario where the number of clusters matches the specified value $k$ in Assumption 2, the next lemma establishes a lower bound for the (unpenalized) ASW $\bar{S}$. Furthermore, it provides lower bounds for both the sizes of individual clusters and the dissimilarities between cluster centers.

Lemma 3.5.

Suppose Assumptions 1 and 2 hold. Let $\left(A^{*}_{k}=\{\mathbf{a}_{1}^{*},\ldots,\mathbf{a}_{k}^{*}\},\mathfrak{C}_{k}=\{C_{1},\ldots,C_{k}\}\right)$ be a $k$-clustering of $W$. Suppose in addition

\[
r_{A}>\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)+r_{A}^{\dagger}(\epsilon^{\prime}) \tag{30}
\]

with $\epsilon^{\prime}$ in (27). Then the (unpenalized) ASW $\bar{S}$ satisfies

\[
\bar{S}=\bar{S}(W_{n};A^{*}_{k},\mathfrak{C}_{k})\geq 1-\left(1-k\delta\right)\frac{\epsilon^{\prime}+r_{A}^{\dagger}(\epsilon)}{r_{A}-r_{A}^{\dagger}(\epsilon)-r_{A}^{\dagger}(\epsilon^{\prime})}-k\delta.
\]

In addition, with the same permutation $\pi:\{1,\ldots,k\}\mapsto\{1,\ldots,k\}$ found in Lemma 3.3, we have

\[
\frac{|C_{\pi(i)}|}{|W|}\geq p_{i}-\delta\text{ for each }i\quad\text{ and }\quad\min_{1\leq i<j\leq k}D\left(\mathbf{a}_{i}^{*},\mathbf{a}_{j}^{*}\right)\geq r_{A}-2r_{A}^{\dagger}(\epsilon^{\prime}),
\]

where, when $k=1$, $\min_{1\leq i<j\leq k}D\left(\mathbf{a}_{i}^{*},\mathbf{a}_{j}^{*}\right)$ is understood as 1, and the inequalities still hold.

Proof.

Since $r_{A}>\epsilon^{\prime}$, by Lemma 3.3, there exists a permutation $\pi:\{1,\ldots,k\}\mapsto\{1,\ldots,k\}$ such that $D\left(\mathbf{a}_{i},\mathbf{a}_{\pi(i)}^{*}\right)<\epsilon^{\prime}$, $i\in\left\{1,\ldots,k\right\}$. Then for each $i$ and any $\mathbf{w}\in B(\mathbf{a}_{i},\epsilon)$, we have by the triangular inequality (10) that

\[
D\left(\mathbf{w},\mathbf{a}_{\pi(i)}^{*}\right)\leq D\left(\mathbf{a}_{i},\mathbf{a}_{\pi(i)}^{*}\right)+D^{\dagger}\left(\mathbf{w},\mathbf{a}_{i}\right)<\epsilon^{\prime}+r_{A}^{\dagger}(\epsilon), \tag{31}
\]

and for $j\neq i$ that

\[
D\left(\mathbf{w},\mathbf{a}_{\pi(j)}^{*}\right)\geq D\left(\mathbf{a}_{i},\mathbf{a}_{j}\right)-D^{\dagger}(\mathbf{a}_{j},\mathbf{a}_{\pi(j)}^{*})-D^{\dagger}(\mathbf{w},\mathbf{a}_{i})\geq r_{A}-r_{A}^{\dagger}(\epsilon^{\prime})-r_{A}^{\dagger}(\epsilon), \tag{32}
\]

where if $k=1$, the left-hand side $D\left(\mathbf{w},\mathbf{a}_{\pi(j)}^{*}\right)$ in (32) is understood as 1, and the inequality still holds. Write as before $U_{\epsilon}=\bigcup_{1\leq i\leq k}B(\mathbf{a}_{i},\epsilon)\cap W$. In view of Assumption 2 and the inequalities above, we have

\[
\bar{S}=\frac{1}{|W|}\left(\sum_{\mathbf{w}\in U_{\epsilon}}+\sum_{\mathbf{w}\in W\setminus U_{\epsilon}}\right)\left(1-\frac{a(\mathbf{w})}{b(\mathbf{w})}\right)\geq\left(1-\frac{\epsilon^{\prime}+r_{A}^{\dagger}(\epsilon)}{r_{A}-r_{A}^{\dagger}(\epsilon)-r_{A}^{\dagger}(\epsilon^{\prime})}\right)\left(1-k\delta\right)+0,
\]

which implies the first claim.

For the second claim, in view of Definition 2.2, (30), (31), (32), we have $W\cap B\left(\mathbf{a}_{i},\epsilon\right)\subset C_{\pi(i)}$, $i\in\left\{1,\ldots,k\right\}$. Hence by Assumption 2,

\[
\frac{|C_{\pi(i)}|}{|W|}\geq\frac{|W\cap B\left(\mathbf{a}_{i},\epsilon\right)|}{|W|}\geq p_{i}-\delta.
\]

Furthermore, for any $1\leq i<j\leq k$ and $k>1$,

\[
D\left(\mathbf{a}_{\pi(i)}^{*},\mathbf{a}_{\pi(j)}^{*}\right)\geq D\left(\mathbf{a}_{i},\mathbf{a}_{j}\right)-D^{\dagger}\left(\mathbf{a}_{i},\mathbf{a}_{\pi(i)}^{*}\right)-D^{\dagger}\left(\mathbf{a}_{j},\mathbf{a}_{\pi(j)}^{*}\right)\geq r_{A}-2r_{A}^{\dagger}(\epsilon^{\prime}).
\]
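The ASW bounded in Lemma 3.5 is the average of $1-a(\mathbf{w})/b(\mathbf{w})$ over $\mathbf{w}\in W$, as it appears in the proof above. A minimal sketch of this computation follows, under assumptions not taken from the text: $a(\mathbf{w})$ and $b(\mathbf{w})$ are read as the dissimilarities from $\mathbf{w}$ to its nearest and second-nearest centers, $D(\mathbf{x},\mathbf{y})=1-\langle\mathbf{x},\mathbf{y}\rangle$ is used as the dissimilarity, and a point with $b(\mathbf{w})=0$ contributes zero. The names `dissimilarity` and `simplified_asw` are hypothetical.

```python
def dissimilarity(x, y):
    """Assumed D(x, y) = 1 - <x, y> for unit vectors with non-negative entries."""
    return max(0.0, 1.0 - sum(a * b for a, b in zip(x, y)))


def simplified_asw(points, centers):
    """Average of 1 - a(w)/b(w) over the sample, for the given centers."""
    total = 0.0
    for w in points:
        dists = sorted(dissimilarity(w, c) for c in centers)
        a_w = dists[0]  # dissimilarity to the nearest center
        # With a single center, b(w) is understood as 1, as in the k = 1 convention.
        b_w = dists[1] if len(dists) > 1 else 1.0
        # Assumed convention: a point equally close (at 0) to two centers scores 0.
        ratio = a_w / b_w if b_w > 0 else 1.0
        total += 1.0 - ratio
    return total / len(points)


# Toy sample sitting exactly on k = 3 orthogonal directions, 10 points each.
atoms = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
points = atoms * 10

asw_true_order = simplified_asw(points, atoms)   # one center per direction
asw_too_few = simplified_asw(points, atoms[:2])  # one direction has no center
print(asw_true_order, asw_too_few)
```

With one center on each direction every point has $a(\mathbf{w})=0$, so the average is 1. With only two centers, the points on the uncovered direction have $a(\mathbf{w})=b(\mathbf{w})$ and contribute 0, which lowers the average to $2/3$ in this toy example.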

3.2.2 Proof of Theorem 3.1

Following the setup and notation of Sections 2.1 and 2.3, recall that $W_{n}$ denotes the extremal subsample as in (14), and $A_{m,n}=\left\{\mathbf{a}_{1,n}^{m},\ldots,\mathbf{a}_{m,n}^{m}\right\}$, $C_{i,n}^{m}$, $i\in\{1,\ldots,m\}$, are random multisets on $\mathbb{S}_{+}^{d-1}$ such that $\left(A_{m,n},\mathfrak{C}_{m,n}=\{C_{1,n}^{m},\ldots,C_{m,n}^{m}\}\right)$ form an $m$-clustering of $W_{n}$, $m\in\mathbb{Z}_{+}$, $n\geq m$.
Throughout this section, we follow the assumption of a discrete spectral measure as in Corollary 2.5, and suppose the dissimilarity measure $D$ satisfies Assumption 1.

We first state a result regarding the ASW $\bar{S}$ in (21) when the number of clusters is less than or equal to $k$, the true order of the discrete spectral measure (18).

Proposition 3.6.

If $m<k$, then almost surely,

\[
\limsup_{n}\bar{S}\left(W_{n};A_{m,n},\mathfrak{C}_{m,n}\right)\leq 1-r_{A}p_{\min}.
\]

If $m=k$, then almost surely,

\[
\lim_{n}\bar{S}\left(W_{n};A_{k,n},\mathfrak{C}_{k,n}\right)=1.
\]
Proof.

For $\epsilon,\delta>0$ as in Assumption 2, choose them small enough such that (30) is satisfied. Define the event

\[
E_{n}(\epsilon,\delta)=\left\{|W_{n}\cap B(\mathbf{a}_{i},\epsilon)|\geq|W_{n}|(p_{i}-\delta)\text{ for all }i\in\left\{1,\ldots,k\right\}\right\}. \tag{33}
\]

By Proposition 2.4 and the choice $\epsilon<r_{A}$, each $|W_{n}\cap B(\mathbf{a}_{i},\epsilon)|/|W_{n}|$ converges almost surely to $p_{i}$. Hence, with probability $1$, the event $E_{n}(\epsilon,\delta)$ happens eventually as $n\rightarrow\infty$, namely, $\Pr\left(\liminf_{n}\mathbf{1}\left\{E_{n}(\epsilon,\delta)\right\}=1\right)=1$. So, by Lemmas 3.2 and 3.5, for almost every outcome $\omega$ in the sample space $\Omega$, when $n$ is sufficiently large, we have when $m<k$ that

\[
\bar{S}(W_{n};A_{m,n},\mathfrak{C}_{m,n})\mathbf{1}\left\{E_{n}(\epsilon,\delta)\right\}\leq\left\{1-\left(p_{\min}-\delta\right)\left(r_{A}-r_{A}^{\dagger}(\epsilon)\right)\right\}\mathbf{1}\left\{E_{n}(\epsilon,\delta)\right\}
\]

and

\[
\bar{S}(W_{n};A_{k,n},\mathfrak{C}_{k,n})\mathbf{1}\left\{E_{n}(\epsilon,\delta)\right\}\geq\left\{1-\left(1-k\delta\right)\frac{\epsilon^{\prime}+r_{A}^{\dagger}(\epsilon)}{r_{A}-r_{A}^{\dagger}(\epsilon)-r_{A}^{\dagger}(\epsilon^{\prime})}-k\delta\right\}\mathbf{1}\left\{E_{n}(\epsilon,\delta)\right\}.
\]

The desired results follow if one takes $\limsup_{n}$ and $\liminf_{n}$ respectively in the two inequalities above, and then lets $\delta,\epsilon\rightarrow 0$. ∎

Next, we state a result on the penalty $P_{t}$ in (22) when the number of clusters exceeds or equals $k$.

Proposition 3.7.

Suppose $t>0$. If $m>k$, we have almost surely

\[
\lim_{n}P_{t}(W_{n};A_{m,n},\mathfrak{C}_{m,n})=1.
\]

If $m=k$, we have almost surely

\[
\limsup_{n}P_{t}(W_{n};A_{k,n},\mathfrak{C}_{k,n})\leq 1-\left(r_{A}kp_{\min}\right)^{t}.
\]
Proof.

The argument is similar to that of Proposition 3.6. In particular, under the restriction to the event $E_{n}(\epsilon,\delta)$ in (33), we have by Lemma 3.4 that for $m>k$

\[
P_{t}(W_{n};A_{m,n},\mathfrak{C}_{m,n})\geq 1-\left(k^{2}\delta\right)^{t}\vee\left(\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)+r_{A}^{\dagger}(\epsilon^{\prime})\right)^{t},
\]

and by Lemma 3.5 that

\[
P_{t}(W_{n};A_{k,n},\mathfrak{C}_{k,n})\leq 1-\left[k(p_{\min}-\delta)\left(r_{A}-2r_{A}^{\dagger}(\epsilon^{\prime})\right)\right]^{t}.
\]

We omit the rest of the details. ∎

Now we are ready to prove Theorem 3.1.

Proof of Theorem 3.1.

Putting together Propositions 3.6 and 3.7, and using the facts that $\bar{S}\in[0,1]$ and $P_{t}\in[0,1]$, we have almost surely that

\[
\begin{cases}
\limsup_{n}S_{t}(W_{n};A_{m,n},\mathfrak{C}_{m,n})\leq 1-r_{A}p_{\min}, & \text{if } m<k;\\
\liminf_{n}S_{t}(W_{n};A_{k,n},\mathfrak{C}_{k,n})\geq\left(r_{A}kp_{\min}\right)^{t}, & \text{if } m=k;\\
\limsup_{n}S_{t}(W_{n};A_{m,n},\mathfrak{C}_{m,n})\leq 0, & \text{if } m>k.
\end{cases}
\]

Therefore, the desired claim follows. ∎
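As a rough numerical reading of the three bounds above (not a restatement of Theorem 3.1), the lower bound for $m=k$ exceeds the upper bound for $m<k$ exactly when $(r_{A}kp_{\min})^{t}>1-r_{A}p_{\min}$. The sketch below solves this inequality for $t$ under the assumption that $0<r_{A}kp_{\min}<1$ and $0<r_{A}p_{\min}<1$; the function name and the sample values are hypothetical and chosen only for illustration.

```python
import math


def max_penalty_exponent(r_A, k, p_min):
    """Supremum of t > 0 for which (r_A*k*p_min)**t > 1 - r_A*p_min.

    Assumes 0 < r_A*k*p_min < 1 and 0 < r_A*p_min < 1; this range
    restriction is an assumption of the sketch, not a claim from the paper.
    """
    a = r_A * k * p_min    # base of the lower bound for m = k
    b = 1.0 - r_A * p_min  # upper bound for m < k
    if not (0.0 < a < 1.0 and 0.0 < b < 1.0):
        raise ValueError("sketch only covers 0 < r_A*k*p_min < 1 and 0 < r_A*p_min < 1")
    # a**t > b  <=>  t*ln(a) > ln(b)  <=>  t < ln(b)/ln(a), because ln(a) < 0
    return math.log(b) / math.log(a)


# Hypothetical values: r_A = 0.5, k = 3 atoms, p_min = 0.2
t_max = max_penalty_exponent(0.5, 3, 0.2)
print(f"the m = k bound dominates the m < k bound for 0 < t < {t_max:.4f}")
```

Smaller values of $r_{A}$ or $p_{\min}$ shrink this range of $t$, which matches the intuition that nearby or lightly weighted atoms are harder to separate.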

4 Large deviation analysis of clustering-based spectral estimation

In this section, we provide a quantitative assessment of the consistency result in Corollary 2.5 through large-deviation-type bounds. This analysis is made possible through certain estimates used in the proof of Theorem 3.1 (see Section 3.2).

First, we prepare a Chernoff-Hoeffding-type bound for the sum of a Binomial random number of Bernoulli random variables, which may be of independent interest.

Lemma 4.1.

Suppose $B_{i}$, $i\in\mathbb{Z}_{+}$, are independent Bernoulli random variables with $\Pr(B_{i}=1)=q_{1}\in(0,1)$ and $N$ is a Binomial$(n,q_{2})$ random variable which is independent of $B_{i}$'s, $n\in\mathbb{Z}_{+}$. Then we have for any $r\in(0,1-q_{1})$,

\[
\Pr\left(\frac{1}{N}\sum_{i=1}^{N}B_{i}>q_{1}+r\right)\leq\exp\left\{nq_{2}\left[e^{-\mathcal{D}\left(q_{1}+r\parallel q_{1}\right)}-1\right]\right\}\leq\exp\left\{nq_{2}\left(e^{-2r^{2}}-1\right)\right\}, \tag{34}
\]

and for any $r\in(0,q_{1})$,

\[
\Pr\left(\frac{1}{N}\sum_{i=1}^{N}B_{i}<q_{1}-r\right)\leq\exp\left\{nq_{2}\left[e^{-\mathcal{D}\left(q_{1}-r\parallel q_{1}\right)}-1\right]\right\}\leq\exp\left\{nq_{2}\left(e^{-2r^{2}}-1\right)\right\}, \tag{35}
\]

where $\mathcal{D}(x\parallel y)=x\ln(x/y)+(1-x)\ln\left\{\left(1-x\right)/\left(1-y\right)\right\}$ if $x,y\in(0,1)$ (the Kullback–Leibler divergence between two Bernoulli distributions). Here $\sum_{i=1}^{m}B_{i}/m$ is understood as $0$ when $m=0$.

Proof.

We only prove (34); the proof of (35) is similar. It follows from a version of Hoeffding's inequality for the Binomial distribution [15, Equation (2.1)] that for any $m\geq 0$,

\[
\Pr\left(\frac{1}{m}\sum_{i=1}^{m}B_{i}>q_{1}+r\right)\leq e^{-m\mathcal{D}\left(q_{1}+r\parallel q_{1}\right)}.
\]

Hence, conditioning on $N$ and using its independence of the $B_{i}$'s,

\[
\begin{aligned}
\Pr\left(\frac{1}{N}\sum_{i=1}^{N}B_{i}>q_{1}+r\right)&\leq\sum_{m=0}^{n}\binom{n}{m}q_{2}^{m}e^{-m\mathcal{D}\left(q_{1}+r\parallel q_{1}\right)}(1-q_{2})^{n-m}\\
&=\left[q_{2}\left\{e^{-\mathcal{D}\left(q_{1}+r\parallel q_{1}\right)}-1\right\}+1\right]^{n}\leq\exp\left\{nq_{2}\left[e^{-\mathcal{D}\left(q_{1}+r\parallel q_{1}\right)}-1\right]\right\},
\end{aligned}
\]

where in the last step we have used the inequality $x+1\leq\exp\left(x\right)$, $x\in\mathbb{R}$. To obtain the second inequality in (34), it suffices to note that in view of [15, Equation (2.3)] one has $\mathcal{D}(q_{1}+r\parallel q_{1})\geq 2r^{2}$. ∎
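As an informal sanity check of (34), one can compute the left-hand side exactly for small $n$ by conditioning on $N$, and compare it with both bounds. The sketch below does this in plain Python; the function names and the parameter values ($n=30$, $q_{1}=0.3$, $q_{2}=0.25$, $r=0.2$) are illustrative choices, not taken from the paper.

```python
import math


def bernoulli_kl(x, y):
    """D(x || y) between Bernoulli(x) and Bernoulli(y), for x, y in (0, 1)."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))


def binom_pmf(n, p, m):
    """Pr(Binomial(n, p) = m)."""
    return math.comb(n, m) * p**m * (1 - p) ** (n - m)


def exact_upper_tail(n, q1, q2, r):
    """Exact Pr((1/N) * sum_{i<=N} B_i > q1 + r), with the mean read as 0 if N = 0."""
    total = 0.0
    for m in range(1, n + 1):  # N = 0 contributes nothing: the mean is 0 <= q1 + r
        threshold = m * (q1 + r)
        # Pr(Binomial(m, q1) > m * (q1 + r))
        tail = sum(binom_pmf(m, q1, j) for j in range(m + 1) if j > threshold)
        total += binom_pmf(n, q2, m) * tail
    return total


def kl_bound(n, q1, q2, r):
    """First bound in (34)."""
    return math.exp(n * q2 * (math.exp(-bernoulli_kl(q1 + r, q1)) - 1))


def simplified_bound(n, q2, r):
    """Second bound in (34)."""
    return math.exp(n * q2 * (math.exp(-2 * r**2) - 1))


n, q1, q2, r = 30, 0.3, 0.25, 0.2
print("exact     :", exact_upper_tail(n, q1, q2, r))
print("KL bound  :", kl_bound(n, q1, q2, r))
print("simplified:", simplified_bound(n, q2, r))
```

The three printed values should appear in increasing order, as (34) requires. The bounds are not tight for such a small effective sample size $nq_{2}$; they are meant to capture the exponential decay rate.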

Remark 4.2.

Note that when $r$ is small, the simplified bound on the right-hand side of (34) and (35) is approximately $\exp(-2nq_{2}r^{2})$, which has the same form as the usual Hoeffding inequality (recall that $nq_{2}$ is the effective sample size here).
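To see how close the simplified bound is to the Hoeffding form, one can compare the two exponents per unit of $nq_{2}$, namely $e^{-2r^{2}}-1$ and $-2r^{2}$. Since $e^{-u}-1\geq -u$ for $u\geq 0$, their ratio is below 1 and approaches 1 as $r\rightarrow 0$. A minimal sketch, with hypothetical function names:

```python
import math


def simplified_exponent(r):
    """Exponent per unit of n*q2 in the simplified bound: exp(-2 r^2) - 1."""
    return math.expm1(-2 * r**2)


def hoeffding_exponent(r):
    """Exponent per unit of n*q2 in the classical Hoeffding form: -2 r^2."""
    return -2 * r**2


def exponent_ratio(r):
    """Ratio of the two exponents; lies in (0, 1) and tends to 1 as r -> 0."""
    return simplified_exponent(r) / hoeffding_exponent(r)


for r in (0.5, 0.2, 0.05):
    print(f"r = {r}: ratio = {exponent_ratio(r):.4f}")
```

A ratio below 1 means the simplified bound is slightly weaker than $\exp(-2nq_{2}r^{2})$, with the gap vanishing for small $r$.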

Let $H=\sum_{i=1}^{k}p_{i}\delta_{\mathbf{a}_{i}}$ be as defined in (18), and let $\mathbf{a}_{\pi_{n}(i),n}^{k}$ and $p_{\pi_{n}(i),n}^{k}$ be as in Corollary 2.5. Accurate estimation can be interpreted as follows: for small $x,y>0$, there exists a permutation $\pi$ such that $D(\mathbf{a}_{\pi(i),n}^{k},\mathbf{a}_{i})<x$ and $|p_{\pi(i),n}^{k}-p_{i}|<y$ for all $i\in\{1,\ldots,k\}$. Now consider the complementary "large deviation" event

\[
E(x,y)=\bigcap_{\pi}\bigcup_{i=1}^{k}\left\{|\mathbf{a}_{\pi(i),n}^{k}-\mathbf{a}_{i}|>x\right\}\cup\left\{|p_{\pi(i),n}^{k}-p_{i}|>y\right\}.
\]

We have the following result.

Proposition 4.3.

For any $x,y>0$,

\[
\limsup_{n}\frac{1}{c_{(r)}\ell_{n}}\ln\Pr\left(E(x,y)\right)\leq\exp\left(-2\Delta(x,y)^{2}\right)-1
\]

where

\[
\Delta(x,y)=\begin{cases}
\max\{y/c_{k},\,p_{\min}x/(k+x)\} & \text{when } x<\epsilon_{0},\ y<c_{k}p_{\min}\epsilon_{0}/(k+\epsilon_{0}),\\
p_{\min}\epsilon_{0}/(k+\epsilon_{0}) & \text{otherwise},
\end{cases}
\]

where $\epsilon_{0}:=\sup\{\epsilon>0:r_{A}>\epsilon+r_{A}^{\dagger}(\epsilon)\}$ and $c_{k}:=(k\vee 2-1)$.
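The rate function $\Delta(x,y)$ is straightforward to evaluate once $k$, $p_{\min}$ and $\epsilon_{0}$ are known. The sketch below treats $\epsilon_{0}$ as a given input (computing it requires $r_{A}$ and $r_{A}^{\dagger}$ from the earlier sections), reads $c_{k}$ as $(k\vee 2)-1$, and uses hypothetical function names and sample values.

```python
import math


def c_k(k):
    """c_k, read here as (k v 2) - 1: equals 1 for k = 1 and k - 1 for k >= 2."""
    return max(k, 2) - 1


def delta_rate(x, y, k, p_min, eps0):
    """Delta(x, y) from Proposition 4.3; eps0 is supplied by the caller."""
    ck = c_k(k)
    cap = p_min * eps0 / (k + eps0)  # value of Delta in the 'otherwise' branch
    if x < eps0 and y < ck * cap:
        return max(y / ck, p_min * x / (k + x))
    return cap


def rate_bound(x, y, k, p_min, eps0):
    """Right-hand side exp(-2 Delta(x, y)^2) - 1 of the limsup bound (always < 0)."""
    d = delta_rate(x, y, k, p_min, eps0)
    return math.expm1(-2 * d * d)


# Hypothetical setting: k = 3 atoms, p_min = 0.2, eps0 = 0.4
for x, y in [(0.1, 0.01), (0.3, 0.04), (1.0, 0.5)]:
    print(x, y, delta_rate(x, y, 3, 0.2, 0.4), rate_bound(x, y, 3, 0.2, 0.4))
```

Note that $\Delta(x,y)$ never exceeds $p_{\min}\epsilon_{0}/(k+\epsilon_{0})$, so the bound stops improving once $x$ and $y$ pass the thresholds in the first branch.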

Proof.

If $H_{n}\left(B(\mathbf{a}_{i},\epsilon)\right)=|W_{n}\cap B(\mathbf{a}_{i},\epsilon)|/|W_{n}|\geq p_{i}-\delta$ for all $i\in\{1,\ldots,k\}$, by Lemmas 3.3 and 3.5, as long as (30) holds, there exists a permutation $\pi:\{1,\ldots,k\}\mapsto\{1,\ldots,k\}$, such that $D(\mathbf{a}_{\pi(i),n}^{k},\mathbf{a}_{i})<\epsilon^{\prime}$ and $|p_{\pi(i),n}^{k}-p_{i}|\leq c_{k}\delta$ for all $i\in\{1,\ldots,k\}$. Hence under (30), whenever $\epsilon^{\prime}\leq x$ or $c_{k}\delta\leq y$,

\[
\Pr\left(E(x,y)\right)\leq\Pr\left(\bigcap_{\pi}\bigcup_{i=1}^{k}\left\{D(\mathbf{a}_{\pi(i),n}^{k},\mathbf{a}_{i})>\epsilon^{\prime}\right\}\cup\left\{|p_{\pi(i),n}^{k}-p_{i}|>c_{k}\delta\right\}\right)\leq\Pr\left(\bigcup_{i=1}^{k}\{H_{n}\left(B(\mathbf{a}_{i},\epsilon)\right)<p_{i}-\delta\}\right),
\]

where $H_{n}$ is the empirical spectral measure in (17). Observe that for any $i\in\{1,\ldots,k\}$,

\[
\left(|W_{n}|,\left(\mathbf{1}\left\{\mathbf{X}_{j}/\|\mathbf{X}_{j}\|_{(s)}\in B(\mathbf{a}_{i},\epsilon),\ \|\mathbf{X}_{j}\|_{(r)}\geq\left(n/\ell_{n}\right)^{1/\alpha}\right\}\right)_{j=1,\ldots,n}\right)\overset{d}{=}(N,(B_{j})_{j=1,\ldots,n}),
\]

where $N$ and $B_{j}$'s are as in Lemma 4.1 with respective parameters $q_{1}$ and $q_{2}$ given as follows:

\[
q_{1}=q_{1}(i,\epsilon,n):=\Pr\left(\mathbf{X}_{1}/\|\mathbf{X}_{1}\|_{(s)}\in B(\mathbf{a}_{i},\epsilon),\ \|\mathbf{X}_{1}\|_{(r)}\geq\left(n/\ell_{n}\right)^{1/\alpha}\right)/\Pr\left(\|\mathbf{X}_{1}\|_{(r)}\geq\left(n/\ell_{n}\right)^{1/\alpha}\right)\rightarrow p_{i} \tag{36}
\]

as $n\rightarrow\infty$, where the last convergence holds due to (8) and the fact that $B(\mathbf{a}_{i},\epsilon)$'s are disjoint under $\epsilon<r_{A}$, and

\[
q_{2}=q_{2}(n)=\Pr\left(\|\mathbf{X}_{1}\|_{(r)}\geq\left(n/\ell_{n}\right)^{1/\alpha}\right)\sim c_{(r)}\left(\ell_{n}/n\right) \tag{37}
\]

as $n\rightarrow\infty$. Now applying Lemma 4.1, we have

\[
\begin{aligned}
\Pr\left(\bigcup_{i=1}^{k}\{H_{n}\left(B(\mathbf{a}_{i},\epsilon)\right)<p_{i}-\delta\}\right)&\leq\sum_{i=1}^{k}\Pr\left(H_{n}(B(\mathbf{a}_{i},\epsilon))<p_{i}-\delta\right)\\
&\leq k\exp\left(nq_{2}(n)\left[\exp\left\{-2\delta^{2}\right\}-1\right]\right).
\end{aligned}
\tag{38}
\]

Therefore, also in view of (36) and (37), we have

\[
\limsup_{n}\frac{1}{\ell_{n}}\ln\left\{\Pr\left(E(x,y)\right)\right\}\leq c_{(r)}\left\{\exp(-2\delta^{2})-1\right\}.
\]

The next step is to determine the largest possible value of $\delta$. Recall $\epsilon_{0}=\sup\{\epsilon^{\prime}>0:r_{A}>\epsilon^{\prime}+r_{A}^{\dagger}(\epsilon^{\prime})\}$. Then when $\epsilon^{\prime}\in(0,\epsilon_{0})$, for all $\epsilon$ small enough we have $r_{A}>\epsilon^{\prime}+2r_{A}^{\dagger}(\epsilon)+r_{A}^{\dagger}(\epsilon^{\prime})$, namely, (30) holds. Hence by taking $\epsilon\downarrow 0$ in (27), we get from $\epsilon^{\prime}<\epsilon_{0}$ the restriction $\delta<p_{\min}\epsilon_{0}/(k+\epsilon_{0})$. Similarly, from $\epsilon^{\prime}\leq x$ we get the restriction $\delta<p_{\min}x/(k+x)$. In addition, from $c_{k}\delta\leq y$ we get the restriction $\delta\leq y/c_{k}$. At least one of the last two conditions should be satisfied. Therefore,

$$\begin{cases}\delta<p_{\min}\epsilon_{0}/(k+\epsilon_{0})&\text{if }x\geq\epsilon_{0},\\ \delta<p_{\min}\epsilon_{0}/(k+\epsilon_{0})&\text{if }x<\epsilon_{0},\ y\geq c_{k}p_{\min}\epsilon_{0}/(k+\epsilon_{0}),\\ \delta<\max\{y/c_{k},\,p_{\min}x/(k+x)\}&\text{if }x<\epsilon_{0},\ y<c_{k}p_{\min}\epsilon_{0}/(k+\epsilon_{0}).\end{cases}$$

The result then follows. ∎

Remark 4.4.

The large-deviation-type estimates in Proposition 4.3 say that the probability $\Pr\left(E(x,y)\right)$ decays exponentially in the expected extremal subsample size $c_{(r)}\ell_{n}$. It is worth observing that the expression of $\Delta(x,y)$ reflects the following: the difficulty of clustering-based estimation, measured by the aforementioned large error probabilities, depends negatively on $p_{\min}$ and positively on $k$.
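As a rough numerical illustration of the case analysis at the end of the proof of Proposition 4.3, the following Python sketch evaluates the supremum of the admissible $\delta$ and the corresponding normalized exponent $\exp(-2\delta^{2})-1$. The function names are hypothetical, and $\epsilon_{0}$ and $c_{k}$ are treated as user-supplied numbers, since their actual values depend on quantities defined earlier in the paper.

```python
import math


def delta_cap(eps0, k, p_min):
    """Upper limit p_min * eps0 / (k + eps0) coming from eps' < eps0."""
    return p_min * eps0 / (k + eps0)


def delta_sup(x, y, eps0, k, p_min, c_k):
    """Supremum of admissible delta, following the three displayed cases.

    eps0 and c_k are assumed inputs here (illustrative values only).
    """
    cap = delta_cap(eps0, k, p_min)
    if x >= eps0:
        return cap
    if y >= c_k * cap:
        return cap
    return max(y / c_k, p_min * x / (k + x))


def normalized_exponent(delta):
    """exp(-2 delta^2) - 1: negative for delta > 0, more negative = faster decay."""
    return math.exp(-2.0 * delta ** 2) - 1.0
```

Evaluating `delta_cap` for increasing $k$ or decreasing $p_{\min}$ gives a smaller cap and hence a weaker exponent, which is the dependence described in the remark.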

We also have the following result, which states that in the context of Theorem 3.1, the probability of false order selection tends to $0$ exponentially fast.

Proposition 4.5.

Under the assumptions and notation of Theorem 3.1, fix $t\in(0,t_{0})$. Then

$$\limsup_{n}\frac{1}{c_{(r)}\ell_{n}}\ln\left\{\Pr\left(S_{t}(W_{n};A_{k,n},\mathfrak{C}_{k,n})\leq S_{t}(W_{n};A_{m,n},\mathfrak{C}_{m,n})\text{ for all }m\neq k\right)\right\}\leq\exp\left(-2\delta_{t}(k,p_{\min})^{2}\right)-1,$$

where $\delta_{t}(k,p_{\min})>0$ is the solution $\delta$ of the equation $[k(p_{\min}-\delta)r_{A}]^{t}-k\delta=\left(k^{2}\delta\right)^{t}\vee\left(\left(1-\left(p_{\min}-\delta\right)r_{A}\right)\mathbf{1}\left\{k\geq 2\right\}\right)$.

Proof.

Writing $S_{t}(m)=S_{t}(W_{n};A_{m,n},\mathfrak{C}_{m,n})$, we have

$$\Pr\left(S_{t}(k)\leq S_{t}(m),\ m\neq k\right)\leq\Pr\left(\{S_{t}(k)\leq S_{t}(m),\ m\neq k\}\cap E_{n}(\epsilon,\delta)\right)+\Pr\left(E_{n}(\epsilon,\delta)^{c}\right),$$

where $E_{n}(\epsilon,\delta)$ is given in (33). Combining the inequalities regarding $\bar{S}$ in the proof of Proposition 3.6 and the inequalities regarding $P_{t}$ in the proof of Proposition 3.7, the event in the first probability on the right-hand side above is empty as long as $\delta>0$ satisfies

$$[k(p_{\min}-\delta)r_{A}]^{t}-k\delta>\left(k^{2}\delta\right)^{t}\vee\left(\left(1-\left(p_{\min}-\delta\right)r_{A}\right)\mathbf{1}\left\{k\geq 2\right\}\right)$$

and $\epsilon$ is sufficiently small (depending on $\delta$). Note that the inequality above holds when $\delta$ is sufficiently small due to $0<t<t_{0}=\ln\left(1-r_{A}p_{\min}\right)/\ln\left(r_{A}kp_{\min}\right)$; moreover, its left-hand side is decreasing (to negative values) and its right-hand side is increasing as $\delta$ increases to $p_{\min}$. Then for any $\delta\in(0,\delta_{t}(k,p_{\min}))$, we have in view of (37) and (4) that

$$\limsup_{n}\frac{1}{c_{(r)}\ell_{n}}\ln\left\{\Pr\left(S_{t}(k)\leq S_{t}(m),\ m\neq k\right)\right\}\leq\exp\left(-2\delta^{2}\right)-1.$$

The proof is concluded by letting $\delta\uparrow\delta_{t}(k,p_{\min})$. ∎
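Since the left-hand side of the defining equation for $\delta_{t}(k,p_{\min})$ decreases and the right-hand side increases in $\delta$ on $(0,p_{\min})$, the solution can be located by bisection. The following Python sketch does this; the function names are hypothetical, and $r_{A}$, $p_{\min}$, $k$ and $t$ are assumed to be supplied as plain numbers with $0<t<t_{0}$.

```python
def gap(delta, t, k, p_min, r_A):
    """Left-hand side minus right-hand side of the equation defining delta_t."""
    lhs = (k * (p_min - delta) * r_A) ** t - k * delta
    rhs_a = (k * k * delta) ** t
    rhs_b = (1.0 - (p_min - delta) * r_A) if k >= 2 else 0.0
    return lhs - max(rhs_a, rhs_b)


def solve_delta_t(t, k, p_min, r_A, tol=1e-12):
    """Bisection on (0, p_min); relies on gap() being decreasing in delta."""
    lo, hi = 0.0, p_min
    if gap(lo, t, k, p_min, r_A) <= 0.0:
        # Happens when t is not below t_0, so no positive solution exists.
        raise ValueError("no positive solution: check that 0 < t < t_0")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid, t, k, p_min, r_A) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The returned value can then be plugged into $\exp(-2\delta^{2})-1$ to obtain the exponent in Proposition 4.5 for a given configuration.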

5 Clustering and heavy-tailed factor models

5.1 The models

As observed by Einmahl et al. [5] and Janßen and Wan [20], one may relate a $k$-clustering algorithm to the estimation of certain factor-like models that are often considered in the analysis of multivariate extremes. Suppose $B=\left(b_{ij}\right)_{i=1,\ldots,d,\,j=1,\ldots,k}=\left(\mathbf{b}_{1},\ldots,\mathbf{b}_{k}\right)$, where $\mathbf{b}_{j}=(b_{1j},\ldots,b_{dj})^{\top}$, $j\in\{1,\ldots,k\}$, are $k$ distinct $d$-dimensional vectors, $b_{ij}\geq 0$, and that each column and row vector of $B$ is nonzero (otherwise, the dimension $d$ or the factor order $k$ can be reduced). Assume that $\mathbf{Z}=(Z_{1},\ldots,Z_{k})^{\top}$ is a vector of i.i.d. positive continuous random variables satisfying $\Pr(Z_{1}>z)\sim z^{-\alpha}$ as $z\rightarrow\infty$, $\alpha\in(0,\infty)$. Then the sum-linear model is given as

$$\mathbf{X}=\left(X_{1},\ldots,X_{d}\right)^{\top}=\left(\sum_{j=1}^{k}b_{1j}Z_{j},\ldots,\sum_{j=1}^{k}b_{dj}Z_{j}\right)^{\top}=B\mathbf{Z}.\tag{39}$$

On the other hand, we also have the max-linear model as

$$\mathbf{X}=\left(X_{1},\ldots,X_{d}\right)^{\top}=\left(\bigvee_{j=1}^{k}b_{1j}Z_{j},\ldots,\bigvee_{j=1}^{k}b_{dj}Z_{j}\right)^{\top}=B\odot\mathbf{Z},\tag{40}$$

where $\odot$ is interpreted as the matrix product with the sum operation replaced by the maximum operation. Note that due to the exchangeability of $\left(Z_{1},\ldots,Z_{k}\right)$, either model is identifiable only up to a permutation of the vectors $\mathbf{b}_{j}$, $j\in\{1,\ldots,k\}$, i.e., the distribution of $\mathbf{X}$ is unchanged if $B$ is replaced by $B_{\pi}:=\left(\mathbf{b}_{\pi(1)},\ldots,\mathbf{b}_{\pi(k)}\right)$ for any permutation $\pi:\{1,\ldots,k\}\to\{1,\ldots,k\}$. The models of types (39) and (40) have recently attracted considerable interest in connection with causal structural equations for extremes; see, e.g., [11, 12].
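To make the two constructions (39) and (40) concrete, here is a minimal Python sketch that draws the factors and forms $B\mathbf{Z}$ and $B\odot\mathbf{Z}$. The function names are hypothetical, and the factors are taken to be Fréchet with shape $\alpha$ as one convenient (assumed) choice satisfying $\Pr(Z_{1}>z)\sim z^{-\alpha}$.

```python
import math
import random


def frechet(alpha, rng):
    """Draw Z with Pr(Z <= z) = exp(-z^(-alpha)) by inversion (assumed factor law)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return (-math.log(u)) ** (-1.0 / alpha)


def sum_linear(B, z):
    """X = B z: ordinary matrix-vector product, B given as a list of d rows."""
    return [sum(b * zj for b, zj in zip(row, z)) for row in B]


def max_linear(B, z):
    """X = B (.) z: matrix-vector product with sums replaced by maxima."""
    return [max(b * zj for b, zj in zip(row, z)) for row in B]


def permute_columns(B, perm):
    """Return B_pi = (b_pi(1), ..., b_pi(k))."""
    return [[row[j] for j in perm] for row in B]


rng = random.Random(0)
B = [[1.0, 0.0], [0.5, 0.5]]
z = [frechet(1.0, rng) for _ in range(2)]
x_sum, x_max = sum_linear(B, z), max_linear(B, z)
```

Permuting the columns of $B$ and the entries of $\mathbf{Z}$ in the same way leaves $\mathbf{X}$ unchanged, which is the source of the permutation ambiguity noted above.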

It is known that both models above satisfy MRV with (1), and have a discrete spectral measure as in (18) with

$$p_{j}=\frac{\|\mathbf{b}_{j}\|_{(r)}^{\alpha}}{\sum_{\ell=1}^{k}\|\mathbf{b}_{\ell}\|_{(r)}^{\alpha}},\quad\mathbf{a}_{j}=\frac{\mathbf{b}_{j}}{\|\mathbf{b}_{j}\|_{(s)}},\quad j\in\left\{1,\ldots,k\right\}.\tag{41}$$

This can be derived based on the well-known “single large jump” heuristic: when $\|\mathbf{X}\|_{(r)}$ is large, it is only due to a single large $Z_{j}$ with overwhelming probability. See, e.g., [22] and [5]; we mention that these works usually assume the same norm $\|\cdot\|_{(r)}=\|\cdot\|_{(s)}$ and $\alpha=1$, although an extension is straightforward. In addition, the marginal standardization condition (1) or equivalently (7) imposes the following restriction on $B$:

$$\sum_{j=1}^{k}b_{ij}^{\alpha}=1,\quad i\in\left\{1,\ldots,d\right\}.\tag{42}$$
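The map from $B$ to the spectral atoms and masses in (41), together with a check of the standardization (42), can be sketched in a few lines of Python. The function names are hypothetical; the norms $\|\cdot\|_{(r)}$ and $\|\cdot\|_{(s)}$ are passed in as callables, and the $\ell_{1}$ and Euclidean norms used in the example are illustrative assumptions only.

```python
def columns(B):
    """Columns b_1, ..., b_k of a matrix given as a list of d rows."""
    return [list(col) for col in zip(*B)]


def spectral_measure(B, alpha, norm_r, norm_s):
    """Masses p_j and atoms a_j of the discrete spectral measure, as in (41)."""
    cols = columns(B)
    weights = [norm_r(b) ** alpha for b in cols]
    total = sum(weights)
    p = [w / total for w in weights]
    a = [[x / norm_s(b) for x in b] for b in cols]
    return p, a


def is_standardized(B, alpha, tol=1e-9):
    """Check (42): each row satisfies sum_j b_ij^alpha = 1."""
    return all(abs(sum(b ** alpha for b in row) - 1.0) <= tol for row in B)


def l1(v):
    return sum(abs(x) for x in v)


def l2(v):
    return sum(x * x for x in v) ** 0.5


B = [[0.5, 0.5], [0.2, 0.8]]  # rows sum to 1, so (42) holds with alpha = 1
p, a = spectral_measure(B, alpha=1.0, norm_r=l1, norm_s=l2)
```

In this example the column $\ell_{1}$ norms are $0.7$ and $1.3$, so the masses are $0.35$ and $0.65$, and each atom has unit Euclidean norm.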

We also mention that one may relax the models (39) and (40) by adding a noise term, e.g., $\mathbf{X}=B\mathbf{Z}+\boldsymbol{\varepsilon}$ or $\mathbf{X}=(B\odot\mathbf{Z})\vee\boldsymbol{\varepsilon}$, where $\boldsymbol{\varepsilon}=(\varepsilon_{1},\ldots,\varepsilon_{d})^{\top}$ is a vector of i.i.d. positive noise terms, and the maximum $\vee$ is performed coordinate-wise. As long as each $\varepsilon_{i}$ has a tail lighter than that of $Z_{j}$, the conclusions made above still hold (see, e.g., [5]). The discussion also applies to the transformed-linear model of [2]. Finally, we mention that in the context of multivariate extremes, one typically only considers fitting these models to an extremal subsample (see, e.g., (14)) instead of the whole sample.

5.2 Order selection and coefficient estimation

Due to the discrete nature of the spectral measure, the likelihood functions of these models are inaccessible (see, e.g., [5, 28, 4]). Even without taking a perspective of extremes, the max-linear model (40) does not admit a smooth density. Therefore, the usual model selection techniques based on information criteria are not available. On the other hand, the spectral measure of these factor models, including (39) and (40), is of the form (18). Therefore, the penalized ASW method proposed in Section 3 could be used to select the order of factors $k$, whose consistency is supported by Theorem 3.1.

From now on, suppose that the order $k$ is known. Another issue worth discussing is whether we can translate the estimation of the spectral measure through a $k$-clustering algorithm (refer to Section 2.3) into an estimation of the coefficient matrix $B=\left(\mathbf{b}_{1},\ldots,\mathbf{b}_{k}\right)$ in (39) or (40). Note that the constraint (42) also needs to be taken into account. Combining (41) and (42), to solve for the $kd$ coefficients in $B$ from the $p_{j}$’s and $\mathbf{a}_{j}$’s, we have in total $kd+d-1$ free equations ($k-1$ from the equations for the $p_{j}$’s, $(d-1)k$ from the equations for the $\mathbf{a}_{j}$’s, and $d$ from (42)). When the $p_{j}$’s and $\mathbf{a}_{j}$’s are estimated via $k$-clustering, the over-determined system may not admit a solution, although this over-determined relation holds asymptotically in view of Corollary 2.5.

In the following, we describe a simple method to convert spectral estimation to an estimation of $B$ that satisfies the constraint (42). Observe that the exponent measure $\Lambda$ for the models (39) and (40) concentrates on the rays $\{t\mathbf{b}_{j}:t>0\}$, $j\in\left\{1,\ldots,k\right\}$. Hence a spectral mass point $\mathbf{a}_{j}=\mathbf{b}_{j}/\|\mathbf{b}_{j}\|_{(s)}$ on the $\|\cdot\|_{(s)}$-norm sphere corresponds to a spectral mass point $\mathbf{b}_{j}/\|\mathbf{b}_{j}\|_{\alpha}=\mathbf{a}_{j}/\|\mathbf{a}_{j}\|_{\alpha}$ on the $\alpha$-norm sphere, $j\in\left\{1,\ldots,k\right\}$. The advantage of considering the $\alpha$-norm sphere is that

$$\sum_{j=1}^{k}\|\mathbf{b}_{j}\|_{\alpha}^{\alpha}=\sum_{i=1}^{d}\sum_{j=1}^{k}b_{ij}^{\alpha}=d$$

due to relation (42). Therefore, under the choice $\|\cdot\|_{(r)}=\|\cdot\|_{\alpha}$ in (41), we have $p_{j}d=\|\mathbf{b}_{j}\|_{\alpha}^{\alpha}$, and hence

$$\mathbf{b}_{j}=\left(p_{j}d\right)^{1/\alpha}\frac{\mathbf{a}_{j}}{\|\mathbf{a}_{j}\|_{\alpha}},\quad j\in\left\{1,\ldots,k\right\}.\tag{43}$$

Note that if $\|\cdot\|_{(s)}=\|\cdot\|_{\alpha}$ already, then $\|\mathbf{a}_{j}\|_{\alpha}=1$. So one can plug the estimates of $\mathbf{a}_{j}$ and $p_{j}$ obtained via $k$-clustering on the $\alpha$-norm sphere into (43), obtaining, say, $\widehat{\mathbf{b}}_{j}$, $j\in\{1,\ldots,k\}$. However, the condition (42) may not be satisfied. We propose the following simple correction: first, form the preliminary estimated coefficient matrix $\widehat{B}:=\left(\widehat{\mathbf{b}}_{1},\ldots,\widehat{\mathbf{b}}_{k}\right)=:\left(\mathbf{r}_{1},\ldots,\mathbf{r}_{d}\right)^{\top}$, where $\mathbf{r}_{i}^{\top}$, $i\in\left\{1,\ldots,d\right\}$, are the row vectors of $\widehat{B}$. Then we obtain the final estimate $\widetilde{B}=\left(\widetilde{\mathbf{b}}_{1},\ldots,\widetilde{\mathbf{b}}_{k}\right)$ of $B$ by replacing each row $\mathbf{r}_{i}$ with $\mathbf{r}_{i}/\|\mathbf{r}_{i}\|_{\alpha}$, which ensures (42). It follows from Corollary 2.5 and a continuous mapping argument that the estimate of $B$ thus obtained is consistent (up to a permutation of the $\mathbf{b}_{j}$’s).
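The two-step procedure just described (plug-in via (43), followed by row normalization) can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, the estimated masses and atoms are assumed to come from some $k$-clustering routine not shown here, and $\alpha$ is taken as known.

```python
def alpha_norm(v, alpha):
    """(sum_i |v_i|^alpha)^(1/alpha)."""
    return sum(abs(x) ** alpha for x in v) ** (1.0 / alpha)


def preliminary_B(p_hat, a_hat, alpha):
    """Plug-in estimate from (43); a_hat is a list of k atoms of dimension d.

    Returns the d x k matrix as a list of rows r_1, ..., r_d.
    """
    d = len(a_hat[0])
    cols = []
    for p_j, a_j in zip(p_hat, a_hat):
        scale = (p_j * d) ** (1.0 / alpha) / alpha_norm(a_j, alpha)
        cols.append([scale * x for x in a_j])
    return [list(row) for row in zip(*cols)]


def normalize_rows(B_hat, alpha):
    """Replace each row r_i by r_i / ||r_i||_alpha so that (42) holds."""
    return [[x / alpha_norm(row, alpha) for x in row] for row in B_hat]


# Illustrative inputs: slightly perturbed masses, atoms on the Euclidean sphere.
p_hat = [0.4, 0.6]
a_hat = [[0.5 / 0.29 ** 0.5, 0.2 / 0.29 ** 0.5],
         [0.5 / 0.89 ** 0.5, 0.8 / 0.89 ** 0.5]]
B_tilde = normalize_rows(preliminary_B(p_hat, a_hat, alpha=1.0), alpha=1.0)
```

With exact inputs (here $p=(0.35,0.65)$ and the same atoms), `preliminary_B` already reproduces a matrix satisfying (42), and the row normalization leaves it unchanged; with perturbed inputs, the normalization restores the constraint.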

6 Simulation and real data studies

6.1 Simulation studies

In this section, we present some simulation studies to illustrate the performance of the penalized ASW method introduced in Section 3. We follow the setup in [20, Section 4] to simulate the max-linear factor model (40) with a randomly generated coefficient matrix $B$. In particular, we let the factors $Z_{j}$ each follow a standard Fréchet ($\alpha=1$) distribution. We consider 4 different combinations of dimensionality $d$ and true order $k$. Under each $(d,k)$ combination, we describe in the list below how the coefficient vectors $\mathbf{b}_{j}$ are generated. Note that due to the standardization (42), only $\mathbf{b}_{1},\ldots,\mathbf{b}_{k-1}$ need to be specified. Let the $U_{i}$’s stand for i.i.d. uniform random variables on $[0,1]$.

  • 1.

    $d=4$, $k=2$: $\mathbf{b}_{1}=(U_{1},U_{2},U_{3},U_{4})^{\top}/2$.

  • 2.

    $d=4$, $k=6$: $\mathbf{b}_{1}=(U_{1},U_{2},U_{3},U_{4})^{\top}/3$, $\mathbf{b}_{2}=(U_{5},0,U_{6},0)^{\top}/3$, $\mathbf{b}_{3}=(0,U_{7},0,U_{8})^{\top}/3$, $\mathbf{b}_{4}=(U_{9},U_{10},0,0)^{\top}/3$, $\mathbf{b}_{5}=(0,0,U_{11},U_{12})^{\top}/3$.

  • 3.

    $d=6,\ k=6$: $\mathbf{b}_1=(U_1,\cdots,U_6)^\top/3$, $\mathbf{b}_2=(U_7,0,U_8,0,U_9,0)^\top/3$, $\mathbf{b}_3=(0,U_{10},0,U_{11},0,U_{12})^\top/3$, $\mathbf{b}_4=(U_{13},U_{14},U_{15},0,0,0)^\top/3$, $\mathbf{b}_5=(0,0,0,U_{13},U_{14},U_{15})^\top/3$.

  • 4.

    $d=10,\ k=6$: The first 5 factors are $\mathbf{b}_1=(U_1,\cdots,U_{10})^\top/2$, $\mathbf{b}_2=(U_{11},U_{12},0,\cdots,0)^\top/2$, $\mathbf{b}_3=(0,0,U_{13},U_{14},0,\cdots,0)^\top/2$, $\mathbf{b}_4=(0,0,0,0,U_{15},U_{16},0,0,0,0)^\top/2$, $\mathbf{b}_5=(0,\cdots,0,U_{17},U_{18},U_{19},U_{20})^\top/2$.

For each of the 4 simulation setups described above, we randomly generate 100 models (i.e., 100 coefficient matrices $B$). From each of these generated models, we simulate a dataset of size 1000, extract the subsample of size $100$ with the largest $2$-norms, and project the subsample onto the $2$-norm sphere; namely, we work with $\|\cdot\|_{(r)}=\|\cdot\|_{(s)}=\|\cdot\|_2$. Subsequently, a spherical clustering algorithm (spherical $k$-means or $k$-pc) and the computation of the penalized ASW score are carried out on this projected subsample. Throughout the paper, for the spherical $k$-means algorithm, we use the implementation in the R package skmeans [16], and for the $k$-pc algorithm, we use the R implementation provided in the supplementary material of [7].
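The extraction and projection step described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the study; the function name `extremal_angles` and the toy Pareto data are assumptions made for this example, and the toy data merely stand in for a dataset simulated from the factor model.

```python
import numpy as np

def extremal_angles(X, n_extreme):
    """Keep the n_extreme rows of X with the largest 2-norms and
    project them onto the unit 2-norm sphere."""
    X = np.asarray(X, dtype=float)
    radii = np.linalg.norm(X, axis=1)
    # Indices of the n_extreme largest radii; their order does not matter
    # for the subsequent clustering.
    top = np.argsort(radii)[-n_extreme:]
    return X[top] / radii[top, None]

# Toy heavy-tailed data (hypothetical), playing the role of a simulated
# dataset of size 1000 in dimension d = 4.
rng = np.random.default_rng(0)
X = rng.pareto(2.0, size=(1000, 4))

W = extremal_angles(X, 100)   # 100 unit vectors, ready for spherical clustering
```

The rows of `W` are the angular components on which spherical $k$-means or $k$-pc would then be run.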

In Figures 2 to 5, we present the simulation results graphically. Each colored matrix plot is associated with a $(d,k)$ setup as described above. In each plot, a column corresponds to a simulated dataset, and there are 100 columns. The upper half of the plot corresponds to spherical $k$-means and the lower half corresponds to $k$-pc. Within each of these halves, a row corresponds to a specification of the penalty parameter $t$. The color of a cell in the matrix signifies the order $m$ chosen by maximizing the penalized ASW. White indicates that $m$ coincides with the true order $k$; a deeper shade of red indicates that $m$ falls further below the true $k$, and a deeper shade of blue indicates that $m$ exceeds the true $k$ by a larger amount. The bar graph to the right of the matrix indicates the success rate of order identification (that is, $m=k$) across all 100 instances.

In all these simulation setups, we observe a tendency for the non-penalized ($t=0$) ASW to overestimate the order, sometimes greatly. As the penalty parameter $t$ is increased from $0$, we observe a significant bias correction effect, and the order identification success rate is noticeably improved over a range of $t>0$. Note that this success rate is calculated with respect to the same $t$ across the different simulated datasets. We expect the success rate to improve if $t$ is adaptively tuned for each dataset following the visual method described in Section 3. It is also worth mentioning that the order identification based on $k$-pc tends to be more accurate than that based on $k$-means in most of these simulations.
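The bookkeeping behind the matrix plots and the success rates can be sketched as follows. The array `chosen` below contains made-up selected orders (rows index values of $t$, columns index simulated datasets) and is not taken from the actual simulations; the function name `summarize_orders` is likewise an assumption for this illustration.

```python
import numpy as np

def summarize_orders(chosen, k_true):
    """chosen[a, b] is the order m selected for penalty value t_a on dataset b."""
    chosen = np.asarray(chosen)
    # Signed deviation m - k: negative = underestimate (red),
    # zero = exact (white), positive = overestimate (blue).
    deviation = chosen - k_true
    # One success rate per value of t, computed over all datasets.
    success_rate = (deviation == 0).mean(axis=1)
    return deviation, success_rate

# Hypothetical selected orders for 3 values of t and 4 datasets, true k = 2.
chosen = np.array([[4, 3, 2, 5],   # t = 0: tends to overestimate
                   [2, 2, 2, 3],
                   [2, 1, 2, 2]])
deviation, success_rate = summarize_orders(chosen, k_true=2)
```

Here each success rate uses one fixed $t$ for all datasets, which matches the way the rates in the bar graphs are computed.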

Fig. 2: Simulation result visualization for the setup $d=4,\ k=2$ in Section 6.1.
Fig. 3: Simulation result visualization for the setup $d=4,\ k=6$ in Section 6.1.
Fig. 4: Simulation result visualization for the setup $d=6,\ k=6$ in Section 6.1.
Fig. 5: Simulation result visualization for the setup $d=10,\ k=6$ in Section 6.1.

6.2 Real data demonstrations

In this section, we use real data examples to demonstrate order selection through the penalized ASW introduced in Section 3, as well as the conversion of clustering-based spectral estimation to a factor coefficient matrix as mentioned in Section 5.2. We present only the analysis based on the spherical $k$-pc algorithm, that is, the dissimilarity measure $D$ is as in (12). The reason for doing so is twofold. First, the simulation study in Section 6.1 suggests a better empirical performance for order selection based on the $k$-pc algorithm. Second, as pointed out in [7], the $k$-pc algorithm is more suitable for the detection of groups of concomitant extremes, namely, subsets of variables that tend to be simultaneously large. The second property facilitates the comparison of the selected order $k$ with some “ground truth” from the background information of the datasets.

In each of these studies, suppose that the observed data are $(\mathbf{x}_i)=(\mathbf{x}_i=(x_{i1},\ldots,x_{id})^\top\in[0,\infty)^d,\ i\in\{1,\ldots,n\})$. We follow a conventional approach to marginally standardize a dataset, so that the assumption (7) with $\alpha=2$ is roughly met. In particular, setting $\hat{F}_j(x)=n^{-1}\sum_{i=1}^{n}\mathbf{1}\{x_{ij}<x\}$ (this choice of empirical CDF ensures $\hat{F}_j(x_{ij})<1$), $j\in\{1,\ldots,d\}$, the transformed data are given by $(\widetilde{\mathbf{x}}_i)=(\widetilde{\mathbf{x}}_i=(\widetilde{x}_{i1},\ldots,\widetilde{x}_{id})^\top\in[0,\infty)^d,\ i\in\{1,\ldots,n\})$, where $\widetilde{x}_{ij}:=[-\log\{\hat{F}_j(x_{ij})\}]^{-1/2}$; if $\hat{F}_j$ were the true CDF for the data in dimension $j$, then $\widetilde{x}_{ij}$ would follow a standard $2$-Fréchet distribution. Next, to prepare for the clustering of multivariate extremes, as in the simulation study in Section 6.1, we select the extremal subsample of $(\widetilde{\mathbf{x}}_i)$ with the $10\%$ largest $2$-norms and project the subsample onto the $2$-norm sphere; namely, we work with $\|\cdot\|_{(r)}=\|\cdot\|_{(s)}=\|\cdot\|_2$.
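The marginal standardization above can be sketched in a few lines of Python. This is an illustrative sketch, not the code used for the analyses; the function name `to_frechet2` is an assumption. It uses the strict-inequality empirical CDF, so the smallest observation in each margin has $\hat{F}_j=0$ and is mapped to $0$.

```python
import numpy as np

def to_frechet2(X):
    """Map each column of X to an approximately standard 2-Frechet scale
    via x_tilde = (-log F_hat(x)) ** (-1/2)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    out = np.empty_like(X)
    for j in range(d):
        col = X[:, j]
        # F_hat_j(x_ij) = (1/n) * #{i' : x_i'j < x_ij}; strict inequality,
        # so the value is always below 1 and the logarithm is never zero.
        F = np.searchsorted(np.sort(col), col, side="left") / n
        with np.errstate(divide="ignore"):
            # F = 0 gives -log(F) = inf, which maps to 0.
            out[:, j] = (-np.log(F)) ** (-0.5)
    return out

# Small check on a single margin with values 1, 2, 3, 4.
Z = to_frechet2(np.array([[1.0], [2.0], [3.0], [4.0]]))
```

The transform is monotone in each margin, so the ranks of the observations are preserved while the tails are placed on a common heavy-tailed scale.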

6.2.1 Air Pollution Data

The air pollution dataset is found in the R package texmex [26] and originated from the online supplementary material of [14]. It concerns air quality recordings in Leeds, U.K., specifically in the city center. The data span from 1994 to 1998 and are divided into summer and winter sets. The summer dataset comprises 578 observations, covering the months from April to July inclusive, while the winter dataset consists of 532 observations, covering the months from November to February inclusive. Each observation records the daily maximum values of five pollutants: Ozone, NO2, NO, SO2 and PM10. These datasets were also used in [20] to demonstrate the application of the spherical $k$-means clustering method to multivariate extremes.

In Figures 6 and 8, in the same manner as in Figure 1, the penalized ASW is plotted against the number of clusters, where different curves correspond to different values of the tuning parameter $t$. With the visual method described in Section 3, we identify the orders as $5$ for the summer data and $3$ for the winter data. These orders are similar to the choices of $5$ for the summer data and $4$ for the winter data made in [20] under the guidance of certain elbow plots (see [20, Figure 1]). The authors did not provide a precise explanation of their choices. From the elbow plot in [20, Figure 1], it seems that $k=3$ for the winter data is also plausible. Recall also that here we use the spherical $k$-pc algorithm of [7], while [20] used spherical $k$-means.

Furthermore, Figures 7 and 9 visualize the cluster centers computed by the $k$-pc algorithm of [7] for the two datasets, with the numbers of clusters chosen as above. Each row in either of the plots corresponds to the coordinate vector of a cluster center: a deeper shade of color indicates a higher value of the squared coordinate. Note that since we work with the $2$-norm sphere, the squared coordinates of each cluster center sum to $1$, forming a probability distribution row-wise. For the summer data in Figure 7, whose order has been chosen as $5$, the cluster centers concentrate sharply near coordinate directions, which to an extent indicates asymptotic (or extremal) independence (see, e.g., [1, Chapter 8]) of the pollutants. In contrast, for the winter data in Figure 9, whose order has been chosen as $3$, one cluster center indicates a group of concomitant extremes consisting of NO, NO2 and PM10. The asymptotic dependence between these 3 variables has been observed in [14]. This supports our order choice, which has placed these 3 variables in the same concomitant group.
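The row-wise reading of squared cluster-center coordinates can be illustrated with a short Python sketch. The two centers below are made-up vectors, not the estimated centers shown in Figures 7 and 9, and the cutoff $1/d$ used to list the dominant variables of a center is an assumption made for this illustration only; it is not a rule used in the paper.

```python
import numpy as np

pollutants = ["O3", "NO2", "NO", "SO2", "PM10"]

# Hypothetical cluster centers (rows), rescaled to the unit 2-norm sphere.
centers = np.array([[0.95, 0.20, 0.10, 0.15, 0.15],
                    [0.10, 0.60, 0.55, 0.15, 0.55]])
centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)

# Squared coordinates: each row is a probability distribution over variables.
weights = centers ** 2

# Illustrative cutoff (assumption): variables carrying more than 1/d of the mass.
cutoff = 1.0 / len(pollutants)
groups = [[p for p, w in zip(pollutants, row) if w > cutoff] for row in weights]
```

In this toy example the first center concentrates near a single coordinate direction, while the second spreads its mass over NO2, NO and PM10, mimicking the two patterns discussed above.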

Following the method introduced in Section 5.2 with $\|\cdot\|_{(s)}=\|\cdot\|_{(r)}=\|\cdot\|_2$ and $\alpha=2$, we compute the factor coefficient matrix $B$ for the two datasets; see Tables 1 and 2.

Table 1: Estimated $B^\top$ for Summer Pollution Data
Factor  O3    NO2   NO    SO2   PM10
1       0.88  0.22  0.10  0.20  0.24
2       0.20  0.33  0.20  0.90  0.32
3       0.35  0.79  0.30  0.21  0.32
4       0.15  0.16  0.16  0.19  0.80
5       0.21  0.44  0.91  0.25  0.31

Table 2: Estimated $B^\top$ for Winter Pollution Data
Factor  O3    NO2   NO    SO2   PM10
1       0.19  0.98  0.99  0.44  0.98
2       0.07  0.13  0.12  0.89  0.14
3       0.98  0.12  0.10  0.07  0.15

Fig. 6: Penalized ASW Curves for Summer Air Pollution Data
Fig. 7: Squared Cluster Center Coordinates for Summer Air Pollution Data
Fig. 8: Penalized ASW Curves for Winter Air Pollution Data
Fig. 9: Squared Cluster Center Coordinates for Winter Air Pollution Data

6.2.2 River Discharge Data

Station Name                             River Name          Factor (Cluster) Index
SALEM, OR                                WILLAMETTE RIVER    4
PORTLAND, OR                             WILLAMETTE RIVER    4
HARRISBURG, OR                           WILLAMETTE RIVER    4
BELOW SPRAGUE RIVER NEAR CHILOQUIN, OR   WILLIAMSON RIVER    2
ST.PAUL, MN                              MISSISSIPPI RIVER   1
AITKIN, MN                               MISSISSIPPI RIVER   1
THEBES, IL                               MISSISSIPPI RIVER   6
CHESTER, IL                              MISSISSIPPI RIVER   6
GREEN ISLAND, NY                         HUDSON RIVER        5
FORT EDWARD, NY                          HUDSON RIVER        5
NORTH CREEK, NY                          HUDSON RIVER        5
NEAR CARLISLE, SC                        BROAD RIVER         3
NEAR BELL, GA                            BROAD RIVER         3
Table 3: River Discharge Stations

The river discharge data concerns the daily discharge rate of rivers in North America sourced from the Global Runoff Data Centre [10]. The dataset comprises 16,386 daily records of discharge values from 13 stations spanning the period from December 1, 1976, to October 11, 2021. These 13 stations, shown in Table 3 and Figure 10, are positioned along 5 rivers in America: Willamette River, Mississippi River, Williamson River, Hudson River, and Broad River.

Fig. 10: 13 River Discharge Stations
Fig. 11: Penalized ASW Curves for River Discharge Data
Fig. 12: Squared Cluster Center Coordinates for River Discharge Data

As in the previous example, Figure 11 presents the penalized ASW curves, from which we find that $6$ seems to be an appropriate choice of order. Figure 12 illustrates the squared cluster centers obtained from the $k$-pc algorithm when the order is chosen as $6$. In Table 4, we convert the spectral estimation to the factor matrix $B$ following the method in Section 5.2 with $\|\cdot\|_{(s)}=\|\cdot\|_{(r)}=\|\cdot\|_2$ and $\alpha=2$. In addition, for each row of the matrix $B$, we find the factor index (the same as the cluster index in Figure 12) to which the largest value (in bold) corresponds. We include these factor indices in the last column of Table 3; they can be viewed roughly as markings of groups of concomitant extremes. These 6 groups are in good accordance with the geographical context: stations located along the same river are found in the same group, with the only exception being the 4 stations along the Mississippi River. The further division of these 4 stations into 2 groups may be justified by the large geographical distance between the 2 groups: one group is located in MN and the other in IL.

Factor         1     2     3     4     5     6
SALEM          0.15  0.26  0.11  0.91  0.19  0.18
PORTLAND       0.16  0.27  0.11  0.91  0.19  0.18
HARRISBURG     0.16  0.26  0.12  0.91  0.19  0.17
ST.PAUL        0.88  0.15  0.28  0.10  0.31  0.12
AITKIN         0.91  0.12  0.23  0.13  0.29  0.12
THEBES         0.28  0.15  0.88  0.11  0.31  0.16
CHESTER        0.29  0.15  0.88  0.10  0.30  0.16
BELOW_SPRAGUE  0.22  0.87  0.15  0.25  0.28  0.15
GREEN_ISLAND   0.41  0.26  0.28  0.32  0.69  0.35
FORT_EDWARD    0.31  0.12  0.18  0.16  0.89  0.17
NORTH_CREEK    0.30  0.12  0.17  0.16  0.90  0.18
NEAR_CARLISLE  0.15  0.17  0.15  0.19  0.22  0.92
NEAR_BELL      0.16  0.16  0.14  0.19  0.23  0.92
Table 4: Estimated $B$ for River Discharge Data
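The assignment of each station to the factor with the largest coefficient in its row of $B$ can be reproduced from the entries of Table 4. The following Python sketch is only an illustration of that row-wise argmax; the variable names are assumptions, and the numbers are copied from Table 4 as printed.

```python
import numpy as np

stations = ["SALEM", "PORTLAND", "HARRISBURG", "ST.PAUL", "AITKIN",
            "THEBES", "CHESTER", "BELOW_SPRAGUE", "GREEN_ISLAND",
            "FORT_EDWARD", "NORTH_CREEK", "NEAR_CARLISLE", "NEAR_BELL"]

# Rows: stations; columns: factors 1..6 (values as printed in Table 4).
B = np.array([
    [0.15, 0.26, 0.11, 0.91, 0.19, 0.18],
    [0.16, 0.27, 0.11, 0.91, 0.19, 0.18],
    [0.16, 0.26, 0.12, 0.91, 0.19, 0.17],
    [0.88, 0.15, 0.28, 0.10, 0.31, 0.12],
    [0.91, 0.12, 0.23, 0.13, 0.29, 0.12],
    [0.28, 0.15, 0.88, 0.11, 0.31, 0.16],
    [0.29, 0.15, 0.88, 0.10, 0.30, 0.16],
    [0.22, 0.87, 0.15, 0.25, 0.28, 0.15],
    [0.41, 0.26, 0.28, 0.32, 0.69, 0.35],
    [0.31, 0.12, 0.18, 0.16, 0.89, 0.17],
    [0.30, 0.12, 0.17, 0.16, 0.90, 0.18],
    [0.15, 0.17, 0.15, 0.19, 0.22, 0.92],
    [0.16, 0.16, 0.14, 0.19, 0.23, 0.92],
])

# 1-based index of the factor with the largest coefficient in each row.
factor_index = B.argmax(axis=1) + 1

# Group stations by their dominant factor.
groups = {}
for name, f in zip(stations, factor_index):
    groups.setdefault(int(f), []).append(name)
```

With the values as printed in Table 4, the labels 3 and 6 come out interchanged relative to the last column of Table 3 (the Thebes and Chester stations receive label 3 and the two Broad River stations receive label 6); the resulting partition of the 13 stations into 6 groups is the same.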

References

  • Beirlant et al. [2006] J. Beirlant, Y. Goegebeur, J. Segers, J. L. Teugels, Statistics of Extremes: Theory and Applications, John Wiley & Sons, 2006.
  • Cooley and Thibaud [2019] D. Cooley, E. Thibaud, Decompositions of dependence for high-dimensional extremes, Biometrika 106 (2019) 587–604.
  • Dhillon and Modha [2001] I. S. Dhillon, D. S. Modha, Concept decompositions for large sparse text data using clustering, Machine Learning 42 (2001) 143–175.
  • Einmahl et al. [2018] J. H. Einmahl, A. Kiriliouk, J. Segers, A continuous updating weighted least squares estimator of tail dependence in high dimensions, Extremes 21 (2018) 205–233.
  • Einmahl et al. [2012] J. H. J. Einmahl, A. Krajina, J. Segers, An $M$-estimator for tail dependence in arbitrary dimensions, The Annals of Statistics 40 (2012) 1764–1793.
  • Engelke and Ivanovs [2021] S. Engelke, J. Ivanovs, Sparse structures for multivariate extremes, Annual Review of Statistics and Its Application 8 (2021) 241–270.
  • Fomichov and Ivanovs [2023] V. Fomichov, J. Ivanovs, Spherical clustering in detection of groups of concomitant extremes, Biometrika 110 (2023) 135–153.
  • Fougères et al. [2013] A.-L. Fougères, C. Mercadier, J. P. Nolan, Dense classes of multivariate extreme value distributions, Journal of Multivariate Analysis 116 (2013) 109–129.
  • Galvin and Shore [1984] F. Galvin, S. Shore, Completeness in semimetric spaces, Pacific Journal of Mathematics 113 (1984) 67–75.
  • German Federal Institute of Hydrology [n.d.] German Federal Institute of Hydrology, Global Runoff Data Centre (GRDC) portal, https://grdc.bafg.de/GRDC/EN/Home/homepage_node.html, n.d.
  • Gissibl and Klüppelberg [2018] N. Gissibl, C. Klüppelberg, Max-linear models on directed acyclic graphs, Bernoulli 24 (2018) 2693–2720.
  • Gnecco et al. [2021] N. Gnecco, N. Meinshausen, J. Peters, S. Engelke, Causal discovery in heavy-tailed models, The Annals of Statistics 49 (2021) 1755–1778.
  • Haan and Ferreira [2006] L. de Haan, A. Ferreira, Extreme Value Theory: An Introduction, volume 3, Springer, 2006.
  • Heffernan and Tawn [2004] J. E. Heffernan, J. A. Tawn, A conditional approach for multivariate extreme values (with discussion), Journal of the Royal Statistical Society Series B: Statistical Methodology 66 (2004) 497–546.
  • Hoeffding [1963] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58 (1963) 13–30.
  • Hornik et al. [2023] K. Hornik, I. Feinerer, M. Kober, skmeans: Spherical k-Means Clustering, 2023. R package version 0.2-16.
  • Hruschka et al. [2004] E. R. Hruschka, L. N. de Castro, R. J. Campello, Evolutionary algorithms for clustering gene-expression data, in: Fourth IEEE International Conference on Data Mining (ICDM’04), IEEE, pp. 403–406.
  • Hsu and Robbins [1947] P.-L. Hsu, H. Robbins, Complete convergence and the law of large numbers, Proceedings of the National Academy of Sciences 33 (1947) 25–31.
  • Hult and Lindskog [2006] H. Hult, F. Lindskog, Regular variation for measures on metric spaces, Publications de l’Institut Mathématique 80 (2006) 121–140.
  • Janßen and Wan [2020] A. Janßen, P. Wan, k-means clustering of extremes, Electronic Journal of Statistics 14 (2020) 1211–1233.
  • Kulik and Soulier [2020] R. Kulik, P. Soulier, Heavy-Tailed Time Series, Springer, 2020.
  • Medina et al. [2021] M. A. Medina, R. A. Davis, G. Samorodnitsky, Spectral learning of multivariate extremes, arXiv preprint arXiv:2111.07799 (2021).
  • Ng et al. [2001] A. Ng, M. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, Advances in Neural Information Processing Systems 14 (2001).
  • Resnick [2007] S. I. Resnick, Heavy-Tail Phenomena: Probabilistic and Statistical Modeling, Springer Science & Business Media, 2007.
  • Rousseeuw [1987] P. J. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987) 53–65.
  • Southworth et al. [2024] H. Southworth, J. E. Heffernan, P. D. Metcalfe, texmex: Statistical modelling of extreme values, 2024. R package version 2.4.8.
  • Wilson [1931] W. A. Wilson, On semi-metric spaces, American Journal of Mathematics 53 (1931) 361–373.
  • Yuen and Stoev [2014] R. Yuen, S. Stoev, CRPS M-estimation for max-stable models, Extremes 17 (2014) 387–410.