The flexible Gumbel distribution: A new model for inference about the mode

Qingyang Liu
Department of Statistics
University of South Carolina
Columbia, SC 29201
[email protected]
Xianzheng Huang
Department of Statistics
University of South Carolina
Columbia, SC 29201
[email protected]
Haiming Zhou
Daiichi Sankyo Inc.
Basking Ridge, NJ 07920
[email protected]
(June 28, 2024)
Abstract

A new unimodal distribution family indexed by the mode and three other parameters is derived from a mixture of a Gumbel distribution for the maximum and a Gumbel distribution for the minimum. Properties of the proposed distribution are explored, including model identifiability and flexibility in capturing heavy-tailed data that exhibit different directions of skewness over a wide range. Both frequentist and Bayesian methods are developed to infer parameters in the new distribution. Simulation studies are conducted to demonstrate satisfactory performance of both methods. By fitting the proposed model to simulated data and data from an application in hydrology, it is shown that the proposed flexible distribution is especially suitable for data containing extreme values in either direction, with the mode being a location parameter of interest. Using the proposed unimodal distribution, one can easily formulate a regression model concerning the mode of a response given covariates. We apply this model to data from an application in criminology to reveal interesting data features that are obscured by outliers. Computer programs for implementing all considered inference methods in the study are available at https://github.com/rh8liuqy/flexible_Gumbel.

Keywords: extreme values · mixture distribution · modal regression · unimodal distribution

1 Introduction

The mean, median, and mode are the three most commonly used measures of central tendency of data. When data contain outliers that cause heavy tails or are potentially skewed, the mode is a more sensible representation of the central location of data than the mean or median. The timely review on mode estimation and its application by Chacón [2020] and references therein provide many examples in various fields of research where the mode serves as a more informative representative value of data. Most existing methods developed to draw inference for the mode are semi-/non-parametric in nature, starting from early works on direct estimation in the 1960s [Chernoff, 1964, Dalenius, 1965, Venter, 1967] to more recent works based on kernel density estimation [Chen, 2018] and quantile-based methods [Ota et al., 2019, Zhang et al., 2021]. Two main factors contribute to the enduring preference for semi-/non-parametric methods for mode estimation, despite their typically less straightforward implementation and lower efficiency compared to parametric counterparts. First, parametric models often impose strict constraints on the relationship between the mode and other location parameters, which may not hold in certain applications. Second, very few existing named distribution families that include both symmetric and asymmetric distributions can be parameterized so that they are indexed by the mode as the location parameter along with other parameters, such as shape or scale parameters. In this study, we address both concerns by formulating a flexible distribution indexed by the (unique) mode and parameters controlling the shape and scale.

When it comes to modeling heavy-tailed data, the Gumbel distribution [Gumbel, 1941] is arguably one of the most widely used models in many disciplines. Indeed, as a special case of the generalized extreme value distribution [Jenkinson, 1955], the Gumbel distribution for the maximum (or minimum) is well-suited for modeling extremely large (or small) events that produce heavy-tailed data. For example, it is often used in hydrology to predict extreme rainfall and flood frequency [Loaiciga and Leipnik, 1999, Koutsoyiannis, 2004, Dawley et al., 2019]. In econometrics, the Gumbel distribution plays an important role in modeling extreme movements of stock prices and large changes in interest rates [Bali, 2003, Pratiwi et al., 2019]. The Gumbel distribution is indexed by the mode and a scale parameter, and thus is convenient for mode estimation. However, the Gumbel distribution for the maximum (or minimum) is right-skewed (or left-skewed) with the skewness fixed at around $1.14$ (or $-1.14$), and the kurtosis fixed at 5.4 across the entire distribution family. Thus it may be too rigid for scenarios where the direction and extremeness of outliers present in data are initially unclear, or when the direction and level of skewness are unknown beforehand. Constructions of more flexible distributions that overcome these limitations have been proposed. In particular, Cooray [2010] applied a logarithmic transformation to a random variable following the odd Weibull distribution to obtain the so-called generalized Gumbel distribution that includes the Gumbel distribution as a subfamily. However, the mode of the generalized Gumbel distribution is neither a location parameter indexing the family nor an explicit function of the other model parameters. Shin et al. [2015] considered mixture distributions with one of the components being the Gumbel distribution and the other component(s) being Gumbel of the same skewness direction or a different distribution, such as the gamma distribution. Besides sharing the drawback pointed out for the generalized Gumbel distribution, their construction of mixtures makes it difficult to formulate a unimodal distribution, and thus their proposed models are unsuitable when unimodality is required to make inferring the mode meaningful, such as in a regression setting, as in modal regression [Yao et al., 2012, Yao and Li, 2013, Chen, 2018].

With heavy-tailed data in mind and the mode as the location parameter of interest, we construct a new unimodal distribution that does not impose stringent constraints on how the mode relates to other central tendency measures, while allowing a range of kurtosis wide enough to capture heavy tails in either direction, as well as different degrees and directions of skewness. This new distribution, called the flexible Gumbel (FG) distribution, is presented in Section 2, where we study properties of the distribution and discuss identifiability of the model. We present a frequentist method and a Bayesian method for estimating parameters in the FG distribution in Section 3. Finite sample performance of these methods is inspected in a simulation study in Section 4, followed by an application of the FG distribution in hydrology in Section 5. Section 6 demonstrates fitting a modal regression model based on the FG distribution to data from a criminology study. Section 7 highlights the contributions of our study and outlines future research directions.

2 The flexible Gumbel distribution

The probability density function (pdf) of the Gumbel distribution for the maximum is given by

$$f(x;\theta,\sigma)=\frac{1}{\sigma}\exp\left\{-\frac{x-\theta}{\sigma}-\exp\left(-\frac{x-\theta}{\sigma}\right)\right\}, \tag{1}$$

where $\theta$ is the mode and $\sigma>0$ is a scale parameter. The pdf of the Gumbel distribution for the minimum with mode $\theta$ and scale parameter $\sigma$ is given by

$$f(x;\theta,\sigma)=\frac{1}{\sigma}\exp\left\{\frac{x-\theta}{\sigma}-\exp\left(\frac{x-\theta}{\sigma}\right)\right\}. \tag{2}$$

We define a unimodal distribution for a random variable $Y$ via a mixture of the two Gumbel distributions specified by (1) and (2) that share the same mode $\theta$ while allowing different scale parameters, $\sigma_1$ and $\sigma_2$, in the two components. We call the resultant distribution the flexible Gumbel distribution, FG for short, with the pdf given by

$$
\begin{aligned}
f(y)={}& w\times\frac{1}{\sigma_1}\exp\left\{-\frac{y-\theta}{\sigma_1}-\exp\left(-\frac{y-\theta}{\sigma_1}\right)\right\}\\
&+(1-w)\times\frac{1}{\sigma_2}\exp\left\{\frac{y-\theta}{\sigma_2}-\exp\left(\frac{y-\theta}{\sigma_2}\right)\right\},
\end{aligned}
\tag{3}
$$

where $w\in[0,1]$ is the mixing proportion parameter. Henceforth, we state that $Y\sim\mbox{FG}(\theta,\sigma_1,\sigma_2,w)$ if $Y$ follows the distribution specified by the pdf in (3).
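As a quick numerical illustration of (3), the following Python sketch evaluates the FG density using only the standard library. The function names (`gumbel_max_pdf`, `gumbel_min_pdf`, `fg_pdf`), the overflow guard, and the parameter values are illustrative assumptions of this sketch and are not taken from the authors' programs.

```python
import math

def _exp_kernel(z):
    """Return exp(-z - exp(-z)); the guard on exp(-z) is an implementation assumption."""
    if z < -700.0:
        return 0.0
    return math.exp(-z - math.exp(-z))

def gumbel_max_pdf(y, theta, sigma):
    """Pdf (1): Gumbel distribution for the maximum with mode theta."""
    return _exp_kernel((y - theta) / sigma) / sigma

def gumbel_min_pdf(y, theta, sigma):
    """Pdf (2): Gumbel distribution for the minimum with mode theta."""
    return _exp_kernel(-(y - theta) / sigma) / sigma

def fg_pdf(y, theta, sigma1, sigma2, w):
    """Pdf (3): weight w on the Gumbel-max component, 1 - w on the Gumbel-min one."""
    return (w * gumbel_max_pdf(y, theta, sigma1)
            + (1.0 - w) * gumbel_min_pdf(y, theta, sigma2))

# Hypothetical parameter values, chosen only for illustration.
theta, sigma1, sigma2, w = 2.0, 1.5, 0.7, 0.3
h = 0.01
grid = [theta - 40.0 + h * i for i in range(8001)]
dens = [fg_pdf(y, theta, sigma1, sigma2, w) for y in grid]

# Trapezoidal rule: the density should integrate to (approximately) one.
area = h * (sum(dens) - 0.5 * (dens[0] + dens[-1]))
# The grid point with the largest density should sit at the common mode theta.
mode_on_grid = grid[dens.index(max(dens))]
print(area, mode_on_grid)
```

Both checks rely on the fact that each component attains its maximum at $\theta$, so the mixture does as well.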

For each component distribution of FG, the mean and median are both some simple shift of the mode, with each shift solely determined by the scale parameter. Because the two components in (3) share a common mode $\theta$, the mode of $Y$ is also $\theta$, and thus the FG distribution is convenient to use when one aims to infer the mode as a central tendency measure, or to formulate parametric modal regression models [Bourguignon et al., 2020, Zhou and Huang, 2020, 2022]. One can easily show that the mean of $Y$ is $E(Y)=w(\theta+\sigma_1\gamma)+(1-w)(\theta-\sigma_2\gamma)=\theta+\{w(\sigma_1+\sigma_2)-\sigma_2\}\gamma$, where $\gamma\approx 0.5772$ is the Euler-Mascheroni constant. Thus the discrepancy between the mode and the mean of FG depends on three other parameters that control the scale and shape of the distribution. The median of $Y$, denoted by $m$, is the solution to the following equation,

$$w\exp\left\{-\exp\left(-\frac{m-\theta}{\sigma_1}\right)\right\}+(1-w)\left[1-\exp\left\{-\exp\left(\frac{m-\theta}{\sigma_2}\right)\right\}\right]=0.5.$$

Even though this equation cannot be solved for $m$ explicitly to reveal the median in closed form, it is clear that $m-\theta$ also depends on all three other parameters of FG. In conclusion, the relationships between the three central tendency measures of FG are more versatile than those under a Gumbel distribution for the maximum or a Gumbel distribution for the minimum.
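Since the median has no closed form, it can be found numerically. The sketch below computes the mean from the formula above and solves the median equation by bisection on the FG cumulative distribution function. The function names, the bracketing strategy, and the overflow guard are assumptions of this illustration, not part of the original study.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def _exp_neg_exp(t):
    """Return exp(-exp(t)), guarding exp(t) against overflow for large t."""
    return 0.0 if t > 700.0 else math.exp(-math.exp(t))

def fg_cdf(y, theta, sigma1, sigma2, w):
    """Cdf of FG: the left-hand side of the median equation evaluated at y."""
    z1 = (y - theta) / sigma1
    z2 = (y - theta) / sigma2
    return w * _exp_neg_exp(-z1) + (1.0 - w) * (1.0 - _exp_neg_exp(z2))

def fg_mean(theta, sigma1, sigma2, w):
    """E(Y) = theta + {w (sigma1 + sigma2) - sigma2} gamma."""
    return theta + (w * (sigma1 + sigma2) - sigma2) * EULER_GAMMA

def fg_median(theta, sigma1, sigma2, w):
    """Solve fg_cdf(m) = 0.5 by bisection (the cdf is increasing in y)."""
    step = 1.0
    while fg_cdf(theta - step, theta, sigma1, sigma2, w) > 0.5:
        step *= 2.0
    lo = theta - step
    step = 1.0
    while fg_cdf(theta + step, theta, sigma1, sigma2, w) < 0.5:
        step *= 2.0
    hi = theta + step
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if fg_cdf(mid, theta, sigma1, sigma2, w) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical parameter values: mode, mean, and median generally all differ.
print(fg_mean(2.0, 1.5, 0.7, 0.3), fg_median(2.0, 1.5, 0.7, 0.3))
```

With $w=1$ the routine should recover the familiar Gumbel-max median $\theta-\sigma_1\log(\log 2)$, which offers a simple sanity check.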

The variance of $Y$ is $V(Y)=\{w\sigma_1^2+(1-w)\sigma_2^2\}\pi^2/6+w(1-w)(\sigma_1+\sigma_2)^2\gamma^2$, which does not depend on the mode parameter $\theta$. Obviously, by setting $w=0$ or 1, $\mbox{FG}(\theta,\sigma_1,\sigma_2,w)$ reduces to one of the Gumbel components. Unlike a Gumbel distribution that only has one direction of skewness at a fixed level (of $\pm 1.14$), an FG distribution can be left-skewed, right-skewed, or symmetric. More specifically, with the mode fixed at zero when studying the skewness and kurtosis of FG, one can show (as outlined in Appendix A) that the third central moment of $Y$ is given by

$$w\bar{w}(\sigma_1+\sigma_2)^2\gamma\left\{\gamma^2(\bar{w}-w)(\sigma_1+\sigma_2)+0.5\pi^2(\sigma_1-\sigma_2)\right\}+2\zeta(3)\left(w\sigma_1^3-\bar{w}\sigma_2^3\right), \tag{4}$$

where $\bar{w}=1-w$, and $\zeta(3)\approx 1.202$ is Apéry's constant. Although the direction of skewness is not immediately clear from (4), one may consider a special case with $w=0.5$ where (4) reduces to $(\sigma_1-\sigma_2)\{\gamma\pi^2(\sigma_1+\sigma_2)^2/8+\zeta(3)(\sigma_1^2+\sigma_1\sigma_2+\sigma_2^2)\}$. Now one can see that $\mbox{FG}(\theta,\sigma_1,\sigma_2,0.5)$ is symmetric if and only if $\sigma_1=\sigma_2$, and it is left-skewed (or right-skewed) when $\sigma_1$ is less (or greater) than $\sigma_2$. The kurtosis of $Y$ can also be derived straightforwardly, but its expression is lengthier than (4) and we omit it here; it sheds little light on the magnitude of the kurtosis except that it varies with the scale parameters and the mixing proportion, instead of being fixed at 5.4 as for a Gumbel distribution. An R Shiny app depicting the pdf of $\mbox{FG}(\theta,\sigma_1,\sigma_2,w)$ with user-specified parameter values is available at https://qingyang.shinyapps.io/gumbel_mixture/, created and maintained by the first author. Along with the density function curve, the Shiny app provides the skewness and kurtosis of the depicted FG density. From there one can see that the skewness can be much lower than $-1.14$ or higher than $1.14$, and the kurtosis can be much higher than 5.4, suggesting that inference based on FG can be more robust to outliers than when a Gumbel distribution is assumed for data at hand, without imposing stringent assumptions on the skewness of the underlying distribution.
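The variance formula and the third central moment in (4) are enough to compute the skewness of FG directly. The following Python sketch does so; the function names and the example parameter values are illustrative assumptions, while the constants and formulas are those stated in the text.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
ZETA3 = 1.2020569031595943        # Apery's constant, zeta(3)

def fg_variance(sigma1, sigma2, w):
    """V(Y) = {w s1^2 + (1-w) s2^2} pi^2/6 + w (1-w) (s1+s2)^2 gamma^2."""
    return ((w * sigma1**2 + (1.0 - w) * sigma2**2) * math.pi**2 / 6.0
            + w * (1.0 - w) * (sigma1 + sigma2)**2 * EULER_GAMMA**2)

def fg_third_central_moment(sigma1, sigma2, w):
    """Third central moment of Y as given in (4)."""
    wb = 1.0 - w
    s = sigma1 + sigma2
    return (w * wb * s**2 * EULER_GAMMA
            * (EULER_GAMMA**2 * (wb - w) * s
               + 0.5 * math.pi**2 * (sigma1 - sigma2))
            + 2.0 * ZETA3 * (w * sigma1**3 - wb * sigma2**3))

def fg_skewness(sigma1, sigma2, w):
    """Skewness = third central moment / variance^(3/2); free of theta."""
    return (fg_third_central_moment(sigma1, sigma2, w)
            / fg_variance(sigma1, sigma2, w) ** 1.5)

# w = 1 and w = 0 recover the two Gumbel components; other values are hypothetical.
for s1, s2, w in [(1.0, 1.0, 1.0), (1.0, 1.0, 0.0), (1.0, 1.0, 0.5),
                  (1.0, 2.0, 0.5), (10.0, 1.0, 0.3)]:
    print(s1, s2, w, fg_skewness(s1, s2, w))
```

The boundary cases $w=1$ and $w=0$ should reproduce the fixed Gumbel skewness of about $\pm 1.14$, while unequal scales with an interior $w$ move the skewness away from those values.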

The flexibility of a mixture distribution usually comes with concerns relating to identifiability [Teicher, 1961, 1963, Yakowitz and Spragins, 1968]. In particular, there is the notorious issue of label switching when fitting a finite mixture model [Redner and Walker, 1984]. Take the family of two-component normal mixture (NM) distributions as an example, defined by $\{\mbox{NM}(\mu_1,\sigma_1,\mu_2,\sigma_2,w):\,w\mathcal{N}(\mu_1,\sigma_1^2)+(1-w)\mathcal{N}(\mu_2,\sigma_2^2),\mbox{ for }\sigma_1,\sigma_2>0\mbox{ and }w\in[0,1]\}$. When fitting a data set assuming a normal mixture distribution, one cannot distinguish between, for instance, $\mbox{NM}(1,2,3,4,0.2)$ and $\mbox{NM}(3,4,1,2,0.8)$, since the likelihood of the data is identical under these two mixture distributions. As another example, for data from a normal distribution, a two-component normal mixture with two identical normal components and an arbitrary mixing proportion $w\in[0,1]$ leads to the same likelihood, and thus $w$ cannot be identified. Teicher [1963] showed that imposing a lexicographical order on the normal components resolves the issue of non-identifiability, which also excludes mixtures with two identical components in the above normal mixture family. Unlike normal mixtures, of which all components are in the same family of normal distributions, the FG distribution results from mixing two components from different families, i.e., a Gumbel distribution for the maximum and a Gumbel distribution for the minimum, with weight $w$ on the former component. By construction, FG does not have the label-switching issue. Moreover, we show in Appendix B, by invoking Theorem 1 in Teicher [1963], that the so-constructed mixture distribution is always identifiable even when the true distribution is a (one-component) Gumbel distribution.
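The label-switching example can be checked numerically. The sketch below compares the two normal mixtures mentioned in the text on a grid, and contrasts this with the analogous swap of scales and weights in FG, which yields a different density because the two FG components belong to different families. The function names, the grid, and the FG parameter values are assumptions made for illustration.

```python
import math

def normal_pdf(y, mu, sigma):
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def nm_pdf(y, mu1, s1, mu2, s2, w):
    """Two-component normal mixture NM(mu1, s1, mu2, s2, w)."""
    return w * normal_pdf(y, mu1, s1) + (1.0 - w) * normal_pdf(y, mu2, s2)

def fg_pdf(y, theta, s1, s2, w):
    """FG pdf (3); no overflow guard is needed on the moderate grid used here."""
    z1, z2 = (y - theta) / s1, (y - theta) / s2
    return (w * math.exp(-z1 - math.exp(-z1)) / s1
            + (1.0 - w) * math.exp(z2 - math.exp(z2)) / s2)

grid = [-10.0 + 0.5 * i for i in range(41)]

# Swapping component labels in NM leaves the density unchanged.
nm_gap = max(abs(nm_pdf(y, 1, 2, 3, 4, 0.2) - nm_pdf(y, 3, 4, 1, 2, 0.8))
             for y in grid)

# The analogous swap in FG (hypothetical values) changes the density.
fg_gap = max(abs(fg_pdf(y, 0, 1, 2, 0.2) - fg_pdf(y, 0, 2, 1, 0.8))
             for y in grid)

print(nm_gap, fg_gap)
```

A numerical comparison of this kind only illustrates the point; the formal identifiability argument is the one given in Appendix B.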

3 Statistical inference

3.1 Frequentist inference method

Based on a random sample of size $n$ from the FG distribution, $\mathbf{y}=\{y_i\}_{i=1}^n$, maximum likelihood estimators (MLEs) of all model parameters in $\boldsymbol{\Omega}=(\theta,\sigma_1,\sigma_2,w)$ can be obtained via the expectation-maximization (EM) algorithm [Dempster et al., 1977]. To apply the EM algorithm, we introduce a latent variable $Z$ that follows Bernoulli($w$) such that the joint likelihood of $(Y,Z)$ is

$$f_{Y,Z}(y,z)=\{wf_1(y;\theta,\sigma_1)\}^{z}\{(1-w)f_2(y;\theta,\sigma_2)\}^{1-z}, \tag{5}$$

where $f_1(y;\theta,\sigma_1)$ is the pdf in (1) evaluated at $y$ with the scale parameter $\sigma=\sigma_1$, and $f_2(y;\theta,\sigma_2)$ is the pdf in (2) evaluated at $y$ with the scale parameter $\sigma=\sigma_2$. A random sample of size $n$ from Bernoulli($w$), $\mathbf{z}=\{z_i\}_{i=1}^n$, is viewed as missing data, and $\{(y_i,z_i)\}_{i=1}^n$ are viewed as the complete data in the EM algorithm. It can be shown [Boos and Stefanski, 2013, Section 2.6.3a] that summing (5) over $z\in\{0,1\}$ indeed gives the density of $Y$ in (3). The log-likelihood based on the density in (3) is usually not well-behaved as an objective function to be maximized with respect to $\boldsymbol{\Omega}$. By considering the complete-data log-likelihood based on (5), one can often maximize an objective function that is better-behaved, as we demonstrate next. More specifically, the complete-data log-likelihood is

$$\ell(\boldsymbol{\Omega};\mathbf{y},\mathbf{z})=\sum_{i=1}^n\left\{z_i\log(wf_1(y_i;\theta,\sigma_1))+(1-z_i)\log((1-w)f_2(y_i;\theta,\sigma_2))\right\}. \tag{6}$$
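To make the complete-data formulation concrete, the following sketch simulates pairs $(y_i,z_i)$ by first drawing $z_i$ from Bernoulli($w$) and then drawing $y_i$ from the matching Gumbel component by inversion, and it evaluates the complete-data log-likelihood (6). The function names, the seed, the sample size, and the parameter values are assumptions chosen for illustration only.

```python
import math
import random

def log_f1(y, theta, sigma):
    """Log of pdf (1); the exponent is capped to avoid overflow (our assumption)."""
    z = (y - theta) / sigma
    return -math.log(sigma) - z - math.exp(min(-z, 700.0))

def log_f2(y, theta, sigma):
    """Log of pdf (2); the exponent is capped to avoid overflow (our assumption)."""
    z = (y - theta) / sigma
    return -math.log(sigma) + z - math.exp(min(z, 700.0))

def simulate_complete_data(n, theta, sigma1, sigma2, w, rng):
    """Draw z_i ~ Bernoulli(w), then y_i from the corresponding component."""
    ys, zs = [], []
    for _ in range(n):
        z = 1 if rng.random() < w else 0
        u = max(rng.random(), 1e-300)   # keep u strictly inside (0, 1)
        e = math.log(-math.log(u))      # log of a standard exponential draw
        # theta - sigma1*e is Gumbel-max; theta + sigma2*e is Gumbel-min.
        ys.append(theta - sigma1 * e if z == 1 else theta + sigma2 * e)
        zs.append(z)
    return ys, zs

def complete_data_loglik(ys, zs, theta, sigma1, sigma2, w):
    """Complete-data log-likelihood (6), for 0 < w < 1."""
    total = 0.0
    for y, z in zip(ys, zs):
        if z == 1:
            total += math.log(w) + log_f1(y, theta, sigma1)
        else:
            total += math.log(1.0 - w) + log_f2(y, theta, sigma2)
    return total

rng = random.Random(2024)
theta, sigma1, sigma2, w = 1.0, 1.0, 2.0, 0.4   # hypothetical values
ys, zs = simulate_complete_data(20000, theta, sigma1, sigma2, w, rng)
sample_mean = sum(ys) / len(ys)
print(sample_mean, sum(zs) / len(zs),
      complete_data_loglik(ys, zs, theta, sigma1, sigma2, w))
```

In practice the $z_i$ are unobserved, which is why the EM algorithm works with the conditional expectation of (6) rather than with (6) itself.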

Starting from an initial estimate of $\boldsymbol{\Omega}$ (at the zero-th iteration), denoted by $\boldsymbol{\Omega}^{(0)}$, one iterates two steps referred to as the E-step and the M-step until a convergence criterion is met. In the E-step at the $(t+1)$-th iteration, one computes the conditional expectation of (6) given $\mathbf{y}$ while assuming the true parameter value to be $\boldsymbol{\Omega}^{(t)}=(\theta^{(t)},\sigma_1^{(t)},\sigma_2^{(t)},w^{(t)})$, that is, $\operatorname{E}_{\boldsymbol{\Omega}^{(t)}}\{\ell(\boldsymbol{\Omega};\mathbf{y},\mathbf{z})\mid\mathbf{y}\}$. This conditional expectation can be shown to be

$$Q\left(\boldsymbol{\Omega}\,\middle|\,\boldsymbol{\Omega}^{(t)}\right)=\sum_{i=1}^n\left\{T_i^{(t)}\log(wf_1(y_i;\theta,\sigma_1))+\left(1-T_i^{(t)}\right)\log((1-w)f_2(y_i;\theta,\sigma_2))\right\}, \tag{7}$$

where

$$T_i^{(t)}=\operatorname{E}_{\boldsymbol{\Omega}^{(t)}}(Z\mid Y=y_i)=\frac{w^{(t)}f_1(y_i;\theta^{(t)},\sigma_1^{(t)})}{w^{(t)}f_1(y_i;\theta^{(t)},\sigma_1^{(t)})+(1-w^{(t)})f_2(y_i;\theta^{(t)},\sigma_2^{(t)})}. \tag{8}$$
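The weights in (8) can be computed stably on the log scale, which avoids a 0/0 ratio when both component densities underflow. The sketch below is one way to do this under the assumption $0<w<1$; the function names and the toy data are illustrative and not taken from the authors' code.

```python
import math

def log_f1(y, theta, sigma):
    """Log of pdf (1); the exponent is capped to avoid overflow (our assumption)."""
    z = (y - theta) / sigma
    return -math.log(sigma) - z - math.exp(min(-z, 700.0))

def log_f2(y, theta, sigma):
    """Log of pdf (2); the exponent is capped to avoid overflow (our assumption)."""
    z = (y - theta) / sigma
    return -math.log(sigma) + z - math.exp(min(z, 700.0))

def e_step(ys, theta, sigma1, sigma2, w):
    """Return T_i = E(Z | Y = y_i) from (8), assuming 0 < w < 1."""
    weights = []
    for y in ys:
        # d = log{(1-w) f2} - log{w f1}, so that T_i = 1 / (1 + exp(d)).
        d = (math.log(1.0 - w) + log_f2(y, theta, sigma2)
             - math.log(w) - log_f1(y, theta, sigma1))
        weights.append(0.0 if d > 700.0 else 1.0 / (1.0 + math.exp(d)))
    return weights

# Toy data, symmetric about the mode, with hypothetical current estimates.
ys = [-6.0, -1.0, 0.0, 1.0, 6.0]
T = e_step(ys, 0.0, 1.0, 1.0, 0.5)
w_next = sum(T) / len(T)   # the closed-form update of w
print(T, w_next)
```

Observations far in the right tail receive weights near one (they are attributed to the Gumbel-max component), and those far in the left tail receive weights near zero.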

In the M-step at the $(t+1)$-th iteration, one maximizes $Q(\boldsymbol{\Omega}|\boldsymbol{\Omega}^{(t)})$ with respect to $\boldsymbol{\Omega}$ to obtain an updated estimate $\boldsymbol{\Omega}^{(t+1)}=(\theta^{(t+1)},\sigma_1^{(t+1)},\sigma_2^{(t+1)},w^{(t+1)})$, in which $w^{(t+1)}=\sum_{i=1}^n T_i^{(t)}/n$, and the other three updated estimates in $\boldsymbol{\Omega}^{(t+1)}$ are obtained numerically.

The EM algorithm avoids directly maximizing the log-likelihood based on (3) by (iteratively) maximizing the better-behaved $Q(\boldsymbol{\Omega}|\boldsymbol{\Omega}^{(t)})$ in (7). To further improve the numerical efficiency, we exploit the expectation-conditional maximization (ECM) algorithm [Meng and Rubin, 1993], which replaces the M-step with a sequence of simpler conditional maximizations referred to as the CM-step. Essentially, within each M-step, we update $w$ via $w^{(t+1)}=\sum_{i=1}^{n}T_i^{(t)}/n$; then we update $\theta$ using $w^{(t+1)}$ along with $(\sigma_1^{(t)},\sigma_2^{(t)})$, followed by updating $\sigma_1$ using $w^{(t+1)}$, the recently updated $\theta$, and $\sigma_2^{(t)}$; lastly, we update $\sigma_2$ using $w^{(t+1)}$ and the recently updated $\theta$ and $\sigma_1$. There is no closed-form updating formula for the latter three updates, but each of them is now a one-dimensional optimization problem that can be handled by standard algorithms, such as the Newton-Raphson method. To guard against convergence to a local rather than the global maximum, as recommended by Wu [1983], one should run the ECM algorithm several times with different starting values $\boldsymbol{\Omega}^{(0)}$.
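The ECM cycle just described can be sketched as follows. The sketch makes several assumptions that are not part of the paper: $f_1$ is taken to be the Gumbel density for the maximum and $f_2$ that for the minimum, a golden-section search stands in for Newton-Raphson in the one-dimensional CM-steps, the search intervals are tied to the data range, and a candidate is accepted only if it does not decrease the relevant part of $Q$. All function names are hypothetical.

```python
import math
import random

def _exp(x):
    return math.exp(min(x, 700.0))  # clamp to avoid OverflowError

def log_f1(y, theta, s):  # Gumbel (maximum) log-density; assumed role of f1
    u = (y - theta) / s
    return -math.log(s) - u - _exp(-u)

def log_f2(y, theta, s):  # Gumbel (minimum) log-density; assumed role of f2
    u = (y - theta) / s
    return -math.log(s) + u - _exp(u)

def _parts(v, th, s1, s2, w):
    return math.log(w) + log_f1(v, th, s1), math.log(1 - w) + log_f2(v, th, s2)

def loglik(y, th, s1, s2, w):
    total = 0.0
    for v in y:
        a, b = _parts(v, th, s1, s2, w)
        m = max(a, b)
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total

def golden_max(g, lo, hi, tol=1e-6):
    """Golden-section search for a maximizer of g on [lo, hi]."""
    r = (math.sqrt(5) - 1) / 2
    c, d = hi - r * (hi - lo), lo + r * (hi - lo)
    gc, gd = g(c), g(d)
    while hi - lo > tol:
        if gc > gd:
            hi, d, gd = d, c, gc
            c = hi - r * (hi - lo)
            gc = g(c)
        else:
            lo, c, gc = c, d, gd
            d = lo + r * (hi - lo)
            gd = g(d)
    return (lo + hi) / 2

def cm_update(g, current, lo, hi):
    cand = golden_max(g, lo, hi)
    return cand if g(cand) >= g(current) else current  # never decrease Q

def ecm(y, th, s1, s2, w, max_iter=200, tol=1e-8):
    lo, hi = min(y), max(y)
    span = hi - lo
    path = [loglik(y, th, s1, s2, w)]
    for _ in range(max_iter):
        # E-step: T_i of equation (8) at the current values
        T = []
        for v in y:
            a, b = _parts(v, th, s1, s2, w)
            T.append(1.0 / (1.0 + _exp(b - a)))
        # CM-step for w (closed form, clamped away from 0 and 1)
        w = min(max(sum(T) / len(T), 1e-10), 1 - 1e-10)
        # CM-step for theta given the old scales
        th = cm_update(lambda t: sum(Ti * log_f1(v, t, s1) + (1 - Ti) * log_f2(v, t, s2)
                                     for Ti, v in zip(T, y)), th, lo, hi)
        # CM-steps for sigma1, then sigma2, each using the updated theta
        s1 = cm_update(lambda s: sum(Ti * log_f1(v, th, s) for Ti, v in zip(T, y)),
                       s1, 1e-3 * span, span)
        s2 = cm_update(lambda s: sum((1 - Ti) * log_f2(v, th, s) for Ti, v in zip(T, y)),
                       s2, 1e-3 * span, span)
        path.append(loglik(y, th, s1, s2, w))
        if path[-1] - path[-2] < tol:
            break
    return (th, s1, s2, w), path

def rfg(theta, sigma1, sigma2, w):
    """Draw one FG variate by inverse-CDF sampling (for demonstration only)."""
    e = math.log(-math.log(1.0 - random.random() * 0.999999))
    return theta - sigma1 * e if random.random() < w else theta + sigma2 * e
```

Because each CM-step is accepted only when it does not lower $Q$, the observed-data log-likelihood stored in `path` should be non-decreasing, which offers a simple sanity check. In practice one would call `ecm` from several starting values and keep the run with the largest final log-likelihood.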

After obtaining the MLE of $\boldsymbol{\Omega}$, denoted by $\hat{\boldsymbol{\Omega}}$, we propose to use the sandwich variance estimator [Boos and Stefanski, 2013, Chapter 7] to estimate the variance-covariance matrix of $\hat{\boldsymbol{\Omega}}$. One may also estimate the variance-covariance matrix of $\hat{\boldsymbol{\Omega}}$ based on the observed information matrix as described in Louis [1982] and Oakes [1999]. The benefit of using the sandwich variance estimator is its robustness to model misspecification. Finally, the EM and ECM algorithms bear a strong resemblance to data augmentation [Wei and Tanner, 1990] in the Bayesian framework, which we turn to next for inferring $\boldsymbol{\Omega}$.
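For reference, the generic sandwich form for a likelihood-based M-estimator can be written as below. The symbols $A_n$, $B_n$, and $\ell_i(\boldsymbol{\Omega})=\log f(y_i;\boldsymbol{\Omega})$ are notation introduced here for illustration and are not taken from the paper.

```latex
\[
\widehat{\operatorname{Var}}(\hat{\boldsymbol{\Omega}})
  = \frac{1}{n}\, A_n^{-1} B_n \{A_n^{-1}\}^{\top},
\qquad
A_n = -\frac{1}{n}\sum_{i=1}^{n}
      \left.\frac{\partial^2 \ell_i(\boldsymbol{\Omega})}
                 {\partial\boldsymbol{\Omega}\,\partial\boldsymbol{\Omega}^{\top}}
      \right|_{\boldsymbol{\Omega}=\hat{\boldsymbol{\Omega}}},
\qquad
B_n = \frac{1}{n}\sum_{i=1}^{n}
      \left.\frac{\partial \ell_i(\boldsymbol{\Omega})}{\partial\boldsymbol{\Omega}}
            \frac{\partial \ell_i(\boldsymbol{\Omega})}{\partial\boldsymbol{\Omega}^{\top}}
      \right|_{\boldsymbol{\Omega}=\hat{\boldsymbol{\Omega}}}.
\]
```

When the model is correctly specified, $A_n$ and $B_n$ estimate the same information matrix, so the expression reduces approximately to the usual inverse-information estimator $A_n^{-1}/n$.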

3.2 Bayesian inference method

In the Bayesian framework, we formulate hierarchical models starting with the FG distribution,

\[
Y \mid \theta,\sigma_1,\sigma_2,w \sim \text{FG}(\theta,\sigma_1,\sigma_2,w),
\]

followed by independent weakly informative or non-informative priors for elements in $\boldsymbol{\Omega}$,

\begin{align*}
\theta &\sim \mathcal{N}(0, 10^{4}),\\
\sigma_j &\sim \text{inv-Gamma}(1,1), \text{ for } j=1,2,\\
w &\sim \text{Uniform}(0,1),
\end{align*}

where inv-Gamma refers to the inverse Gamma distribution. We choose the above prior for the scale parameters by following the prior selection for variance parameters suggested in Gelman [2006].

We employ the Metropolis-within-Gibbs sampler [Müller, 1991, 1993] to obtain an estimate of $\boldsymbol{\Omega}$ from the posterior distribution of $\boldsymbol{\Omega}$ given observed data $\mathbf{y}$. Similar to the EM/ECM algorithm in Section 3.1, the latent variable $Z$ is also introduced as a device to carry out data augmentation. The iterative algorithm presented next is based on the following two conditional distributions that can be easily proved,

\begin{align*}
z_i \mid \theta,\sigma_1,\sigma_2,w,\mathbf{z}_{-i},\mathbf{y} &\sim \text{Bernoulli}\left(\frac{w f_1(y_i;\theta,\sigma_1)}{w f_1(y_i;\theta,\sigma_1)+(1-w) f_2(y_i;\theta,\sigma_2)}\right),\\
w \mid \theta,\sigma_1,\sigma_2,\mathbf{z},\mathbf{y} &\sim \text{Beta}\left(1+\sum_{i=1}^{n} z_i,\, n+1-\sum_{i=1}^{n} z_i\right),
\end{align*}

where $\mathbf{z}_{-i}$ results from dropping $z_i$ from $\mathbf{z}$, and the first result above is also the one from which (8) is deduced.
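A short derivation of the second conditional may be helpful. It assumes, as in the data-augmentation setup of Section 3.1, that $z_i=1$ indicates membership in the first component with $P(z_i=1\mid w)=w$, so that $w$ enters the complete-data likelihood only through the $z_i$; combining this with the Uniform$(0,1)$ prior gives

```latex
\[
p(w \mid \theta,\sigma_1,\sigma_2,\mathbf{z},\mathbf{y})
  \;\propto\; p(w)\prod_{i=1}^{n} w^{z_i}(1-w)^{1-z_i}
  \;=\; w^{\sum_{i=1}^{n} z_i}\,(1-w)^{\,n-\sum_{i=1}^{n} z_i},
  \qquad 0<w<1,
\]
```

which is the kernel of a $\text{Beta}\left(1+\sum_{i=1}^{n} z_i,\; n+1-\sum_{i=1}^{n} z_i\right)$ density. The first conditional follows in the same way from Bayes' rule applied to $P(z_i=1\mid w)=w$ and the component densities $f_1$ and $f_2$.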

The Metropolis-within-Gibbs sampler at the $(t+1)$-th iteration involves four steps outlined below.

  • Step 1: For $i=1,\dots,n$, draw $z_i^{(t+1)}$ from $\text{Bernoulli}(T_i^{(t)})$, where $T_i^{(t)}$ is given in (8).

  • Step 2: Draw $w^{(t+1)}$ from $\text{Beta}\left(1+\sum_{i=1}^{n} z_i^{(t+1)},\, n+1-\sum_{i=1}^{n} z_i^{(t+1)}\right)$.

  • Step 3: Draw $\tilde{\theta}$ from $\mathcal{N}(\theta^{(t)},\tau_0)$, and update $\theta^{(t)}$ to $\theta^{(t+1)}$ according to the following decision rule,

    \[
    \theta^{(t+1)}=\begin{cases}
    \tilde{\theta}, & \text{with probability } q=\min\left\{\dfrac{p(\tilde{\theta}\mid w^{(t+1)},\sigma_1^{(t)},\sigma_2^{(t)},\mathbf{y})}{p(\theta^{(t)}\mid w^{(t+1)},\sigma_1^{(t)},\sigma_2^{(t)},\mathbf{y})},\,1\right\},\\
    \theta^{(t)}, & \text{with probability } 1-q.
    \end{cases}
    \]
  • Step 4: For $j=1,2$, draw $\tilde{\sigma}_j$ from $\mathcal{N}(\sigma_j^{(t)},\tau_j)$, and update $\sigma_j^{(t)}$ to $\sigma_j^{(t+1)}$ according to the following decision rule, for $k\neq j$,

    \[
    \sigma_j^{(t+1)}=\begin{cases}
    \tilde{\sigma}_j, & \text{with probability } q=\min\left\{\dfrac{p(\tilde{\sigma}_j\mid \theta^{(t+1)},\sigma_k^{(t)},w^{(t+1)},\mathbf{y})}{p(\sigma_j^{(t)}\mid \theta^{(t+1)},\sigma_k^{(t)},w^{(t+1)},\mathbf{y})},\,1\right\},\\
    \sigma_j^{(t)}, & \text{with probability } 1-q.
    \end{cases}
    \]

In Steps 3 and 4, $p(\cdot|\cdot)$ refers to a conditional pdf generically, and $\tau_0$, $\tau_1$, and $\tau_2$ are three positive tuning parameters whose values should be chosen so that the acceptance rate at each step is around 23% [Gelman et al., 1997]. There are numerous ways to design a Markov chain Monte Carlo (MCMC) sampler that draws samples from the joint posterior distribution. Instead of the Metropolis-within-Gibbs sampler we adopt here, one may use existing MCMC software, such as Stan [Stan Development Team, 2021], JAGS [Plummer et al., 2003], and BUGS [Spiegelhalter et al., 1996, Lunn et al., 2009], the first two of which are demonstrated in the GitHub repository (https://github.com/rh8liuqy/flexible_Gumbel). After obtaining enough high-quality samples from the joint posterior distribution $p(\theta,\sigma_1,\sigma_2,w|\mathbf{y})$, Bayesian inference is straightforward, including point estimation, interval estimation, and uncertainty assessment.
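The four steps above can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation, and it rests on assumptions flagged in the comments: $f_1$ is taken as the Gumbel density for the maximum and $f_2$ as that for the minimum, $10^4$ is read as the prior variance of $\theta$, the tuning parameters `tau` are treated as proposal standard deviations, and the two scale parameters are updated sequentially so that the $\sigma_2$ update conditions on the freshly updated $\sigma_1$. The default values of `tau` are arbitrary and would need tuning toward the acceptance rate mentioned above.

```python
import math
import random

def _exp(x):
    return math.exp(min(x, 700.0))  # clamp to avoid OverflowError

def _parts(v, th, s1, s2, w):
    u1, u2 = (v - th) / s1, (v - th) / s2
    a = math.log(w) - math.log(s1) - u1 - _exp(-u1)      # log{w f1}, f1 assumed Gumbel (max)
    b = math.log(1 - w) - math.log(s2) + u2 - _exp(u2)   # log{(1-w) f2}, f2 assumed Gumbel (min)
    return a, b

def log_post(y, th, s1, s2, w):
    """Log posterior of (theta, sigma1, sigma2) given w, up to an additive constant."""
    if s1 <= 0 or s2 <= 0:
        return float("-inf")
    lp = -th * th / (2 * 1e4)                                      # N(0, 10^4), variance assumed
    lp += -2 * math.log(s1) - 1 / s1 - 2 * math.log(s2) - 1 / s2   # inv-Gamma(1, 1) priors
    for v in y:
        a, b = _parts(v, th, s1, s2, w)
        m = max(a, b)
        lp += m + math.log(math.exp(a - m) + math.exp(b - m))
    return lp

def accept(lp_new, lp_old):
    # Metropolis rule: accept with probability min{exp(lp_new - lp_old), 1}
    return math.log(1.0 - random.random()) < lp_new - lp_old

def mwg(y, n_iter, th, s1, s2, w, tau=(0.3, 0.3, 0.3)):
    n, draws = len(y), []
    for _ in range(n_iter):
        # Step 1: draw z_i ~ Bernoulli(T_i); only their sum is needed below
        n1 = 0
        for v in y:
            a, b = _parts(v, th, s1, s2, w)
            n1 += random.random() < 1.0 / (1.0 + _exp(b - a))
        # Step 2: draw w from its Beta full conditional
        w = min(max(random.betavariate(1 + n1, n + 1 - n1), 1e-12), 1 - 1e-12)
        # Step 3: random-walk Metropolis update of theta
        cand = random.gauss(th, tau[0])
        if accept(log_post(y, cand, s1, s2, w), log_post(y, th, s1, s2, w)):
            th = cand
        # Step 4: random-walk Metropolis updates of sigma1, then sigma2
        cand = random.gauss(s1, tau[1])
        if accept(log_post(y, th, cand, s2, w), log_post(y, th, s1, s2, w)):
            s1 = cand
        cand = random.gauss(s2, tau[2])
        if accept(log_post(y, th, s1, cand, w), log_post(y, th, s1, s2, w)):
            s2 = cand
        draws.append((th, s1, s2, w))
    return draws
```

Negative proposals for a scale parameter receive zero posterior density and are therefore always rejected, which keeps the chain inside the parameter space without a transformation.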

4 Simulation study

Large-sample properties of MLEs and likelihood-based Bayesian inference under a correct model for data have been well studied. To assess finite-sample performance of the frequentist method and Bayesian method proposed in Section 3, we carried out a simulation study with two specific aims: first, to compare inference results from the two methods; second, to compare goodness of fit for data from distributions outside of the FG family when one assumes an FG distribution and when one assumes a two-component normal mixture distribution for the data.

In the first experiment, denoted as (E1) hereafter, we considered two FG distributions as true data generating mechanisms, $\text{FG}(\theta=1,\sigma_1=1,\sigma_2=1,w=0.4)$ and $\text{FG}(\theta=0,\sigma_1=1,\sigma_2=5,w=0.5)$. This design creates two FG distributions with the second one more skewed and variable than the first. Based on a random sample of size $n=50$ from the first FG distribution, we estimated $\boldsymbol{\Omega}$ by applying the ECM algorithm and the Metropolis-within-Gibbs algorithm. Similarly, based on a random sample of size $n\in\{100,200\}$ from the second FG distribution, we implemented the two algorithms to estimate $\boldsymbol{\Omega}$. The former algorithm produced the MLE of $\boldsymbol{\Omega}$, and we used the median of the posterior distribution of $\boldsymbol{\Omega}$ at convergence of the latter algorithm as another point estimate of $\boldsymbol{\Omega}$. Table 1 presents summary statistics of these estimates of $\boldsymbol{\Omega}$ and estimates of the corresponding standard deviation across 1000 Monte Carlo replicates under each simulation setting specified by the design of an FG distribution and the level of $n$.
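Generating data for an experiment such as (E1) only requires sampling from the two Gumbel components. The sketch below uses inverse-CDF sampling and assumes, as a labeled assumption, that the first component (weight $w$, scale $\sigma_1$) is the Gumbel distribution for the maximum and the second is the Gumbel distribution for the minimum; the function name `rfg` is hypothetical.

```python
import math
import random

def _unif(rng):
    # Uniform draw strictly inside (0, 1) so that log(-log(u)) is finite
    u = rng.random()
    while u <= 0.0:
        u = rng.random()
    return u

def rfg(theta, sigma1, sigma2, w, rng=random):
    """One draw from FG(theta, sigma1, sigma2, w) under the component assumption above."""
    e = math.log(-math.log(_unif(rng)))
    if rng.random() < w:
        return theta - sigma1 * e   # Gumbel (maximum): F(y) = exp(-exp(-(y - theta)/sigma1))
    return theta + sigma2 * e       # Gumbel (minimum): F(y) = 1 - exp(-exp((y - theta)/sigma2))

def rfg_sample(n, theta, sigma1, sigma2, w, seed=None):
    rng = random.Random(seed)
    return [rfg(theta, sigma1, sigma2, w, rng) for _ in range(n)]
```

For example, `rfg_sample(50, 1.0, 1.0, 1.0, 0.4, seed=2024)` mimics one replicate under the first design and `rfg_sample(200, 0.0, 1.0, 5.0, 0.5, seed=2024)` one under the second.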

Table 1: Frequentist and Bayesian inference results in experiment (E1) across 1000 Monte Carlo replicates. Here, point.est stands for the average of 1000 point estimates for each parameter from each method, $\widehat{\text{s.d.}}$ stands for the average of the corresponding 1000 estimated standard deviations, and s.d. refers to the empirical standard deviation of the 1000 point estimates from each method. Numbers in parentheses are $100\times$ Monte Carlo standard errors associated with the averages of 1000 estimates of the standard deviation associated with a point estimator.
| sample size | parameter | Frequentist point.est | Frequentist $\widehat{\text{s.d.}}$ | Frequentist s.d. | Bayesian point.est | Bayesian $\widehat{\text{s.d.}}$ | Bayesian s.d. |
|---|---|---|---|---|---|---|---|
| $n=50$ | $\theta$ | 0.990 | 0.197 (0.40) | 0.209 | 0.965 | 0.250 (0.18) | 0.224 |
| $n=50$ | $\sigma_1$ | 1.106 | 0.272 (0.77) | 0.419 | 1.045 | 0.638 (2.30) | 0.296 |
| $n=50$ | $\sigma_2$ | 1.047 | 0.192 (0.49) | 0.216 | 1.082 | 0.459 (2.12) | 0.465 |
| $n=50$ | $w$ | 0.411 | 0.190 (0.46) | 0.207 | 0.435 | 0.207 (0.12) | 0.187 |
| $n=100$ | $\theta$ | 0.002 | 0.198 (0.20) | 0.201 | 0.013 | 0.205 (0.15) | 0.203 |
| $n=100$ | $\sigma_1$ | 0.979 | 0.204 (0.41) | 0.216 | 1.014 | 0.224 (0.27) | 0.214 |
| $n=100$ | $\sigma_2$ | 4.932 | 0.590 (0.56) | 0.613 | 4.813 | 0.666 (0.44) | 0.615 |
| $n=100$ | $w$ | 0.495 | 0.091 (0.09) | 0.090 | 0.484 | 0.090 (0.04) | 0.088 |
| $n=200$ | $\theta$ | 0.008 | 0.136 (0.08) | 0.129 | 0.011 | 0.137 (0.07) | 0.130 |
| $n=200$ | $\sigma_1$ | 0.999 | 0.143 (0.21) | 0.144 | 1.013 | 0.144 (0.10) | 0.141 |
| $n=200$ | $\sigma_2$ | 4.993 | 0.435 (0.32) | 0.431 | 4.940 | 0.457 (0.20) | 0.434 |
| $n=200$ | $w$ | 0.500 | 0.064 (0.04) | 0.063 | 0.495 | 0.063 (0.02) | 0.062 |

According to Table 1, all parameter estimates in $\boldsymbol{\Omega}$ are reasonably close to the true values. When the sample size is as small as 50, estimates for $\boldsymbol{\Omega}$ resulting from the frequentist method are still similar to those from the Bayesian inference method, although estimates for the standard deviations of these point estimators can be fairly different. We do not find such discrepancy surprising because, for the frequentist method where we use the sandwich variance estimator to infer the uncertainty of an MLE for $\boldsymbol{\Omega}$, the asymptotic properties associated with MLEs that support the use of a sandwich variance estimator may not take effect yet at the current sample size; and, for the Bayesian method, the quantification of standard deviation can be sensitive to the choice of priors when $n$ is small. This explanation is supported by the diminishing discrepancy between the two sets of standard deviation estimates when $n=100$, 200. A closer inspection of the reported empirical mean of estimates for $\boldsymbol{\Omega}$ along with their empirical standard error suggests that, when $n=100$, the Bayesian method may slightly underestimate $\sigma_2$, the larger of the two scale parameters of FG. We believe that this is due to the inverse gamma prior imposed on the scale parameters that is sharply peaked near zero, and thus the posterior median of the larger scale parameter tends to be pulled downwards when the sample size is not large. As the sample size increases to 200, this trend of underestimation appears to diminish. The empirical means of the standard deviation estimates from both methods are close to the corresponding empirical standard deviations, which indicates that the variability of a point estimator is accurately estimated when $n$ is not small, whether it is based on the sandwich variance estimator in the frequentist framework, or based on the posterior sampling in the Bayesian framework. In summary, the methods proposed in Section 3 under both frameworks provide reliable inference for $\boldsymbol{\Omega}$ along with accurate uncertainty assessment of the point estimators when data arise from an FG distribution.

Among all existing mixture distributions, normal mixtures probably have the longest history and are most referenced in the literature. In another experiment, we compared the fit of a normal mixture with that of FG when data arise from three heavy-tailed distributions: (E2) Laplace with the location parameter equal to zero and the scale parameter equal to 2; (E3) a mixture of two Gumbel distributions for the maximum, with a common mode at zero, scale parameters in the two components equal to 2 and 6, respectively, and the mixing proportion equal to 0.5; (E4) a Student-$t$ distribution with degrees of freedom equal to 5. From each of the three distributions in (E2)–(E4), we generated a random sample of size $n=200$, following which we fit a two-component normal mixture model via the EM algorithm implemented using the R package mixtools, and also fit an FG model via the two algorithms described in Section 3. This model fitting exercise was repeated for 1000 Monte Carlo replicates under each of (E2)–(E4).

We used an empirical version of the Kullback-Leibler divergence as the metric to assess the quality of model fitting. We denote the true density function as $p(\cdot)$, and let $\hat{p}(\cdot)$ be a generic estimated density resulting from one of the three considered model fitting strategies. Under each setting in (E2)–(E4), a random sample of size 50000, $(x_1,\ldots,x_{50000})$, was generated from the true distribution, and an empirical version of the Kullback-Leibler divergence from $\hat{p}(\cdot)$ to $p(\cdot)$ is given by $D_{\text{KL}}=(1/50000)\sum_{i=1}^{50000}\log(p(x_i)/\hat{p}(x_i))$. Figure 1 shows the boxplots of $D_{\text{KL}}$ across 1000 Monte Carlo replicates corresponding to each model fitting scheme under (E2)–(E4).
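The empirical divergence is a plain Monte Carlo average of log-density ratios, as the following sketch illustrates for the Laplace setting (E2). The fitted density used here, a single normal density with matched variance, is a hypothetical stand-in chosen only to keep the example short; it is not one of the three fitting strategies compared in the paper.

```python
import math
import random

def empirical_kl(xs, log_p, log_p_hat):
    """Average of log p(x_i) - log p_hat(x_i) over draws x_i from the true density p."""
    return sum(log_p(x) - log_p_hat(x) for x in xs) / len(xs)

def laplace_logpdf(x, loc=0.0, scale=2.0):
    return -math.log(2.0 * scale) - abs(x - loc) / scale

def normal_logpdf(x, mu, sd):
    return -0.5 * math.log(2.0 * math.pi * sd * sd) - (x - mu) ** 2 / (2.0 * sd * sd)

def rlaplace(n, scale=2.0, seed=None):
    # The difference of two independent exponentials with mean `scale` is Laplace(0, scale)
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale) for _ in range(n)]

xs = rlaplace(50000, scale=2.0, seed=0)
# Hypothetical "fitted" density: a normal with the Laplace variance 2 * scale^2 = 8
d_kl = empirical_kl(xs, laplace_logpdf, lambda x: normal_logpdf(x, 0.0, math.sqrt(8.0)))
```

Smaller values of `d_kl` indicate a fitted density closer to the truth, and the divergence of a density from itself is exactly zero.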

Figure 1: Boxplots of the empirical Kullback-Leibler divergence from an estimated density to the true density under each of the true-model settings in (E2)–(E4). Under each setting, the three considered model fitting strategies are, from left to right in the figure, (i) using the ECM algorithm to fit an FG distribution (FG ECM), (ii) using the Bayesian method to fit an FG distribution (FG Bayes), and (iii) using the EM algorithm to fit a normal mixture distribution (Normal Mixture Distribution EM).

Judging from Figure 1, the FG distribution clearly outperforms the normal mixture when fitting data from any of the three heavy-tailed distributions in (E2)–(E4), and results from the frequentist method are comparable with those from the Bayesian method for fitting an FG model. When implementing the ECM algorithm for fitting the FG model and the EM algorithm for fitting the normal mixture, we capped the number of iterations at 1000. Our ECM algorithm always converged in the simulation, i.e., it reached a stationary point within 1000 iterations. However, the EM algorithm for fitting a normal mixture often had trouble achieving that, with more difficulty when data come from a heavier-tailed distribution. More specifically, under (E4), which has the highest kurtosis (equal to 9) among the three settings, the EM algorithm failed to converge in 59.9% of all Monte Carlo replicates; under (E2), which has the second highest kurtosis (equal to 6), it failed to converge in 6.7% of the replicates. Results associated with the normal mixture from these failing replicates were not included when producing the boxplots in Figure 1. In conclusion, the FG distribution is more suitable for symmetric or asymmetric heavy-tailed data than the normal mixture distribution.
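The two kurtosis values quoted above follow from standard moment formulas, where $\mu_2$ and $\mu_4$ denote the second and fourth central moments, $\nu$ the degrees of freedom, and $b$ the Laplace scale (notation introduced here):

```latex
\[
\text{Student-}t_{\nu}:\quad \frac{\mu_4}{\mu_2^{2}} = 3+\frac{6}{\nu-4}\ \ (\nu>4),
\qquad \nu=5 \;\Rightarrow\; 3+6=9;
\qquad\qquad
\text{Laplace}(b):\quad \frac{\mu_4}{\mu_2^{2}} = \frac{24\,b^{4}}{(2\,b^{2})^{2}} = 6 .
\]
```

Both exceed the value 3 of a normal distribution, and the Laplace value does not depend on the scale parameter.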

5 An application in hydrology

Daily maximum water elevation changes of a water body, such as an ocean, lake, or wetland, are of interest in hydrologic research. These changes may be close to zero on most days but can be extremely large or small under extreme weather. From the National Water Information System (https://waterdata.usgs.gov/), we downloaded water elevation data for Lake Murray near Columbia, South Carolina, United States, recorded from September 18, 2020 to September 18, 2021. The water elevation change of a given day was calculated by contrasting the maximum elevation and the minimum elevation on that day, returning a positive (negative) value if the maximum record of the day comes after (before) the minimum record on the same day. We fit the FG distribution to the resultant data with $n=366$ records using the frequentist method and the Bayesian method, with results presented in Table 2. The two inference methods produced very similar estimates for most parameters, although small differences were observed. For example, one would estimate the mode of daily maximum water elevation change to be $-0.795$ feet based on the frequentist method, but estimate it to be $-0.485$ feet using the Bayesian method. The discrepancy between these two mode estimates is minimal considering that the daily maximum water elevation changes range from $-38$ feet to 49.4 feet within this one year. Taking into account the uncertainty in these point estimates, we do not interpret any of these differences as statistically significant because a parameter estimate from one method always falls in the interval estimate for the same parameter from the other method according to Table 2. Using parameter estimates in Table 2 in the aforementioned R Shiny app, we obtained an estimated skewness of $-0.102$ and an estimated kurtosis of 6.384 based on the frequentist inference results, whereas the Bayesian inference yielded an estimated skewness of 0.058 and an estimated kurtosis of 6.074. Combining these two sets of results, we concluded that the underlying distribution of daily maximum water elevation change may be nearly symmetrical, with outliers on both tails that cause tails heavier than that of a Gumbel distribution.
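The signed daily change described above can be computed along the following lines. The input layout, a list of `(timestamp, elevation)` pairs for a single day, and the tie-breaking rule (the earliest record attaining the maximum or minimum is used) are assumptions made for this sketch; the paper does not specify either.

```python
def signed_daily_change(records):
    """Signed max-minus-min elevation change for one day.

    records: list of (timestamp, elevation) pairs with comparable timestamps.
    Returns a positive value if the daily maximum is recorded after the daily
    minimum, a negative value if it is recorded before, and 0.0 if all
    elevations are equal.
    """
    ordered = sorted(records, key=lambda r: r[0])
    t_max, e_max = max(ordered, key=lambda r: r[1])   # earliest record attaining the maximum
    t_min, e_min = min(ordered, key=lambda r: r[1])   # earliest record attaining the minimum
    size = e_max - e_min
    if size == 0:
        return 0.0
    return size if t_max > t_min else -size
```

For instance, a day whose lowest reading occurs at hour 0 and highest at hour 6 yields a positive change, while the reverse ordering yields a negative change of the same magnitude.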

Table 2: Frequentist and Bayesian inferences about daily maximum water elevation changes of Lake Murray, South Carolina, United States. Besides parameter estimates (under point.est) and the estimated standard deviations of these parameter estimates (under $\widehat{\text{s.d.}}$), 95% confidence intervals of the parameters from the frequentist method, and 95% credible intervals from the Bayesian method are also provided (under lower 95 and upper 95).
| parameter | Frequentist point.est | Frequentist $\widehat{\text{s.d.}}$ | Frequentist lower 95 | Frequentist upper 95 | Bayesian point.est | Bayesian $\widehat{\text{s.d.}}$ | Bayesian lower 95 | Bayesian upper 95 |
|---|---|---|---|---|---|---|---|---|
| $\theta$ | -0.795 | 0.796 | -2.355 | 0.764 | -0.485 | 0.695 | -1.670 | 0.979 |
| $\sigma_1$ | 5.186 | 0.541 | 4.124 | 6.247 | 5.400 | 0.655 | 4.520 | 6.910 |
| $\sigma_2$ | 6.237 | 1.735 | 2.836 | 9.638 | 5.733 | 1.036 | 4.390 | 8.030 |
| $w$ | 0.698 | 0.169 | 0.367 | 1.029 | 0.629 | 0.141 | 0.327 | 0.846 |
Figure 2 presents the estimated density functions from these two methods, in contrast with the estimated density curve resulting from fitting a two-component normal mixture to the data, and a kernel density estimate using a Gaussian kernel with the bandwidth selected according to the method proposed by Sheather and Jones [1991]. The last estimate is fully nonparametric and served as a benchmark against which the other three density estimates were assessed graphically. The kernel density estimate is more flexible at describing varying tail behaviors, but such flexibility comes at the cost of statistical efficiency and interpretability. With the wiggly tails evident in Figure 2 for this estimate, we suspected a certain level of overfitting of the kernel density estimate. This often happens to kernel-based estimation of a function around a region where data are scarce, with a bandwidth not large enough for the region. Between the two FG density estimates, the difference is almost negligible. They both track the kernel density estimate closely over a wide range of the support around the mode. The mode of the estimated normal mixture density is close to the other three mode estimates, but the tails are much lighter than those of the other three estimated densities.

Figure 2: Four density estimates based on daily maximum water elevation changes in Lake Murray, including the kernel density estimate (solid line), the estimated FG density from the ECM algorithm (dotted line), the estimated FG density from the Bayesian method (dashed line), and the estimated normal mixture density (dash-dotted line).

Besides comparing the three parametric density estimates pictorially, we also used the Monte-Carlo-based one-sample Kolmogorov–Smirnov test to assess the goodness of fit. The $p$-values from this test are 0.223, 0.312, and 0.106 for the frequentist FG density estimate, the Bayesian FG density estimate, and the estimated normal mixture density, respectively. Although none of the $p$-values is low enough to indicate a lack of fit (at significance level 0.05, for example), the $p$-value associated with the normal mixture is much lower than those for FG. Hence, between the two null hypotheses, one assuming an FG distribution and the other a normal mixture for this data set, the data provide even weaker evidence against the former than against the latter. It is also worth noting that the Kolmogorov–Smirnov test is known to have low power to detect deviations from a posited distribution that occur in the tails [Mason and Schuenemeyer, 1983]. This may explain the above-0.05 $p$-value for the normal mixture fit of the data even though the tail of this posited distribution may be too thin for the current data. Finally, as suggested by a referee, we computed the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) after fitting the FG distribution and the normal mixture distribution to the data. When assuming an FG distribution, we obtained an AIC/BIC of 2506.028/2521.638 from the frequentist method, and 2506.299/2521.909 from the Bayesian method. When assuming a normal mixture, we found the values of AIC and BIC to be 2499.821 and 2519.334, respectively. Even though the fitted normal mixture distribution produces a lower AIC/BIC than the fitted FG distribution, we argue that these metrics focus more on the overall goodness of fit, and can be more forgiving of a relatively poor fit to certain features of a distribution, such as the tail behavior.
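As a small arithmetic aid for reading the reported criteria, the sketch below uses the standard definitions $\text{AIC} = 2k - 2\log L$ and $\text{BIC} = k\log n - 2\log L$ to back out the maximised log-likelihood and the sample size implied by each AIC/BIC pair. The parameter counts are our assumptions: $k = 4$ for FG ($\theta$, $\sigma_1$, $\sigma_2$, $w$) and $k = 5$ for a two-component normal mixture (two means, two variances, one weight). The helper names are hypothetical.

```python
import math

def implied_n(aic, bic, k):
    """Solve BIC - AIC = k * (log(n) - 2) for the sample size n."""
    return math.exp((bic - aic) / k + 2.0)

def max_loglik(aic, k):
    """Recover the maximised log-likelihood from AIC = 2k - 2 log L."""
    return k - aic / 2.0

# (AIC, BIC, assumed number of free parameters k) as reported in the text.
fits = {
    "FG (frequentist)": (2506.028, 2521.638, 4),
    "FG (Bayesian)": (2506.299, 2521.909, 4),
    "normal mixture": (2499.821, 2519.334, 5),
}

for name, (aic, bic, k) in fits.items():
    n = implied_n(aic, bic, k)
    ll = max_loglik(aic, k)
    print(f"{name:18s} implied n = {n:6.1f}   log-likelihood = {ll:9.3f}")
```

Under these assumed parameter counts, all three pairs point to the same sample size of roughly 366, which suggests the counts are at least mutually consistent. The gap in log-likelihood between the two models is only a few units, so the comparison says little about tail fit specifically.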

We used Stan to implement the Bayesian inference for the Lake Murray data. The code can be found in https://github.com/rh8liuqy/flexible_Gumbel, where the JAGS code for fitting the FG distribution is also provided. The posterior output is given in Appendix C. The output provided there indicates that our MCMC chain has converged (see the Rhat statistics).
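For readers unfamiliar with the Rhat statistic, the following sketch computes the classic potential scale reduction factor from several chains. It is a simplified illustration on synthetic chains, not the Lake Murray posterior draws, and it may differ from the split or rank-normalized variants that Stan and related packages report. The function name `rhat` is ours.

```python
import random
import statistics

def rhat(chains):
    """Classic (non-split) potential scale reduction factor.

    chains: list of equal-length lists of posterior draws for one parameter.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average of the within-chain sample variances.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B / n: sample variance of the chain means.
    between_over_n = statistics.variance(chain_means)
    var_plus = (n - 1) / n * within + between_over_n
    return (var_plus / within) ** 0.5

rng = random.Random(1)

# Four chains sampling the same distribution: Rhat should be close to 1.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]

# Four chains stuck around two different values: Rhat should be well above 1.
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 5.0, 5.0)]

rhat_mixed = rhat(mixed)
rhat_stuck = rhat(stuck)
print(f"well-mixed chains: Rhat = {rhat_mixed:.3f}")
print(f"disagreeing chains: Rhat = {rhat_stuck:.3f}")
```

Values near 1, as in the output of Appendix C, indicate that the chains agree with one another; values well above 1 signal that the chains have not yet explored the same distribution.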

6 An application in criminology

Table 3: Frequentist and Bayesian modal regression models based on the FG distribution fitted to the crime data. Besides parameter estimates (under point.est) and the estimated standard deviations of these parameter estimates (under $\widehat{\text{s.d.}}$), 95% confidence intervals of the parameters from the frequentist method, and 95% credible intervals from the Bayesian method are also provided (under lower 95 and upper 95).
Frequentist Bayesian
parameter point.est $\widehat{\text{s.d.}}$ lower 95 upper 95 point.est $\widehat{\text{s.d.}}$ lower 95 upper 95
$\beta_1$ -0.166 0.072 -0.306 -0.026 -0.162 0.079 -0.312 -0.003
$\beta_2$ 0.216 0.110 -0.000 0.432 0.232 0.120 -0.007 0.479
$\beta_3$ 0.067 0.013 0.042 0.093 0.067 0.014 0.039 0.095
$\sigma_1$ 1.600 0.180 1.247 1.954 1.690 0.214 1.206 2.686
$\sigma_2$ 51.882 45.034 -36.384 140.148 19.3 19.300 0.187 133.047

Table 4: Mean regression model based on the normal distribution fitted to the crime data. Besides parameter estimates (under point.est) and the estimated standard deviations of these parameter estimates (under $\widehat{\text{s.d.}}$), 95% confidence intervals of the parameters are also provided (under lower 95 and upper 95).
parameter point.est $\widehat{\text{s.d.}}$ lower 95 upper 95
$\beta_1$ 0.467 0.161 0.142 0.792
$\beta_2$ 1.140 0.224 0.689 1.591
$\beta_3$ 0.068 0.034 0.000 0.136

With the location parameter $\theta$ of the FG distribution designated as the mode, it is straightforward to formulate a modal regression model that explores the relationship between the response variable and predictors. To demonstrate the formulation of a modal regression model based on the FG distribution, we analyzed a data set from Agresti et al. [2021] in the area of criminology. This data set contains the percentage of college education, poverty percentage, metropolitan rate, and murder rate for the 50 states in the United States and the District of Columbia from the year 2003. The poverty percentage is the percentage of the residents with income below the poverty level; the metropolitan rate is defined as the percentage of the population living in metropolitan areas; and the murder rate is the annual number of murders per 100,000 people in the population.

We fit the following modal regression model to investigate the association between the murder rate (Y𝑌Yitalic_Y) and the aforementioned demographic variables,

$$Y \mid \boldsymbol{\beta}, \sigma_1, \sigma_2 \sim \operatorname{FG}(\beta_0 + \beta_1 \times \text{college} + \beta_2 \times \text{poverty} + \beta_3 \times \text{metropolitan}, \sigma_1, \sigma_2, w),$$

where $\boldsymbol{\beta} = [\beta_0, \beta_1, \beta_2, \beta_3]^\top$ includes all regression coefficients. For the prior elicitation in Bayesian inference, we assume that $\beta_0, \dots, \beta_3 \stackrel{\text{i.i.d.}}{\sim} \mathcal{N}(0, 10^4)$ and use the same priors for $\sigma_1$, $\sigma_2$, and $w$ as those in Section 3.2. As a more conventional regression analysis to compare with our modal regression, we also fit the mean regression model assuming mean-zero normal model error to the data.
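The modal regression model above replaces the mode $\theta$ of the FG distribution with a linear predictor, so its log-likelihood is a sum of FG log-densities evaluated at observation-specific modes. The sketch below illustrates this on synthetic data. The slopes, $\sigma_1$, and $\sigma_2$ are the frequentist estimates in Table 3, whereas the intercept (0.5), the weight $w$ (0.9), the covariate ranges, and all function names are placeholders of ours, since Table 3 reports neither $\beta_0$ nor $w$. It is not the authors' ECM or Stan implementation.

```python
import math
import random

def fg_logpdf(y, mode, s1, s2, w):
    """Log density of FG(mode, s1, s2, w), combined with log-sum-exp."""
    z1 = (y - mode) / s1
    z2 = (y - mode) / s2
    # Exponents are capped at 700 so that math.exp cannot overflow.
    log_right = math.log(w) - math.log(s1) - z1 - math.exp(min(-z1, 700.0))
    log_left = math.log(1.0 - w) - math.log(s2) + z2 - math.exp(min(z2, 700.0))
    top = max(log_right, log_left)
    return top + math.log(math.exp(log_right - top) + math.exp(log_left - top))

def fg_draw(rng, mode, s1, s2, w):
    """Draw one FG variate by inverting the component CDFs of Appendix B."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    if rng.random() < w:
        return mode - s1 * math.log(-math.log(u))        # right-skewed Gumbel
    return mode + s2 * math.log(-math.log(1.0 - u))      # left-skewed Gumbel

def conditional_mode(beta, x):
    """Linear predictor beta0 + beta1*x1 + beta2*x2 + beta3*x3."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def log_lik(beta, s1, s2, w, X, y):
    """Modal regression log-likelihood under FG errors."""
    return sum(fg_logpdf(yi, conditional_mode(beta, xi), s1, s2, w)
               for xi, yi in zip(X, y))

rng = random.Random(2003)
# Synthetic covariates (college, poverty, metropolitan); ranges are hypothetical.
X = [(rng.uniform(15, 40), rng.uniform(5, 20), rng.uniform(30, 100))
     for _ in range(200)]
beta_true = [0.5, -0.166, 0.216, 0.067]   # intercept 0.5 is a placeholder
s1, s2, w = 1.600, 51.882, 0.9            # w = 0.9 is a placeholder
y = [fg_draw(rng, conditional_mode(beta_true, x), s1, s2, w) for x in X]

beta_flipped = [0.5, 0.166, 0.216, 0.067]  # wrong sign on the college effect
ll_true = log_lik(beta_true, s1, s2, w, X, y)
ll_flipped = log_lik(beta_flipped, s1, s2, w, X, y)
print(f"log-likelihood at generating values: {ll_true:.1f}")
print(f"log-likelihood with flipped college effect: {ll_flipped:.1f}")
```

A function such as `log_lik` could be passed to a general-purpose optimiser or used inside a sampler; the paper's own inference relies on the ECM algorithm and on Stan or JAGS instead.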

Table 3 shows the inference results from the modal regression model, and Table 4 presents the inference results from the mean regression model. At the 5% significance level, both frequentist and Bayesian modal regression analyses confirm that there exists a negative association between the percentage of college education and the murder rate, as well as a positive association between the metropolitan rate and the murder rate. In contrast, according to the inferred mean regression model, there is a positive association between the percentage of college education and the murder rate. Such a claimed positive association is intuitively difficult to justify and contradicts many published results in criminology [Hjalmarsson and Lochner, 2012, Lochner, 2020].
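The frequentist intervals in Table 3 can be checked against the reported point estimates and standard deviations. The sketch below recomputes Wald-type intervals, estimate $\pm\, 1.96 \times \widehat{\text{s.d.}}$, and compares them with the tabulated bounds. That the table uses this construction is our reading of the numbers rather than something stated in this section, and small discrepancies are expected because the inputs are rounded to three decimals.

```python
# (point.est, s.d., lower 95, upper 95) from the frequentist columns of Table 3.
table3_frequentist = {
    "beta1": (-0.166, 0.072, -0.306, -0.026),
    "beta2": (0.216, 0.110, -0.000, 0.432),
    "beta3": (0.067, 0.013, 0.042, 0.093),
    "sigma1": (1.600, 0.180, 1.247, 1.954),
    "sigma2": (51.882, 45.034, -36.384, 140.148),
}

Z_975 = 1.959964  # 97.5% quantile of the standard normal distribution

def wald_interval(est, sd, z=Z_975):
    """Symmetric normal-approximation interval around a point estimate."""
    return est - z * sd, est + z * sd

gaps = {}
for name, (est, sd, lower, upper) in table3_frequentist.items():
    w_lower, w_upper = wald_interval(est, sd)
    gaps[name] = max(abs(w_lower - lower), abs(w_upper - upper))
    # An effect is flagged when the reported interval excludes zero.
    excludes_zero = lower > 0 or upper < 0
    print(f"{name:7s} recomputed ({w_lower:9.3f}, {w_upper:9.3f})  "
          f"reported ({lower:9.3f}, {upper:9.3f})  excludes 0: {excludes_zero}")

max_gap = max(gaps.values())
print(f"largest discrepancy: {max_gap:.4f}")
```

The reported intervals for $\beta_1$ and $\beta_3$ exclude zero while that for $\beta_2$ does not, matching the conclusions drawn in the text. The negative lower bound for $\sigma_2$ is a by-product of a symmetric interval for a positive scale parameter with a large standard deviation.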

The scatter plot of the data in Figure 3 sheds some light on why one reaches such a drastically different conclusion on a covariate effect when mean regression is considered in place of modal regression. As shown in Figure 3, there exists an obvious outlier, the District of Columbia (D.C.), visible for instance in the panels of the first row of the scatter plot matrix. D.C. exhibited not only the highest murder rate but also the highest percentage of college-educated individuals. These dual characteristics position D.C. as an outlier within the data set. Mean regression reacts to this one extreme outlier by inflating the covariate effect associated with the percentage of college education in the inferred mean regression function. Thanks to the heavy-tailed feature of the FG distribution, modal regression based on this distribution is robust to outliers: it strives to capture data features suggested by the majority of the data and is not distracted by the extreme outlier when inferring covariate effects in this application.

Lastly, to compare the overall goodness of fit of the two models for the current data, we computed AIC and BIC after fitting each regression model. With the frequentist and Bayesian methods, the modal regression analysis yields AIC/BIC equal to 239.394/252.917 and 238.710/252.233, respectively. The mean regression analysis leads to AIC and BIC equal to 303.154 and 312.813, respectively, so both criteria favor the modal regression model. Appendix C contains the convergence diagnosis for the Bayesian inferential method applied to this data set, from which we see no concerns about convergence.

Figure 3: Scatter plot matrix of the crime data, where D.C. stands out as an extreme outlier with the highest murder rate and the highest percentage of college education.

7 Discussion

The mode had been an overlooked location parameter in statistical inference until recently, when the statistics community witnessed a revived interest in modal regression [Chen, 2018, Chacón, 2020, Feng et al., 2020, Xu et al., 2020, Ullah et al., 2021, Wang and Li, 2021, Xiang and Yao, 2022]. Historically, statistical inference for the mode has been mostly developed under the nonparametric framework for reasons we point out in Section 1. Existing semiparametric methods for modal regression only introduce parametric ingredients in the regression function, i.e., the conditional mode of the response, with the mode-zero error distribution left in a nonparametric form [Yao and Li, 2013, Liu et al., 2013, Zhang et al., 2013, Yang and Yang, 2014, Zhao et al., 2014, Krief, 2017, Tian et al., 2017, Li and Huang, 2019]. The few recently proposed parametric modal regression models all impose stringent parametric assumptions on the error distribution [Bourguignon et al., 2020, Zhou and Huang, 2020, 2022]. Our proposed flexible Gumbel distribution greatly alleviates concerns contributing to data scientists’ reluctance to adopt a parametric framework when drawing inferences for the mode. This new distribution is a heterogeneous mixture in the sense that the two components in the mixture belong to different Gumbel distribution families, a feature that shields it from the non-identifiability issue that most traditional mixture distributions, such as normal mixtures, face. The proposed distribution is indexed by the mode along with shape and scale parameters, and thus is convenient to use to draw inferences for the mode while remaining flexible. It is also especially suitable for modeling heavy-tailed data, whether the heaviness in tails is due to extremely large or extremely small observations, or both. These are virtues of FG that cannot be achieved by the popular normal mixture and many other existing mixture distributions.

We develop the numerically efficient and stable ECM algorithm for frequentist inference for the FG distribution, and a reliable Bayesian inference method that can be easily implemented using free software, including Stan, JAGS, and BUGS. Compared with the more widely adopted mean regression framework, the modal regression model based on FG we entertained in Section 6 shows great potential in revealing meaningful covariate effects potentially masked by extreme outliers. With these advances made in this study, we open up new directions for parametric modal regression and semiparametric modal regression with a fully parametric yet flexible error distribution, and potentially nonparametric ingredients incorporated in the regression function.

Disclosure statement

Computer programs for implementing the FG distribution, related models and data used in this paper are available at https://github.com/rh8liuqy/flexible_Gumbel.

Appendix A Derivation of the third central moment of FG in (4)

For any finite mixture distribution, its higher-order central moments can be expressed using the binomial formula (see Equation (1.22) in Frühwirth-Schnatter [2006]). The FG distribution is a mixture distribution with two components: a right-skewed Gumbel distribution and a left-skewed Gumbel distribution. Let $Y \sim \operatorname{FG}(\theta = 0, \sigma_1, \sigma_2, w)$, $Y_1 \sim \text{right-skewed Gumbel}(\theta = 0, \sigma_1)$, and $Y_2 \sim \text{left-skewed Gumbel}(\theta = 0, \sigma_2)$. The $j$-th central moment of $Y$ can be expressed as

$$
\begin{aligned}
\mathbb{E}\left\{(Y-\mu_y)^j\right\} &= w\,\mathbb{E}\left\{(Y_1-\mu_1+\mu_1-\mu_y)^j\right\} + \bar{w}\,\mathbb{E}\left\{(Y_2-\mu_2+\mu_2-\mu_y)^j\right\} \\
&= \sum_{k=0}^{j}\left[ w\,\mathbb{E}\left\{\binom{j}{k}(Y_1-\mu_1)^k(\mu_1-\mu_y)^{j-k}\right\} + \bar{w}\,\mathbb{E}\left\{\binom{j}{k}(Y_2-\mu_2)^k(\mu_2-\mu_y)^{j-k}\right\}\right],
\end{aligned}
\tag{9}
$$

where $\mu_y$, $\mu_1$, and $\mu_2$ are the expectations of $Y$, $Y_1$, and $Y_2$, respectively. Applying (9) for $j=3$, one obtains the expression in (4).
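For reference, the $j=3$ case of (9) can be written out as follows. This is a sketch that uses standard moments of the Gumbel distribution; the shorthand $v_i$ for the variance and $\kappa_{3,i}$ for the third central moment of $Y_i$, as well as $\gamma$ for the Euler–Mascheroni constant and $\zeta(3)$ for Apéry's constant, are notation introduced here. Since (4) is stated elsewhere in the paper, the result below should agree with it only up to rearrangement.

```latex
% j = 3 in (9); the k = 1 terms vanish because E(Y_i - \mu_i) = 0.
\begin{aligned}
\mathbb{E}\{(Y-\mu_y)^3\}
  &= w\left[\kappa_{3,1} + 3 v_1(\mu_1-\mu_y) + (\mu_1-\mu_y)^3\right]
   + \bar{w}\left[\kappa_{3,2} + 3 v_2(\mu_2-\mu_y) + (\mu_2-\mu_y)^3\right], \\
% Standard Gumbel moments with common location \theta = 0:
\mu_1 &= \gamma\sigma_1, \qquad \mu_2 = -\gamma\sigma_2, \qquad
v_i = \frac{\pi^2\sigma_i^2}{6}, \qquad
\kappa_{3,1} = 2\zeta(3)\sigma_1^3, \qquad \kappa_{3,2} = -2\zeta(3)\sigma_2^3, \\
% Mixture mean and the two centring shifts:
\mu_y &= w\mu_1 + \bar{w}\mu_2 = \gamma\,(w\sigma_1 - \bar{w}\sigma_2), \qquad
\mu_1-\mu_y = \gamma\bar{w}(\sigma_1+\sigma_2), \qquad
\mu_2-\mu_y = -\gamma w(\sigma_1+\sigma_2).
\end{aligned}
```

Substituting the second and third lines into the first gives the third central moment as an explicit function of $(\sigma_1, \sigma_2, w)$. Its sign can be positive or negative depending on these parameters, which is how the FG distribution accommodates both directions of skewness.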

Appendix B Proof of identifiability of FG in (3)

In our context of two-component mixture distributions whose cumulative distribution functions are of the form $wF_1(y)+\bar{w}F_2(y)$, Theorem 1 in Teicher [1961] states that a mixture distribution is identifiable if and only if there exist $y_1$ in the support of $F_1(y)$ and $y_2$ in the support of $F_2(y)$ such that

$$\begin{vmatrix} F_1(y_1) & F_2(y_1) \\ F_1(y_2) & F_2(y_2) \end{vmatrix} \neq 0,$$

that is, the above determinant does not vanish for some $(y_1, y_2)$. In what follows, we prove that the FG distribution is identifiable by showing the existence of $(y_1, y_2)$ that makes the above determinant non-zero.

Proof.

Recall that the cumulative distribution functions of right-skewed and left-skewed Gumbel distributions are given by

$$F_1(y) = \exp\left\{-\exp\left(-\frac{y-\theta}{\sigma_1}\right)\right\},$$

and

$$F_2(y) = 1 - \exp\left\{-\exp\left(\frac{y-\theta}{\sigma_2}\right)\right\},$$

respectively.

By setting $y_1 = \theta$, we have

$$\begin{vmatrix} F_1(y_1) & F_2(y_1) \\ F_1(y_2) & F_2(y_2) \end{vmatrix} = \begin{vmatrix} e^{-1} & 1-e^{-1} \\ F_1(y_2) & F_2(y_2) \end{vmatrix} = e^{-1}F_2(y_2) - (1-e^{-1})F_1(y_2). \tag{10}$$

We next show by contradiction that there exists $y_2 \in \mathbb{R}$ such that (10) is not equal to zero.

Suppose (10) is equal to zero for all $y_2 \in \mathbb{R}$, that is,

$$F_2(y_2) = (e-1)F_1(y_2), \text{ for all } y_2 \in \mathbb{R}.$$

Taking the limit of both sides of the above equation as $y_2 \to +\infty$ gives

$$\lim_{y_2 \to +\infty} F_2(y_2) = (e-1) \times \lim_{y_2 \to +\infty} F_1(y_2),$$

which is clearly false since $\lim_{y_2 \to +\infty} F_2(y_2) = \lim_{y_2 \to +\infty} F_1(y_2) = 1$ while $e - 1 \neq 1$. Hence, there exists $y_2 \in \mathbb{R}$ such that (10) is not equal to zero. Denote by $y_2^*$ such a value, or one of such values if such $y_2$ is not unique.

Now that we have found $(y_1, y_2) = (\theta, y_2^*)$ such that the aforementioned determinant does not vanish, the FG distribution is identifiable by Theorem 1 in Teicher [1961]. ∎
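The argument can also be checked numerically. The sketch below evaluates the determinant at $y_1 = \theta$ and a large $y_2$, using the two component distribution functions recalled in the proof and, purely for illustration, the frequentist estimates of $(\theta, \sigma_1, \sigma_2)$ from Table 2. The function names and the choice $y_2 = \theta + 100$ are ours. As $y_2 \to +\infty$ the determinant tends to $2e^{-1} - 1 \approx -0.264$, which is non-zero.

```python
import math

def gumbel_max_cdf(y, theta, sigma1):
    """F1: distribution function of the right-skewed Gumbel component."""
    # The exponent is capped so that math.exp cannot overflow for very small y.
    z = min(-(y - theta) / sigma1, 700.0)
    return math.exp(-math.exp(z))

def gumbel_min_cdf(y, theta, sigma2):
    """F2: distribution function of the left-skewed Gumbel component."""
    # The exponent is capped so that math.exp cannot overflow for very large y.
    z = min((y - theta) / sigma2, 700.0)
    return 1.0 - math.exp(-math.exp(z))

def teicher_determinant(y1, y2, theta, sigma1, sigma2):
    """Determinant |F1(y1) F2(y1); F1(y2) F2(y2)| from the identifiability condition."""
    return (gumbel_max_cdf(y1, theta, sigma1) * gumbel_min_cdf(y2, theta, sigma2)
            - gumbel_min_cdf(y1, theta, sigma2) * gumbel_max_cdf(y2, theta, sigma1))

# Illustrative parameter values: frequentist estimates from Table 2.
theta, sigma1, sigma2 = -0.795, 5.186, 6.237

det_far = teicher_determinant(theta, theta + 100.0, theta, sigma1, sigma2)
limit = 2.0 / math.e - 1.0   # value of (10) as y2 -> +infinity

print(f"determinant at y2 = theta + 100: {det_far:.6f}")
print(f"limiting value 2/e - 1:          {limit:.6f}")
```

Any sufficiently large $y_2$ therefore serves as the $y_2^*$ in the proof, regardless of the values of $\sigma_1$ and $\sigma_2$.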

Appendix C Convergence diagnosis of MCMC

## hydrology example - Section 5
        mean se_mean   sd  2.5%   25%   50%  75% 97.5%    n_eff Rhat
w1      0.61       0 0.14  0.31  0.51  0.63 0.72  0.83 149860.5    1
theta  -0.42       0 0.70 -1.62 -0.93 -0.48 0.04  1.04 162508.9    1
sigma1  5.54       0 0.65  4.67  5.11  5.40 5.81  7.25 159083.7    1
sigma2  5.95       0 1.04  4.58  5.20  5.73 6.49  8.49 175171.4    1

## criminology example - Section 6
  variable    mean  median      sd     mad      q5      q95  rhat ess_bulk ess_tail
  <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>    <dbl> <dbl>    <dbl>    <dbl>
1 alpha     0.496   0.530   2.72    2.68   -4.00     4.89    1.00  102742.   41503.
2 beta[1]  -0.160  -0.162   0.0785  0.0769 -0.286   -0.0292  1.00   35063.   13152.
3 beta[2]   0.235   0.232   0.124   0.120   0.0374   0.441   1.00    2074.    1702.
4 beta[3]   0.0669  0.0669  0.0144  0.0140  0.0434   0.0904  1.00    9603.    9166.
5 scale1    1.95    1.69    2.36    0.214   1.40     2.64    1.01     588.     189.
6 scale2   53.6    38.6    74.2    19.3     1.80   133.      1.01     583.     194.

References

  • Chacón [2020] José E. Chacón. The modal age of statistics. International Statistical Review, 88(1):122–141, 2020. doi:10.1111/insr.12340.
  • Chernoff [1964] Herman Chernoff. Estimation of the mode. Annals of the Institute of Statistical Mathematics, 16(1):31–41, 1964. doi:10.1007/BF02868560.
  • Dalenius [1965] Tore Dalenius. The mode–a neglected statistical parameter. Journal of the Royal Statistical Society. Series A (General), 128(1):110–117, 1965. doi:10.2307/2343439.
  • Venter [1967] JH Venter. On estimation of the mode. The Annals of Mathematical Statistics, 38(5):1446–1455, 1967. doi:10.1214/aoms/1177698699.
  • Chen [2018] Yen-Chi Chen. Modal regression using kernel density estimation: A review. Wiley Interdisciplinary Reviews: Computational Statistics, 10(4):e1431, 2018. doi:10.1002/wics.1431.
  • Ota et al. [2019] Hirofumi Ota, Kengo Kato, and Satoshi Hara. Quantile regression approach to conditional mode estimation. Electronic Journal of Statistics, 13(2):3120–3160, 2019. doi:10.1214/19-EJS1607.
  • Zhang et al. [2021] Tao Zhang, Kengo Kato, and David Ruppert. Bootstrap inference for quantile-based modal regression. Journal of the American Statistical Association, pages 1–13, 2021. doi:10.1080/01621459.2021.1918130.
  • Gumbel [1941] E. J. Gumbel. The return period of flood flows. The Annals of Mathematical Statistics, 12(2):163–190, 1941. doi:10.1214/aoms/1177731747.
  • Jenkinson [1955] Arthur F Jenkinson. The frequency distribution of the annual maximum (or minimum) values of meteorological elements. Quarterly Journal of the Royal Meteorological Society, 81(348):158–171, 1955. doi:10.1002/qj.49708134804.
  • Loaiciga and Leipnik [1999] H. A. Loaiciga and R. B. Leipnik. Analysis of extreme hydrologic events with Gumbel distributions: marginal and additive cases. Stochastic Environmental Research and Risk Assessment (SERRA), 13(4):251–259, 1999. doi:10.1007/s004770050042.
  • Koutsoyiannis [2004] Demetris Koutsoyiannis. Statistics of extremes and estimation of extreme rainfall: I. theoretical investigation / statistiques de valeurs extrêmes et estimation de précipitations extrêmes: I. recherche théorique. Hydrological Sciences Journal, 49(4), 2004. doi:10.1623/hysj.49.4.575.54430.
  • Dawley et al. [2019] Shawn Dawley, Yong Zhang, Xiaoting Liu, Peng Jiang, Geoffrey Tick, HongGuang Sun, Chunmiao Zheng, and Li Chen. Statistical analysis of extreme events in precipitation, stream discharge, and groundwater head fluctuation: distribution, memory, and correlation. Water, 11(4):707, 2019. doi:10.3390/w11040707.
  • Bali [2003] Turan G. Bali. An extreme value approach to estimating volatility and value at risk. The Journal of Business, 76(1):83–108, 2003. doi:10.1086/344669.
  • Pratiwi et al. [2019] N Pratiwi, C Iswahyudi, and R I Safitri. Generalized extreme value distribution for value at risk analysis on gold price. Journal of Physics: Conference Series, 1217(1):012090, 2019. doi:10.1088/1742-6596/1217/1/012090.
  • Cooray [2010] Kahadawala Cooray. Generalized Gumbel distribution. Journal of Applied Statistics, 37(1):171–179, 2010. doi:10.1080/02664760802698995.
  • Shin et al. [2015] Ju-Young Shin, Taesam Lee, and Taha B. M. J. Ouarda. Heterogeneous mixture distributions for modeling multisource extreme rainfalls. Journal of Hydrometeorology, 16(6):2639–2657, 2015. doi:10.1175/jhm-d-14-0130.1.
  • Yao et al. [2012] Weixin Yao, Bruce G Lindsay, and Runze Li. Local modal regression. Journal of Nonparametric Statistics, 24(3):647–663, 2012. doi:10.1080/10485252.2012.678848.
  • Yao and Li [2013] Weixin Yao and Longhai Li. A new regression model: Modal linear regression. Scandinavian Journal of Statistics, 41(3):656–671, 2013. doi:10.1111/sjos.12054.
  • Bourguignon et al. [2020] Marcelo Bourguignon, Jeremias Leão, and Diego I Gallardo. Parametric modal regression with varying precision. Biometrical Journal, 62(1):202–220, 2020. doi:10.1002/bimj.201900132.
  • Zhou and Huang [2020] Haiming Zhou and Xianzheng Huang. Parametric mode regression for bounded responses. Biometrical Journal, 62(7):1791–1809, 2020. doi:10.1002/bimj.202000039.
  • Zhou and Huang [2022] Haiming Zhou and Xianzheng Huang. Bayesian beta regression for bounded responses with unknown supports. Computational Statistics & Data Analysis, 167:107345, 2022. doi:10.1016/j.csda.2021.107345.
  • Teicher [1961] Henry Teicher. Identifiability of mixtures. The Annals of Mathematical Statistics, 32(1):244–248, 1961. doi:10.1214/aoms/1177705155.
  • Teicher [1963] Henry Teicher. Identifiability of finite mixtures. The Annals of Mathematical Statistics, 34(4):1265–1269, 1963. doi:10.1214/aoms/1177703862.
  • Yakowitz and Spragins [1968] Sidney J Yakowitz and John D Spragins. On the identifiability of finite mixtures. The Annals of Mathematical Statistics, 39(1):209–214, 1968. doi:10.1214/aoms/1177698520.
  • Redner and Walker [1984] Richard A Redner and Homer F Walker. Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239, 1984. doi:10.1137/1026034.
  • Dempster et al. [1977] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, sep 1977. doi:10.1111/j.2517-6161.1977.tb01600.x.
  • Boos and Stefanski [2013] Dennis D Boos and L. A Stefanski. Essential Statistical Inference. Springer New York, 2013. doi:10.1007/978-1-4614-4818-1.
  • Meng and Rubin [1993] Xiao-Li Meng and Donald B. Rubin. Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika, 80(2):267–278, 1993. doi:10.1093/biomet/80.2.267.
  • Wu [1983] C. F. Jeff Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), mar 1983. doi:10.1214/aos/1176346060.
  • Louis [1982] Thomas A. Louis. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 44(2):226–233, jan 1982. doi:10.1111/j.2517-6161.1982.tb01203.x.
  • Oakes [1999] D. Oakes. Direct calculation of the information matrix via the EM. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(2):479–482, apr 1999. doi:10.1111/1467-9868.00188.
  • Wei and Tanner [1990] Greg CG Wei and Martin A Tanner. A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704, 1990. doi:10.1080/01621459.1990.10474930.
  • Gelman [2006] Andrew Gelman. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis, 1(3):515–534, sep 2006. doi:10.1214/06-ba117a.
  • Müller [1991] Peter Müller. A generic approach to posterior integration and Gibbs sampling. Technical report, Purdue University, West Lafayette, Indiana, 1991.
  • Müller [1993] Peter Müller. Alternatives to the Gibbs sampling scheme. Technical report, Institute of Statistics and Decision Sciences, Duke University, 1993.
  • Gelman et al. [1997] A. Gelman, W. R. Gilks, and G. O. Roberts. Weak convergence and optimal scaling of random walk Metropolis algorithms. The Annals of Applied Probability, 7(1), feb 1997. doi:10.1214/aoap/1034625254.
  • Stan Development Team [2021] Stan Development Team. RStan: the R interface to Stan, 2021. URL https://mc-stan.org/. R package version 2.21.3.
  • Plummer et al. [2003] Martyn Plummer et al. JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing, volume 124, pages 1–10. Vienna, Austria, 2003.
  • Spiegelhalter et al. [1996] David Spiegelhalter, Andrew Thomas, Nicky Best, and Wally Gilks. BUGS 0.5: Bayesian inference using Gibbs sampling manual (version ii). MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK, pages 1–59, 1996.
  • Lunn et al. [2009] David Lunn, David Spiegelhalter, Andrew Thomas, and Nicky Best. The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28(25):3049–3067, 2009. doi:10.1002/sim.3680.
  • Sheather and Jones [1991] S.J. Sheather and Chris Jones. A reliable data-based bandwidth selection method for kernel density estimation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 53(3):683–690, 1991. doi:10.1111/j.2517-6161.1991.tb01857.x.
  • Mason and Schuenemeyer [1983] David M Mason and John H Schuenemeyer. A modified Kolmogorov-Smirnov test sensitive to tail alternatives. The Annals of Statistics, pages 933–946, 1983. doi:10.1214/aos/1176346259.
  • Agresti et al. [2021] Alan Agresti, Christine Franklin, and Bernhard Klingenberg. Statistics: The Art and Science of Learning from Data. Pearson Education, 5 edition, 2021.
  • Hjalmarsson and Lochner [2012] Randi Hjalmarsson and Lance Lochner. The impact of education on crime: International evidence. CESifo DICE Report, 10(2):49–55, 2012.
  • Lochner [2020] Lance Lochner. Education and crime. In The Economics of Education, pages 109–117. Elsevier, 2020. doi:10.1016/B978-0-12-815391-8.00009-4.
  • Feng et al. [2020] Yunlong Feng, Jun Fan, and Johan Suykens. A statistical learning approach to modal regression. Journal of Machine Learning Research, 21(2):1–35, 2020.
  • Xu et al. [2020] Jiamiao Xu, Fangzhao Wang, Qinmu Peng, Xinge You, Shuo Wang, Xiao-Yuan Jing, and CL Philip Chen. Modal-regression-based structured low-rank matrix recovery for multiview learning. IEEE Transactions on Neural Networks and Learning Systems, 32(3):1204–1216, 2020. doi:10.1109/TNNLS.2020.2980960.
  • Ullah et al. [2021] Aman Ullah, Tao Wang, and Weixin Yao. Modal regression for fixed effects panel data. Empirical Economics, 60(1):261–308, 2021. doi:10.1007/s00181-020-01999-w.
  • Wang and Li [2021] Kangning Wang and Shaomin Li. Robust distributed modal regression for massive data. Computational Statistics & Data Analysis, 160:107225, 2021. doi:10.1016/j.csda.2021.107225.
  • Xiang and Yao [2022] Sijia Xiang and Weixin Yao. Nonparametric statistical learning based on modal regression. Journal of Computational and Applied Mathematics, 409:114130, 2022. doi:10.1016/j.cam.2022.114130.
  • Liu et al. [2013] Jicai Liu, Riquan Zhang, Weihua Zhao, and Yazhao Lv. A robust and efficient estimation method for single index models. Journal of Multivariate Analysis, 122:226–238, 2013. doi:10.1016/j.jmva.2013.08.007.
  • Zhang et al. [2013] Riquan Zhang, Weihua Zhao, and Jicai Liu. Robust estimation and variable selection for semiparametric partially linear varying coefficient model based on modal regression. Journal of Nonparametric Statistics, 25(2):523–544, 2013. doi:10.1080/10485252.2013.772179.
  • Yang and Yang [2014] Hu Yang and Jing Yang. A robust and efficient estimation and variable selection method for partially linear single-index models. Journal of Multivariate Analysis, 129:227–242, 2014. doi:10.1016/j.jmva.2013.08.007.
  • Zhao et al. [2014] Weihua Zhao, Riquan Zhang, Jicai Liu, and Yazhao Lv. Robust and efficient variable selection for semiparametric partially linear varying coefficient model based on modal regression. Annals of the Institute of Statistical Mathematics, 66(1):165–191, 2014. doi:10.1007/s10463-013-0410-4.
  • Krief [2017] Jerome M Krief. Semi-linear mode regression. The Econometrics Journal, 20(2):149–167, 2017. doi:10.1111/ectj.12088.
  • Tian et al. [2017] Maoxi Tian, Jian He, and Keming Yu. Fitting truncated mode regression model by simulated annealing. In Computational Optimization in Engineering-Paradigms and Applications. Intech Open, 2017. doi:10.5772/66070.
  • Li and Huang [2019] Xiang Li and Xianzheng Huang. Linear mode regression with covariate measurement error. Canadian Journal of Statistics, 47(2):262–280, 2019. doi:10.1002/cjs.11492.
  • Frühwirth-Schnatter [2006] Sylvia Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer New York, NY, 1 edition, 2006. doi:10.1007/978-0-387-35768-3.